ABSTRACT

Title of dissertation: TRACE OBLIVIOUS PROGRAM EXECUTION
Chang Liu, Doctor of Philosophy, 2016

Dissertation directed by: Professor Michael Hicks
Department of Computer Science, University of Maryland
and Professor Elaine Shi
Department of Computer Science, Cornell University

The big data era has dramatically transformed our lives; however, security incidents such as data breaches can put sensitive data (e.g., photos, identities, genomes) at risk. To protect users’ data privacy, there is a growing interest in building secure cloud computing systems, which keep sensitive data inputs hidden, even from computation providers. Conceptually, secure cloud computing systems leverage cryptographic techniques (e.g., secure multiparty computation) and trusted hardware (e.g., secure processors) to instantiate a secure abstract machine consisting of a CPU and encrypted memory, so that an adversary cannot learn information through either the computation within the CPU or the data in the memory. Unfortunately, evidence has shown that side channels (e.g., memory accesses, timing, and termination) in such a secure abstract machine may leak highly sensitive information, including cryptographic keys that form the root of trust for the secure systems.

This thesis broadly expands the investigation of a research direction called trace oblivious computation, where programming language techniques are employed to prevent side channel information leakage. We demonstrate the feasibility of trace oblivious computation by formalizing and building several systems: GhostRider, a hardware-software co-design that provides a hardware-based trace oblivious computing solution; SCVM, an automatic RAM-model secure computation system; and ObliVM, a programming framework that helps programmers develop applications. All of these systems enjoy formal security guarantees while outperforming prior systems by one to several orders of magnitude.
TRACE OBLIVIOUS PROGRAM EXECUTION

by

Chang Liu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2016

Advisory Committee:
Professor Michael W. Hicks, Co-Chair/Co-Advisor
Professor Elaine Shi, Co-Chair/Co-Advisor
Professor Charalampos Papamanthou
Professor Zia Khan
Professor Lawrence C. Washington
Dedication

To my loving mother.
Acknowledgments

I am very fortunate to have spent three fantastic years at UMD. Throughout my journey to learn how to do research, many people have helped me. This dissertation would not have been possible without them.

First, I want to thank my advisors, Elaine Shi and Michael Hicks, for guiding me into this amazing academic world. Elaine and Mike have set examples for me through both precept and practice. I not only benefited from insightful technical discussions with them, but also learnt a lot about other aspects of research: how to pick an important problem to work on, how to criticize a piece of work, especially my own, and so on.

I have learnt from many people that a piece of research is an execution of great taste. I have learnt from Elaine a great taste in choosing research problems and pushing them closer to perfection. I have learnt from Mike a great taste in developing elegant theories to explain phenomena and in communicating ideas precisely and concisely. I have also learnt from Jon Froehlich, my proposal committee member, a great taste in giving presentations, and from Dawn Song, with whom I worked closely in my last year, a great taste in envisioning novel research directions. I am grateful to have worked with these fantastic researchers during my PhD life.

My committee members (my advisors, Babis Papamanthou, Zia Khan, and Larry Washington) have been extremely supportive. My thanks also go to all other members of the MC2 faculty. I am also fortunate to be surrounded by a diverse and inspiring group of (former and current) students at MC2.

I am fortunate to have worked with my collaborators. Many ideas in this dissertation could not have been executed so well without their help. They are, in no particular order, my advisors, Xiao Wang, Natik Kayak, Yan Huang, Martin Maas, Austin Harris, Mohit Tiwari and Jonathan Katz.

Now, I would like to thank some dear friends. Xi Chen, Yuening Hu, Ke Zhai, He He, and I spent a wonderful summer at Microsoft Research in Redmond, which was the most enjoyable summer of my PhD life. I am fortunate to be friends with Yulu Wang and Shangfu Peng, who made my early graduate life happy and substantial. Qian Wu and Hang Hu are my old friends who always support me in both life and research. Xi Yi and Yu Zhang brought both happiness and insightful suggestions into my life. I am happy to have introduced the two of them to each other.

Life at UMD would have been much more difficult without members of the administrative and technical staff. Jennifer Story has always been patient and helpful beyond the call of duty. Joe Webster has made resources available just as we needed them. Jodie Gray and Sharron McElroy have minimized bureaucracy when it comes to tax and money. Thank you all.

Last but in no way least, I thank my family, who have given me unconditional love throughout my life. My mother, to whom I owe a great deal for my academic career, has selflessly supported me in pursuing my dream. Words are not enough to express my gratitude to them. Finally, I thank Danqi Hu, who has always been there for me as an inspiring partner. Thank you for steering me to a better self.
# Table of Contents

## List of Figures


## 1 Introduction

1.1 Beyond Oblivious RAM
1.2 A hardware-software co-design for ensuring memory trace obliviousness
1.3 Automatic RAM-model secure computation
1.4 A programming framework for secure computation
1.5 Our Results and Contributions.

## 2 Memory Trace Obliviousness: Basic Setting

2.1 Threat Model
2.2 Motivating Examples
2.3 Approach Overview
2.4 Memory trace obliviousness by typing
  2.4.1 Operational semantics
  2.4.2 Memory trace obliviousness
  2.4.3 Security typing
  2.4.4 Examples
2.5 Compilation
  2.5.1 Type checking source programs
  2.5.2 Allocating variables to ORAM banks
  2.5.3 Inserting padding instructions
2.6 Evaluation
  2.6.1 Simulation Results
2.7 Conclusion Remarks

## 3 GhostRider: A Compiler-Hardware Approach

3.1 Introduction
  3.1.1 Our Results and Contributions.
3.2 Architecture and Approach
  3.2.1 Motivating example
  3.2.2 Threat model
List of Figures

2.1 Language syntax of $\mathcal{L}_{\text{basic}}$
2.2 Auxiliary syntax and functions for semantics in $\mathcal{L}_{\text{basic}}$
2.3 Operational semantics of $\mathcal{L}_{\text{basic}}$
2.4 Trace equivalence in $\mathcal{L}_{\text{basic}}$
2.5 Auxiliary syntax and functions for typing in $\mathcal{L}_{\text{basic}}$
2.6 Typing for $\mathcal{L}_{\text{basic}}$
2.7 Trace pattern equivalence in $\mathcal{L}_{\text{basic}}$
2.8 Finding a short padding sequence using the greatest common subsequence algorithm. An example with two abstract traces $T_t = [T_1; T_2; T_3; T_4; T_5]$ and $T_f = [T_1; T_3; T_2; T_4]$. One greatest common subsequence as shown is $[T_1; T_2; T_4]$. A shortest common super-sequence of the two traces is $T_{tf} = [T_1; T_3; T_2; T_3; T_4; T_5]$.
2.9 Simulation Results for Strawman, Opt 1, and Opt 2.

3.1 Motivating source program of GhostRider.
3.2 GhostRider architecture.
3.3 Syntax for $\mathcal{L}_{\text{GhostRider}}$ language, comprising (1) \texttt{ldb} and \texttt{stb} instructions that move data blocks between scratchpad and a specific ERAM or ORAM bank, and (2) scratchpad-to-register moves and standard RISC instructions.
3.4 $\mathcal{L}_{\text{GhostRider}}$ code implementing (part of) Figure 3.1
3.5 Symbolic values, labels, auxiliary judgments and functions
3.6 Trace patterns and their equivalence in $\mathcal{L}_{\text{GhostRider}}$
3.7 Security Type System for $\mathcal{L}_{\text{GhostRider}}$ (Part 1)
3.8 Security Type System for $\mathcal{L}_{\text{GhostRider}}$ (Part 2)
3.9 Security Type System for $\mathcal{L}_{\text{GhostRider}}$ (Part 3)
3.10 Simulator-based execution time results of GhostRider.
3.11 Legends of Figure 3.10
3.12 FPGA based execution time results: Slowdown of \textbf{Baseline} and \textbf{Final} versions compared to non-secure version of the program. Note that unlike Figure 3.10, \textbf{Final} uses only a single ORAM bank and conflates ERAM and DRAM (cf. Section 3.6).

4.1 Dijkstra’s shortest distance algorithm in source (Part)
4.2
4.3 Formal results in SCVM
4.4 Syntax of SCVM
4.5 Auxiliary syntax and functions for SCVM semantics
4.6 Operational semantics for expressions in SCVM
4.7 Operational semantics for statements in SCVM (Part 1)
4.8 Operational semantics for statements in SCVM (Part 2)
4.9 Type System for SCVM
4.10 SCVM vs. automated circuit-based approach (Binary Search)
4.11 SCVM vs. hand-constructed linear scan circuit (Binary Search)
4.12 Heap insertion in SCVM
4.13 Heap extraction in SCVM
4.14 KMP string matching for median n (fixing m = 50) in SCVM
4.15 KMP string matching for large n (fixing m = 50) in SCVM
4.16 Dijkstra’s shortest-path algorithm’s performance in SCVM
4.17 Dijkstra’s shortest-path algorithm’s speedup of SCVM
4.18 Aggregation over sliding windows’s performance in SCVM
4.19 Aggregation over sliding windows’s speedup in SCVM
4.20 SCVM’s savings by memory-trace obliviousness optimization (inverse permutation). The non-linearity (around 60) of the curve is due to the increase of the ORAM recursion level at that point.
4.21 Savings by memory-trace obliviousness optimization (Dijkstra)

5.1 Streaming MapReduce in ObliVM-Lang. See Section 5.3.1 for oblivious algorithms for the streaming MapReduce paradigm [36].
5.2 Oblivious stack by non-specialist programmers.
5.3 Code by expert programmers to help non-specialists implement oblivious stack.
5.4 Loop coalescing. The outer loop will be executed at most n times in total, and the inner loop will be executed at most m times in total (over all iterations of the outer loop). A naive compiler would pad the outer and inner loops to n and m respectively, incurring O(nm) cost. Our loop coalescing technique achieves O(n + m) cost instead.
5.5 Karatsuba multiplication in ObliVM-Lang.
5.6 Part of our Circuit ORAM implementation (Type Definition) in ObliVM-Lang.
5.7 Part of our Circuit ORAM implementation (ReadAndRemove) in ObliVM-Lang.
5.8 Sources of speedup in comparison with state-of-the-art in 2012 [44]: an in-depth look.

B.1 Well formedness judgments for proof of Memory-Trace Obliviousness of $\mathcal{L}_{\text{GhostRider}}$
B.2 Symbolic Execution in $\mathcal{L}_{\text{GhostRider}}$
Chapter 1: Introduction

Cloud computing allows users to outsource both data and computation to third-party cloud providers, and promises numerous benefits such as economies of scale, easy maintenance, and ubiquitous availability. These benefits, however, come at the cost of giving up physical control of one’s computing infrastructure and private data. Privacy concerns have held back government agencies and businesses alike from outsourcing their computing infrastructure to the public cloud [18,94].

To protect users’ data privacy, there is a growing trend to build secure cloud computing systems, which enable computation over two or more parties’ sensitive data while revealing nothing more than the results to the participating parties, and nothing at all to any third-party providers. Conceptually, secure cloud computing systems leverage cryptographic techniques (e.g., secure multiparty computation) and trusted hardware (e.g., secure processors) to instantiate a “secure” abstract machine consisting of a CPU and encrypted memory, so that an adversary cannot learn information through either the computation within the CPU or the data in the memory. Unfortunately, evidence has shown that side channels (e.g., memory accesses, timing, and termination) in such a “secure” abstract machine may leak highly sensitive information, including cryptographic keys that form the root of trust for the secure systems.

The thesis of this work is that programming language-based techniques—notably compilers and type systems—can be used to make programs secure, despite powerful adversaries with a fine-grained view of execution, by enforcing the property of trace obliviousness, which we define in this thesis.

1.1 Beyond Oblivious RAM

To cryptographically obfuscate memory access patterns, one can employ Oblivious RAM (ORAM) [33, 35], a cryptographic construction that makes memory address traces computationally indistinguishable from a random address trace. Encouragingly, recent theoretical breakthroughs [80, 84, 90] have allowed ORAM memory controllers to be built [28, 63]—these turn DRAM into oblivious, block-addressable memory banks.

The simplest way to deploy ORAM is to implement a single, large ORAM bank that contains all the code and data (assuming that the client can use standard PKI to safely transmit the code and data to a remote secure processor). A major drawback of this baseline approach is efficiency: every single memory-block access incurs the ORAM penalty, which is roughly (poly-)logarithmic in the size of the ORAM [35, 80]. In practice, this translates to almost 100× additional bandwidth that, even with optimizations, incurs a ∼10× latency cost per block [28, 63]. Another issue is that, absent any padding, the baseline approach reveals the total number of memory accesses made by a program, which can leak information about the secret.

In practice, we observe that many programs are intrinsically oblivious, meaning that the execution traces observed by an adversary do not leak information. For example, consider a program that computes the sum of an array of integers, which are to be hidden from the adversary. This program sequentially scans through the entire array, and assuming the array is encrypted in memory, its memory access pattern does not leak any information about the secret integers stored in the array. In this case, ORAM is not necessary.
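The observation about the summation program can be illustrated with a minimal sketch. The `TracedMemory` class and the `oblivious_sum` function are hypothetical names introduced only for this illustration; the sketch models the adversary's view as the list of addresses read and assumes the values themselves are hidden by encryption.

```python
class TracedMemory:
    """Toy encrypted memory: the adversary sees addresses, never the values."""

    def __init__(self, values):
        self._values = list(values)
        self.trace = []  # addresses observed by the adversary

    def read(self, address):
        self.trace.append(address)
        return self._values[address]


def oblivious_sum(memory, length):
    """Sum `length` cells with a fixed left-to-right scan."""
    total = 0
    for address in range(length):
        total += memory.read(address)
    return total


# Two different secret inputs of the same (public) length.
first = TracedMemory([3, 1, 4, 1, 5])
second = TracedMemory([9, 2, 6, 5, 3])
sum_first = oblivious_sum(first, 5)
sum_second = oblivious_sum(second, 5)

# The address traces coincide, so the trace reveals nothing about the values.
same_trace = first.trace == second.trace
```

Because the addresses depend only on the public length, both runs produce the same trace even though the secret contents and the results differ.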

Enabling such optimizations is not trivial, because a program is not always oblivious. For example, a program may mistakenly allocate, outside any ORAM bank, an array whose access pattern leaks sensitive information. Our approach is to formalize the property of memory trace obliviousness (MTO) and to design novel type systems that guarantee every well-typed program is memory trace oblivious.

A type system is a static analysis method that guarantees well-typed programs satisfy certain properties. Intuitively, the type systems we develop extend language-based information flow systems [79], which keep track of whether or not a variable used by a program contains sensitive information. Our extension further keeps track of whether the memory access patterns produced by a program’s execution will leak information to the adversary.
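As a rough intuition for the kind of check such a type system performs, consider the following sketch. It is a strong simplification and not the type system developed in this thesis: the two-point label lattice, the bank names `"ERAM"` and `"ORAM"`, and the function `check_array_read` are assumptions made for illustration only.

```python
PUBLIC, SECRET = 0, 1  # two-point security lattice, PUBLIC below SECRET


def join(*labels):
    """Least upper bound: a value is secret if any input is secret."""
    return max(labels, default=PUBLIC)


def check_array_read(index_label, bank):
    """Return the label of data[index], or raise if the address would leak.

    bank is "ERAM" (encrypted, addresses visible) or "ORAM" (addresses hidden).
    """
    if bank == "ERAM" and index_label == SECRET:
        raise TypeError("secret index into a non-oblivious bank leaks the address")
    # Contents of either bank are treated as secret in this sketch.
    return join(index_label, SECRET)


scan_ok = check_array_read(PUBLIC, "ERAM")    # sequential scan: accepted
lookup_ok = check_array_read(SECRET, "ORAM")  # secret lookup in ORAM: accepted
try:
    check_array_read(SECRET, "ERAM")          # secret lookup outside ORAM
    rejected = False
except TypeError:
    rejected = True
```

The first two reads are accepted, while the third is rejected because a secret-dependent address would be visible to the adversary.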

In the following, we discuss two main approaches to secure cloud computing: the secure-processor-based approach and the secure-computation-based approach.

1.2 A hardware-software co-design for ensuring memory trace obliviousness

To protect the confidentiality of sensitive data in the cloud, thwarting software attacks alone is necessary but not sufficient. An attacker with physical access to the computing platform (e.g., a malicious insider or intruder) can launch various *physical attacks*, such as tapping memory buses, plugging in malicious peripherals, or using cold-(re)boots [40,81]. Such physical attacks can uncover secrets even when the software stack is provably secure.

A secure processor enables memory encryption [32,56,85–87] to hide the contents of memory from direct inspection, but an adversary can still observe memory addresses transmitted over the memory bus. As argued above, however, the memory address trace is a side channel that can leak sensitive information. We thus exploit the above idea to enable memory obfuscation within the secure processor, and employ programming language techniques to optimize the program’s execution.

Deploying the above idea to build a memory trace oblivious system on realistic hardware architectures is non-trivial. First, as explained above, the entire memory needs to be split into ORAM banks, whose access addresses are obfuscated, and encrypted RAM, whose access addresses are not. Second, cache behavior can break a program’s memory trace oblivious execution, and it is hard to track statically; the compiler needs hardware support to control the cache behavior deterministically. Third, the type system needs to deal with assembly code directly, since otherwise the MTO property may be broken during compilation.

To tackle these problems, we present a hardware-software co-design called GhostRider. GhostRider’s hardware architecture supports both an encrypted RAM region and multiple ORAM banks, which programs can leverage to optimize performance. The GhostRider compiler translates a program written in a C-like high-level language into assembly code for GhostRider’s instruction set architecture (ISA). GhostRider provides a type checker over the assembly code to ensure that a well-typed program enjoys MTO. We explain GhostRider in detail in Chapter 3.

1.3 Automatic RAM-model secure computation

An alternative route to achieve secure cloud computing is through secure computation. Secure computation is a cryptographic technique that allows mutually distrusting parties to make collaborative use of their local data without harming privacy of their individual inputs. Since Yao’s seminal paper [96], research on secure two-party computation—especially in the semi-honest model we consider here—has flourished, resulting in ever more efficient protocols [11, 38, 52, 97] as well as several practical implementations [16, 43, 45, 46, 54, 65]. Since the first system for general-purpose secure two-party computation was built in 2004 [65], efficiency has improved substantially [11, 46].

Almost all previous implementations of general-purpose secure computation assume the underlying computation is represented as a circuit. While theoretical developments using circuits are sensible (and common), compiling typical programs, which assume a von Neumann-style Random Access Machine (RAM) model, to efficient circuits can be challenging. One significant challenge is handling *dynamic memory accesses* to an array in which the memory location being read or written depends on secret inputs. A typical program-to-circuit compiler makes an entire copy of the array upon every dynamic memory access, resulting in a huge circuit when the data size is large. Theoretically speaking, generic approaches for translating RAM programs into circuits incur, in general, an $O(TN)$ blowup in efficiency, where $T$ is an upper bound on the program’s running time and $N$ is the memory size.
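To see where the $O(TN)$ factor comes from, the following sketch mimics how a circuit can read an array at a secret index: since a circuit cannot jump to an address, it touches every cell and selects the wanted one. The helper names `mux` and `linear_scan_read` are hypothetical, and the plain Python integers stand in for values that would be encrypted or secret-shared in a real protocol.

```python
def mux(selector, if_one, if_zero):
    """Branch-free select, as a circuit would compute it (selector is 0 or 1)."""
    return selector * if_one + (1 - selector) * if_zero


def linear_scan_read(array, secret_index):
    """Read array[secret_index] by touching every cell.

    Returns (value, cells_touched).
    """
    result = 0
    touched = 0
    for position, cell in enumerate(array):
        is_target = int(position == secret_index)  # an equality gadget in a real circuit
        result = mux(is_target, cell, result)
        touched += 1
    return result, touched


table = [10, 20, 30, 40]                       # N = 4 memory cells
value, cost_per_access = linear_scan_read(table, 2)

# T = 5 dynamic accesses cost T * N cell visits in total.
total_cost = sum(linear_scan_read(table, i % len(table))[1] for i in range(5))
```

Each access costs $N$ cell visits regardless of the index, so a program performing $T$ such accesses pays on the order of $TN$ work.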

To address these limitations, researchers have more recently considered secure computation that works directly in the RAM model [38,62]. The key insight is to rely on ORAM to enable dynamic memory access with poly-logarithmic cost while preventing information leakage through memory-access patterns. Gordon et al. [38] observed a significant advantage of RAM-model secure computation (RAM-SC) in the setting of *repeated sublinear-time queries* (e.g., binary search) on a large database. By amortizing the setup cost over many queries, RAM-SC can achieve *amortized* cost asymptotically close to the run-time of the underlying program in the insecure setting.

To make secure computation practical, it is ideal to have a complete system in which developers implement applications in a high-level language (typically in the RAM model) rather than writing circuits directly, and a compiler translates the program into an efficient secure computation protocol while enforcing security.

While pursuing this goal, the MTO approach that we developed for GhostRider can be leveraged in this setting as well. In particular, we can use the same kind of MTO analysis to decide whether or not some data should be stored in an ORAM bank, and use a type system to ensure that a program is MTO. Secure computation, however, imposes additional security requirements. For example, secure computation requires the circuit to be evaluated by both parties, which means that both parties know the instruction being executed. Therefore, secure computation requires instruction trace obliviousness in addition to memory trace obliviousness. Further, since ORAM protocols are implemented as circuits in secure computation, programmers can also make non-blackbox use of ORAMs. For example, developers can directly use a tree-based non-recursive ORAM [90], a less expensive building block of ORAM, to achieve better performance.
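The need for instruction trace obliviousness can be illustrated with a small sketch. The instruction names (`"cmp"`, `"neg"`, `"select"`) and the functions below are hypothetical and only model the sequence of operations that both parties would see; they are not SCVM instructions.

```python
def select(condition_bit, when_true, when_false):
    """Arithmetic multiplexer; condition_bit must be 0 or 1."""
    return condition_bit * when_true + (1 - condition_bit) * when_false


def branching_abs(x, instructions):
    """Absolute value with a secret-dependent branch (instruction trace varies)."""
    instructions.append("cmp")
    if x < 0:
        instructions.append("neg")
        return -x
    return x


def oblivious_abs(x, instructions):
    """Absolute value where both outcomes are computed and one is selected."""
    instructions.extend(["cmp", "neg", "select"])
    return select(int(x < 0), -x, x)


leaky_neg, leaky_pos = [], []
branching_abs(-3, leaky_neg)
branching_abs(3, leaky_pos)

safe_neg, safe_pos = [], []
result_neg = oblivious_abs(-3, safe_neg)
result_pos = oblivious_abs(3, safe_pos)
```

With the branching version the two instruction traces differ, revealing the sign of the secret input; the version that computes both outcomes and selects one yields identical traces.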

To address these issues, we developed the SCVM system, which includes an SCVM intermediate representation, a compiler, and a security type system, to demonstrate how to achieve automatic and efficient RAM-model secure computation. We detail SCVM in Chapter 4.

1.4 A programming framework for secure computation

As a last contribution, we also deliver the ObliVM system as an extension on top of SCVM. It provides a programming framework with more expressive power and easier programmability to help developers write better algorithms more easily. ObliVM focuses on helping developers build secure computation applications. The design goal is to allow both cryptographic experts to improve the efficiency of low-level cryptographic protocols, and application developers who may not be familiar with cryptography to implement efficient applications in a high-level language. To meet these goals, we designed a new high-level programming language, called ObliVM-Lang, as an extension to SCVM, and implemented a compiler for it. Using this language, we can implement programming abstractions that enable application developers to implement algorithms in an easy and efficient way. ObliVM also provides a backend called ObliVM-SC that allows cryptographic experts to implement different protocols to further accelerate the execution of the whole system. We explain ObliVM in Chapter 5.

1.5 Our Results and Contributions.

In this thesis, we propose the trace oblivious computation theory and bring it to practice. We design and build GhostRider, a hardware/software platform for provably secure, memory-trace oblivious program execution, which can compile programs to a realistic architecture while formally ensuring MTO. We also design and build SCVM and ObliVM to enable developers to develop efficient secure computation applications. In summary, our contributions are:

**Trace obliviousness theory.** We greatly extend the study of trace oblivious program execution by providing a theory that establishes when a program is trace oblivious. In particular, we demonstrate how to formalize a language such that the adversary-observable execution traces generated by a program can be modeled. We also demonstrate how type systems can be used to enforce the trace obliviousness of programs in these languages. On the one hand, using these type systems, we extend the set of oblivious programs that can be verified automatically to include more efficient implementations. On the other hand, these type systems can be extended to analyze other side-channel leakages that can be expressed as traces, and thus used to defend against attacks leveraging these channels.

**GhostRider system.** GhostRider is the first system to bring trace oblivious computation theory to practice. In building GhostRider, we built the first memory-trace obliviousness compiler that emits target code for a realistic ORAM-capable processor architecture. GhostRider’s compiler optimizes the generated assembly code while ensuring that the optimized code still satisfies MTO. To enable these optimizations, GhostRider builds on the Phantom processor architecture [63] but exposes new features and knobs to the software. Our empirical results on a real processor demonstrate the feasibility of our architecture and show that, compared to the baseline approach of placing everything in a single ORAM bank, our compile-time static analysis achieves up to nearly an order-of-magnitude speedup for many common programs.

**SCVM system.** We build SCVM as the first system to enable automatic RAM-model secure computation. SCVM provides a complete system that takes a program written in a high-level language and compiles it to a protocol for secure two-party computation of that program. To achieve this goal, SCVM provides an intermediate representation, a type system ensuring that any well-typed program generates a protocol that is secure in the semi-honest model, and a compiler that transforms a program written in a high-level language into a secure two-party computation protocol while integrating compile-time optimizations crucial for improving performance. Our evaluation shows a speedup of 1–2 orders of magnitude compared to standard circuit-based approaches for securely computing the same programs.

**ObliVM system.** Building on top of SCVM, we design and implement ObliVM, which focuses on richer expressive power and easier programmability while achieving state-of-the-art performance for secure computation. ObliVM provides an expressive programming language that allows both cryptographic experts and non-experts to customize front-end and back-end optimizations very easily. Using this programming language, we implement several programming abstractions in ObliVM to help developers design and implement new efficient oblivious algorithms. Experiments show that the automatically generated circuits incur only 0.5% to 2% overhead over manually optimized implementations.
Chapter 2: Memory Trace Obliviousness: Basic Setting

In this chapter, we discuss a simple setting for memory trace oblivious program execution that still captures the essential ideas. We consider a simple client-server scenario: a program runs on the client, but the data it operates on is stored on the server. The goal is to ensure that the server cannot learn any sensitive information from either the data stored on the server or its interaction with the client-side program. We assume the server manages the data as one big memory, so that the only way a client program can interact with the server is through memory read/write APIs. In particular, a read call `read(i)` requests the data block indexed by `i` from the server, and a write call `write(i, data)` updates the data block indexed by `i` with `data`. The server can thus observe `i` and `data` during such interactions.
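The interface just described can be sketched as follows. The `Server` class and its `observed` log are hypothetical names used only to make the server's view explicit; the byte strings stand in for encrypted blocks.

```python
class Server:
    """Untrusted storage: sees each operation, index, and (encrypted) block."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.observed = []  # everything the server learns

    def read(self, i):
        self.observed.append(("read", i))
        return self.blocks[i]

    def write(self, i, data):
        self.observed.append(("write", i, data))
        self.blocks[i] = data


# The client uploads four blocks, then reads at an index derived from a secret.
server = Server(4)
for i in range(4):
    server.write(i, bytes([i]))

secret_index = 2
block = server.read(secret_index)
leaked = server.observed[-1]  # the server's view of the last operation
```

Even if the block contents are encrypted, the last observation contains the index itself, which is exactly the leakage that the rest of this chapter aims to prevent.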

This chapter explains how we can use ORAM to prevent the server from learning any information through these interactions, and how a compiler and a type system can help enforce this security property over a program automatically while preserving efficiency.

Before we go into the details, we want to emphasize that although this setting has interesting applications in its own right, it can be extended to more applications, such as secure processors, where the CPU and RAM correspond to the client and server respectively, and secure multi-party computation. We discuss these extensions in Chapters 3, 4, and 5. In this chapter, we explain the basic ideas and techniques for achieving trace oblivious computation, which are further extended in later chapters.

This chapter is based on a paper that I co-authored with Michael Hicks and Elaine Shi [58]. I developed the formalism and carried out the proof of the main memory trace obliviousness theorem with the help of Michael Hicks, and provided preliminary experimental results on performance.

2.1 Threat Model

A server stores all the data, and a client runs programs that interact with the server to retrieve data. We assume that the client has small local storage. The server manages the data as a random access memory (RAM), which supports read and write operations at arbitrary addresses. The adversarial model assumes that the server can observe all addresses that the client program accesses, but not the client program’s internal state. We assume the data stored on the server are all encrypted, so that the server cannot observe the data directly. The client program decrypts the data once retrieved, and re-encrypts them before uploading them to the server (via write operations). We consider the honest-but-curious model, in which the server does not modify the stored data. Orthogonal techniques, such as Merkle hash trees, can be used to enforce data integrity.

Though they are a real threat, timing and other covert channels are not considered in this basic setting. Later, we will show how our GhostRider system (Chapter 3) and ObliVM (Chapter 5) can prevent leakage through these channels.

2.2 Motivating Examples

Let us consider the following program, where the client does not have enough space to store a big dataset, but offloads it to the server instead.

1: int findmax(public int n, secret int* data) {
    secret int max = data[0];
    for (public int i=1; i<n; ++i) {
        if (data[i] > max)
            max = data[i];
    }
}

In this client program, data refers to the client’s data stored on the server. The keyword secret denotes that this data is sensitive. A straightforward way to ensure that the server learns nothing from the addresses is for the client program to run an ORAM protocol with the server and store all data in one giant ORAM. This approach, however, has two drawbacks. First, it may not be secure, because the total number of memory accesses may still leak information. In the above example, from the total number of accesses the server will learn how many times line 5 is executed, and can therefore infer information such as whether \texttt{data} is in ascending order.

To mitigate this problem, the client program needs to perform dummy accesses to the server even when the condition at line 4 is not satisfied. To this end, the program needs to be rewritten so that a dummy statement \texttt{dummy\_max = data[i]} is inserted in the false branch of line 4, where \texttt{dummy\_max} is a fresh dummy variable that has no side effect on other data such as \texttt{max} and \texttt{data}.

Second, this approach is not efficient, as it incurs unnecessary ORAM overhead. In particular, this program sequentially scans through the entire \texttt{data} array, and keeps the maximal value in \texttt{max}. In this case, ORAM is not necessary, since the data access addresses, i.e. \texttt{i}, are public information which can be inferred by the server without knowing any information about the secret data.
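The leak through the number of accesses, and the effect of the dummy read, can be illustrated with a small simulation. This is only a sketch: the names `findmax_trace` and `padded` are hypothetical, the server is modeled as a Python list, and the trace records just the addresses that the server would observe.

```python
def findmax_trace(data, padded):
    """Run findmax over `data`; return (max, list of addresses read)."""
    trace = []

    def read(i):
        trace.append(i)        # the server sees the address, not the value
        return data[i]

    current_max = read(0)
    for i in range(1, len(data)):
        if read(i) > current_max:
            current_max = read(i)     # second read, only on the true branch
        elif padded:
            _dummy_max = read(i)      # dummy read so both branches look alike
    return current_max, trace


ascending = [1, 2, 3, 4]
descending = [4, 3, 2, 1]

# Without padding, the trace length depends on the secret contents.
leaky_asc = findmax_trace(ascending, padded=False)[1]
leaky_desc = findmax_trace(descending, padded=False)[1]

# With padding, both inputs yield the same sequence of addresses.
safe_asc = findmax_trace(ascending, padded=True)[1]
safe_desc = findmax_trace(descending, padded=True)[1]
```

In the padded version the addresses are the public loop indices only, which is why plain encryption of `data` suffices for this program.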

2.3 Approach Overview

We now present several technical highlights of our approach.

\textbf{Memory Trace Obliviousness.} The adversary can observe the stream of accesses to the server, even if he cannot observe the content of those accesses, and such observations are sufficient to infer secret information. To eliminate this channel of information, we need a way to run the program so that the event stream does not depend on the secret data: no matter the values of the secret, the observable events will be the same. Programs that exhibit this behavior enjoy a property we call \textit{memory trace obliviousness}.

\textbf{Padding.} Toward ensuring memory trace obliviousness, the compiler can add
padding instructions to either or both branches of if statements whose guards reference secret information (we refer to such guards as high guards). This idea is similar to inserting padding to ensure uniform timing [6, 10, 22, 42]. We need to insert dummy accesses to the two branches such that both branches have equivalent access patterns. We will detail our approach in Section 2.5.

**ORAM for secret data.** We store secret data in Oblivious RAM (ORAM), and extend our trusted computing base with a client-side ORAM library. This library will encrypt/decrypt the secret data and maintain a mapping between addresses for variables used by the program and actual storage addresses for those variables on the server. For each read/write issued for a secret address, the ORAM library will issue a series of reads/writes to the server with actual addresses, which has the effect of hiding which of the accesses was the real address. Moreover, with each access, the ORAM library will shuffle program/storage address mappings so that the physical location of any program variable is constantly in flux. Asymptotically, each ORAM access incurs a polylogarithmic blowup in the size of the ORAM [35]. Note that if we were concerned about integrity, we could compose the ORAM library with machinery for, say, authenticated accesses.

**Multiple ORAM banks.** ORAM can be an order of magnitude slower than regular DRAM [83]. Moreover, larger ORAM banks containing more variables incur higher overhead than smaller ORAM banks [35, 80]; as mentioned above, ORAM accesses are asymptotically related to the size of the ORAM. Thus we can reduce run-time overhead by allocating code/data in multiple, smaller ORAM banks rather than all of it in a single, large bank.
**Arrays.** Implicitly we have assumed that all of an array is allocated to the same ORAM bank, but this need not be the case. Indeed, for our example it is safe to simply encrypt the contents of `data[i]` because knowing which memory address we are accessing does not happen to reveal anything about the contents of `data[]`, as we have explained before.

If we allocate each array element in a separate ORAM bank, the running time of the program becomes roughly $2n$ accesses: each access to `data[i]` is in a bank of size 1. Each iteration will read `data` twice, and thus there are $2n$ accesses in total for $n$ iterations.

In comparison, the naïve strategy of allocating all variables in a single ORAM bank would incur $2n \cdot \operatorname{polylog}(n + 2)$ memory accesses (for secret variables), since each access to an ORAM bank of size $m$ requires $O(\operatorname{polylog}(m))$ actual server accesses. This shows that we can achieve asymptotic gains in performance for some programs.
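To make the gap concrete, the two access counts can be compared numerically. The sketch below is illustrative only: the text leaves the polylogarithmic factor unspecified, so the function `polylog` (here $\lceil \log_2 m \rceil^2$) is a stand-in chosen for the example, not a property of any particular ORAM construction.

```python
def polylog(m):
    """Hypothetical per-access cost of an ORAM bank of size m: ceil(log2 m)^2."""
    ceil_log2 = (m - 1).bit_length() if m > 1 else 1
    return ceil_log2 ** 2


def accesses_separate_banks(n):
    """Each array element in its own bank of size 1: two reads per iteration."""
    return 2 * n


def accesses_single_bank(n):
    """All n elements plus two scalars in one bank of size n + 2."""
    return 2 * n * polylog(n + 2)


n = 1022
separate = accesses_separate_banks(n)
single = accesses_single_bank(n)
ratio = single // separate   # the per-access blowup under the assumed cost model
```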

### 2.4 Memory trace obliviousness by typing

This section formalizes a type system for verifying that programs enjoy memory trace obliviousness. In the next section we describe a compiler to transform programs like the one in Section 2.2 so they can be verified by our type system.

We formalize our type system using a simple language $\mathcal{L}_{\text{basic}}$ presented in Figure 2.1. A program is a statement $S$ which can be a sequence $S; S$, no-op `skip`, assignments to variables and arrays, conditionals, and loops. Expressions $e$ consist
of constant natural numbers, variable and array reads, and (compound) operations. For simplicity, arrays may contain only integers (and not other arrays), and bulk assignments between arrays (i.e., $x := y$ when $y$ is an array) are not permitted.

### 2.4.1 Operational semantics

We define a big-step operational semantics for $\mathcal{L}_{\text{basic}}$ in Figure 2.3, which refers to auxiliary functions and syntax defined in Figure 2.2. Big-step semantics is simpler than the small-step alternative, and though it cannot be used to reason about non-terminating programs, our cloud computing scenario generally assumes that programs terminate. The main judgment of Figure 2.3, \( \langle M, S \rangle \downarrow_{t} M' \) (shown at the bottom), indicates that program \( S \) when run under memory \( M \) will terminate with new memory \( M' \) and in the process produce a memory access trace \( t \). We also define judgments for evaluating statements and expressions; the latter is written \( \langle M, e \rangle \downarrow_{t} n \).

Figure 2.3: Operational semantics of $\mathcal{L}_{\text{basic}}$

We model server-side storage as memory \( M \), which is a partial function from variables to labeled values, where a value is either an array \( m \) or a number \( n \), and a label is either \( L \) or an ORAM bank identifier \( o \). Thus we can think of an ORAM bank \( o \) (managed by the client-side ORAM library) as containing all data for variables \( x \) such that \( M(x) = (\_, o) \), whereas all data labeled \( L \) is stored on the server directly. We model an array \( m \) as a partial function from natural numbers to natural numbers. We write \(|m|\) to model the length of the array; that is, if \(|m| = n\) then \( m(i) \) is defined for \( 0 \leq i < n \) but nothing else. To keep the formalism simple, we assume all of the data in an array is stored in the same place, i.e., all on the server directly or all in the same ORAM bank. We also assume that data referred to by non-array variables are stored on the server as well. We will show how these assumptions can be relaxed while describing GhostRider (Chapter 3) and ObliVM (Chapter 5).

A memory access trace \( t \) is a finite sequence of events arising during program execution that are observable to the server. These events include read events \( \text{read}(x, n) \) which states that number \( n \) was read from variable \( x \) and \( \text{read}(x, n_1, n_2) \), which states number \( n_2 \) was read from \( x[n_1] \). The corresponding events for writes to variables and arrays are similar, but refer to the number written, rather than read. Event \( o \) indicates an access to ORAM—only the storage bank \( o \) is discernable, not
the precise variable involved or even whether the access is a read or a write. (Each ORAM read/write in the program translates to several actual DRAM accesses, but we model them as a single abstract event.) Finally, $t@t$ represents the concatenation of two traces and $\epsilon$ is the empty trace.

The rules in Figure 2.3 are largely straightforward. Rule (E-Var) defines variable reads by looking up the variable in memory, and then emitting an event consonant with the label on the variable’s memory. This is done using the $evt$ function defined in Figure 2.2: if the label is some ORAM bank $o$ then event $o$ will be emitted, otherwise event $\text{read}(x, n)$ is emitted since the access is to the server directly.

The semantics treats array accesses as “oblivious” to avoid information leakage due to out-of-bounds indexes. In particular, rule (E-Arr) indexes the array using auxiliary function $\text{get}$, also defined in Figure 2.2, that returns 0 if the index $n$ is out of bounds. Rule (S-AAsn) uses the $\text{upd}$ function similarly: if the write is out of bounds, then the array is not affected.\footnote{The syntax $m[n_1 \mapsto n_2]$ defines a partial function $m'$ such that $m'(n_1) = n_2$ and otherwise $m'(n) = m(n)$ when $n \neq n_1$. We use the same syntax for updating memories $M$.} We could have defined the semantics to throw an exception, or result in a stuck execution, but this would add unnecessary complication. Supposing we had such exceptions, our semantics models wrapping array reads and writes with a try-catch block that ignores the exception, which is a common pattern, e.g., in Jif [19, 49], and has also been advocated by Deng and Smith [26].
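The auxiliary functions described above are simple enough to sketch directly. Figure 2.2 is the authoritative definition; the version below is a reading of the prose, with arrays modeled as Python lists, the label `L` as the string `"L"`, and events as tuples, all of which are representation choices made for this example.

```python
L = "L"  # label for data stored directly (encrypted) on the server


def evt(label, event):
    """ORAM data reveals only its bank; other data reveals the concrete event."""
    return event if label == L else label


def get(m, n):
    """Oblivious array read: an out-of-bounds index yields 0."""
    return m[n] if 0 <= n < len(m) else 0


def upd(m, n1, n2):
    """Oblivious array write: an out-of-bounds write leaves the array unchanged."""
    if 0 <= n1 < len(m):
        return m[:n1] + [n2] + m[n1 + 1:]
    return m
```

Because neither `get` nor `upd` can fail, an out-of-bounds index never produces an observable exception.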

The rule (S-Cond) for conditionals is the obvious one; we write $\text{ite}(x, y, z)$ to denote $y$ when $x$ is 0, and $z$ otherwise. Rule (S-WhileT) expands the loop one
unrolling when the guard is true and evaluates that to the final memory, and rule
(S-WhileF) does nothing when the guard is false. Finally rule (P-Stmts) handles
sequences of statements.

2.4.2 Memory trace obliviousness

We call the security property of interest in our setting memory trace obliviousness.
This property generalizes the standard (termination-sensitive) noninterference
property to account for memory traces. Intuitively, a program satisfies memory trace
obliviousness if it will always generate the same trace (and the same final memory)
for the same adversary-visible memories, no matter the particular values stored in
ORAM. We formalize the property in three steps. First we define what it means
for two memories to be low-equivalent, which holds when they agree on memory
contents having label $L$.

**Definition 1** (Low equivalence). Two memories $M_1$ and $M_2$ are low-equivalent,
denoted as $M_1 \sim_L M_2$, if and only if $\forall x, v. M_1(x) = (v, L) \iff M_2(x) = (v, L)$.

Next, we define the notion of the $\Gamma$-validity of a memory $M$. Here, $\Gamma$ is the
type environment that maps variables to security types $\tau$, which are either Nat $l$
or Array $l$ (both are defined in Figure 2.5). In essence, $\Gamma$ indicates a mapping of
variables to memory banks, and if memory $M$ employs that mapping then it is valid
with respect to $\Gamma$.

**Definition 2** (Γ-validity). A memory $M$ is valid under an environment $\Gamma$, or $\Gamma$-valid, if and only if, for all \( x \)

\[ \Gamma(x) = \text{Nat } l \iff \exists n \in \text{Nat}. M(x) = (n, l) \]

\[ \Gamma(x) = \text{Array } l \iff \exists m \in \text{Arrays}. M(x) = (m, l) \]

\[ \epsilon @ t \equiv t @ \epsilon \equiv t \]

\[ \frac{t_1 = t_2}{t_1 \equiv t_2} \]

\[ \frac{t_1 \equiv t_2 \quad t_2 \equiv t_3}{t_1 \equiv t_3} \]

\[ \frac{t_1 \equiv t'_1 \quad t_2 \equiv t'_2}{t_1 @ t_2 \equiv t'_1 @ t'_2} \]

\[ (t_1 @ t_2) @ t_3 \equiv t_1 @ (t_2 @ t_3) \]

Figure 2.4: Trace equivalence in \( \mathcal{L}_{\text{basic}} \)

Finally, we define memory trace obliviousness. Intuitively, a program enjoys this property if all runs of the program on low-equivalent, \( \Gamma \)-valid memories will always produce the same trace and low-equivalent final memory.

**Definition 3** (Memory trace obliviousness). Given a security environment \( \Gamma \), a program \( S \) satisfies \( \Gamma \)-memory trace obliviousness if for any two \( \Gamma \)-valid memories \( M_1 \sim_L M_2 \), if \( \langle M_1, S \rangle \downarrow_{t_1} M'_1 \) and \( \langle M_2, S \rangle \downarrow_{t_2} M'_2 \), then \( t_1 \equiv t_2 \) and \( M'_1 \sim_L M'_2 \).

In this definition, we write \( t_1 \equiv t_2 \) to denote that \( t_1 \) and \( t_2 \) are equivalent. Equivalence is defined formally in Figure 2.4. Intuitively, two traces are equivalent if they are syntactically equivalent or we can apply associativity to transform one into the other. Furthermore, \( \epsilon \) plays the role of the identity element.
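Definitions 1 and 2 translate almost directly into executable checks, which can be handy when testing an interpreter against Definition 3. The sketch below assumes a concrete representation that is not part of the formalism: a memory is a dictionary from variable names to `(value, label)` pairs, arrays are Python lists, and an environment maps each variable to a `(kind, label)` pair with `kind` either `"Nat"` or `"Array"`.

```python
L = "L"


def low_equivalent(m1, m2):
    """Definition 1: the two memories agree on everything labeled L."""
    low1 = {x: v for x, (v, label) in m1.items() if label == L}
    low2 = {x: v for x, (v, label) in m2.items() if label == L}
    return low1 == low2


def gamma_valid(memory, gamma):
    """Definition 2: every variable has the kind and label that gamma assigns."""
    if memory.keys() != gamma.keys():
        return False
    for x, (kind, label) in gamma.items():
        value, actual_label = memory[x]
        is_array = isinstance(value, list)
        if actual_label != label or is_array != (kind == "Array"):
            return False
    return True


gamma = {"p": ("Nat", L), "s": ("Array", "o1")}
m1 = {"p": (3, L), "s": ([1, 2], "o1")}
m2 = {"p": (3, L), "s": ([9, 9], "o1")}   # differs from m1 only in ORAM contents
m3 = {"p": (4, L), "s": ([1, 2], "o1")}   # differs from m1 in public data
```

Here `m1` and `m2` form a pair of inputs to which Definition 3 applies, while `m1` and `m3` do not.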

### 2.4.3 Security typing

Figure 2.6 presents a type system that aims to ensure memory trace obliviousness. Auxiliary definitions used in the type rules are given in Figure 2.5.

Types

\[ \tau ::= \text{Nat } l \mid \text{Array } l \]

Environments

\[ \Gamma \in \text{Vars} \rightarrow \text{Types} \]

Trace patterns

\[ T ::= \text{Read } x \mid \text{Write } x \mid \text{Readarr } x \mid \text{Writearr } x \mid \text{Loop}(T,T) \mid o \mid T@T \mid T + T \mid \epsilon \]

\[ l_1 \sqcup l_2 = \begin{cases} l_1 & \text{if } l_1 \neq L \\ l_2 & \text{otherwise} \end{cases} \]

\[ l_1 \sqsubseteq l_2 \text{ iff } \begin{cases} l_1 = L \text{ or } \\ l_2 \neq L \text{ and } l_2 \neq n \end{cases} \]

\[ \text{select}(T_1, T_2) = \begin{cases} T_1 & \text{if } T_1 \sim_L T_2 \\ T_1 + T_2 & \text{otherwise} \end{cases} \]

Figure 2.5: Auxiliary syntax and functions for typing in \( \mathcal{L}_\text{basic} \)
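The join and ordering on labels can be sketched as follows. The sketch reads the ordering as "$L$ is below every label, and all ORAM bank labels are treated alike", which matches the later remark that a public index into a secret array is allowed but not the converse; any further side condition in Figure 2.5 is not modeled. The names `join` and `flows_to` are chosen for this example.

```python
L = "L"  # the public label; any other string names an ORAM bank


def join(l1, l2):
    """l1 ⊔ l2: keep l1 unless it is public, in which case take l2."""
    return l1 if l1 != L else l2


def flows_to(l1, l2):
    """l1 ⊑ l2 under the assumed reading: public flows anywhere, secret only to secret."""
    return l1 == L or l2 != L
```

For instance, `flows_to(L, "o1")` holds while `flows_to("o1", L)` does not.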

The typing judgment for expressions is written \( \Gamma \vdash e : \tau; T \), which states that in environment \( \Gamma \), expression \( e \) has type \( \tau \), and when evaluated will produce a trace described by the trace pattern \( T \). The judgments for statements \( s \) and programs \( S \) are similar. Trace patterns describe families of run-time traces; we write \( t \in T \) to say that trace \( t \) matches the trace pattern \( T \).

Trace pattern elements are quite similar to their trace counterparts: ORAM accesses are the same, as are empty traces and trace concatenation. Trace pattern events for reads and writes to variables and arrays are more abstract, mentioning the variable being read, and not the particular value (or index, in the case of arrays); we have \( \text{read}(x, n) \in \text{Read}(x) \) for all \( n \), for example. There is also the or-pattern \( T_1 + T_2 \), which matches traces \( t \) such that either \( t \in T_1 \) or \( t \in T_2 \). Finally, the trace pattern for loops, \( \text{Loop}(T_1, T_2) \), denotes the set of patterns \( T_1 \) and \( T_1@T_2@T_1 \) and \( T_1@T_2@T_1@T_2@T_1 \) and so on, and thus matches any trace that matches one of them.

\[ \Gamma \vdash e : \tau; T \]

\[ \text{(T-Var)} \quad \frac{\Gamma(x) = \text{Nat } l \quad T = \text{evt}(l, \text{Read}(x))}{\Gamma \vdash x : \text{Nat } l; T} \]

\[ \text{(T-Con)} \quad \frac{}{\Gamma \vdash n : \text{Nat } L; \epsilon} \]

\[ \text{(T-Op)} \quad \frac{\Gamma \vdash e_1 : \text{Nat } l_1; T_1 \quad \Gamma \vdash e_2 : \text{Nat } l_2; T_2 \quad l = l_1 \sqcup l_2}{\Gamma \vdash e_1 \text{ op } e_2 : \text{Nat } l; T_1 @ T_2} \]

\[ \text{(T-Arr)} \quad \frac{\Gamma(x) = \text{Array } l \quad \Gamma \vdash e : \text{Nat } l'; T \quad l' \sqsubseteq l \quad T' = \text{evt}(l, \text{Readarr}(x))}{\Gamma \vdash x[e] : \text{Nat } (l \sqcup l'); T @ T'} \]

\[ \text{(T-Skip)} \quad \frac{}{\Gamma, l_0 \vdash \text{skip}; \epsilon} \]

\[ \text{(T-Asn)} \quad \frac{\Gamma \vdash e : \text{Nat } l; T \quad \Gamma(x) = \text{Nat } l' \quad l_0 \sqcup l \sqsubseteq l'}{\Gamma, l_0 \vdash x := e; T @ \text{evt}(l', \text{Write}(x))} \]

\[ \text{(T-AAsn)} \quad \frac{\Gamma \vdash e_1 : \text{Nat } l_1; T_1 \quad \Gamma \vdash e_2 : \text{Nat } l_2; T_2 \quad \Gamma(x) = \text{Array } l \quad l_1 \sqcup l_2 \sqcup l_0 \sqsubseteq l}{\Gamma, l_0 \vdash x[e_1] := e_2; T_1 @ T_2 @ \text{evt}(l, \text{Writearr}(x))} \]

\[ \text{(T-Cond)} \quad \frac{\Gamma \vdash e : \text{Nat } l; T \quad \Gamma, l \sqcup l_0 \vdash S_i; T_i \ (i = 1, 2) \quad l \sqcup l_0 \neq L \Rightarrow T_1 \sim_L T_2 \quad T' = \text{select}(T_1, T_2)}{\Gamma, l_0 \vdash \text{if}(e, S_1, S_2); T @ T'} \]

\[ \text{(T-While)} \quad \frac{\Gamma \vdash e : \text{Nat } l; T_1 \quad \Gamma, l_0 \vdash S; T_2 \quad l \sqcup l_0 \sqsubseteq L}{\Gamma, l_0 \vdash \text{while}(e, S); \text{Loop}(T_1, T_2)} \]

\[ \text{(T-Seq)} \quad \frac{\Gamma, l_0 \vdash S_1; T_1 \quad \Gamma, l_0 \vdash S_2; T_2}{\Gamma, l_0 \vdash S_1; S_2; T_1 @ T_2} \]

Figure 2.6: Typing for \( \mathcal{L}_{\text{basic}} \)

Turning to the rules, we can see that each one is structurally similar to the corresponding semantics rule. Each rule likewise uses the \textit{evt} function (Figure 2.2) to selectively generate an ORAM event \( o \) or a basic event, depending on the label of the variable being read/written. Rule (T-Var) thus generates a \( \text{Read}(x) \) pattern if \( x \)'s label is \( L \), or generates the ORAM event \( l \) (where \( l \neq L \) implies \( l \) is some bank \( o \)). As expected, constants \( n \) are labeled \( L \) by (T-Con), and compound expressions are labeled with the join of the labels of the respective sub-expressions by (T-Op). Rule (T-Arr) is interesting in that we require \( l' \sqsubseteq l \), where \( l' \) is the label of the index and \( l \) is the label of the array, but the label of the resulting expression is the join of the two. As such, we can have a public index of a secret array, but not the other way around. This is permitted because of our oblivious semantics: a public index reveals nothing about the length of the array when the returned result is secret, and no out-of-bounds exception is possible.

The judgment for statements \( \Gamma, l_0 \vdash S; T \) is similar to the judgment for expressions, but there is no final type, and it employs the standard \textit{program counter (PC)} label \( l_0 \) to prevent implicit flows. In particular, the (T-Asn) and (T-AAsn) rules both require that the label \( l \) of the expression on the right-hand side, when joined
with the program counter label $l_0$, must be lower than or equal to the label $l'$ of the
variable; with arrays, we must also join with the label $l_1$ of the indexing expression.
Rule (T-Cond) checks the statements $S_i$ under the program counter label that is
at least as high as the label of the guard. As such, coupled with the constraints
on assignments, any branch on a high-security expression will not leak information
about that expression via an assignment to a low-security variable. In a similar way,
rule (T-Lab) requires that the statement location $p$ is lower or equal to the program
counter label, so that a public instruction fetch cannot be the source of an implicit
flow.

Rule (T-Cond) also ensures that if the PC label or that of the guard expression
is secret, then the actual run-time trace of the true branch (matched by the trace
pattern $T_1$) and the false branch (pattern $T_2$) must be equal; if they were not, then
the difference would reveal something about the respective guard. We ensure run-
time traces will be equal by requiring the trace patterns $T_1$ and $T_2$ are equivalent,
as axiomatized in Figure 3.6. The first two rows prove that $\epsilon$ is the identity, that
$\sim_L$ is a transitive relation, and that concatenation is associative. The third row
unsurprisingly proves that ORAM events to the same bank and fetches of the same
location/bank are equivalent. More interestingly, the third row claims that public
reads to the same variable are equivalent. This makes sense given that public writes
are not equivalent. As such, reads in both branches will always return the same
run-time value they had prior to the conditional. Notice that the public reads to
the same arrays are also not equivalent, since indices may leak information. Finally,
rule (T-Cond) emits trace pattern $T'$, which according to the select function (Figure 2.5) will
be $T_1$ when the two are equivalent. As such, conditionals in a high context will never produce or-pattern traces (which are not equivalent to any other trace pattern).

In Rule (T-While), the constraint $l \sqcup l_0 \sqsubseteq L$ mandates that loop guards be public (which is why we need not join $l_0$ with $l$ when checking the body $S$). This constraint ensures that the length of the trace as related to the number of loop iterations cannot reveal something about secret data. Fortunately, this restriction is not problematic for many examples because secret arrays can be safely indexed by public values, and thus looping over arrays reveals no information about them.

Finally, we can prove that well typed programs enjoy memory trace obliviousness.

**Theorem 1.** If $\Gamma, l \vdash S; T$, then $S$ satisfies memory trace obliviousness.

The full proof can be found in Appendix A.

### 2.4.4 Examples

Now we consider a few programs that do and do not type check in our system. In the examples, public (low security) variables begin with $p$, and secret (high security) variables begin with $s$; we assume each secret variable is allocated in its own ORAM bank (and ignore statement labels).
There are some interesting differences in our type system and standard information flow type systems. One is that we prohibit low reads under high guards that could differ in both branches. For example, the program if $s > 0$ then $s := p_1$ else $s := p_2$ is accepted in the standard type system but rejected in ours. This is because in our system we allow the adversary to observe public reads, and thus he can distinguish the two branches, whereas an adversary can only observe public writes in the standard noninterference proof. On the other hand, the program if $s > 0$ then $s := p+1$ else $s := p+2$ would be accepted, because both branches will exhibit the same trace.

Another difference is that we do not allow high guards in loops, so a program like the following, which is acceptable in the standard type system, is rejected in ours:

```plaintext
s := slen; sum := 0;
while s >= 0 do
    sum := sum + sarr[p];
    s := s - 1;
done
```

The reason we reject this program is that the number of loop iterations, which in general cannot be determined at compile time, could reveal information about the secret at run-time. In this example, the adversary will observe $O(s)$ memory events and thus can infer $slen$ itself. Prior work on mitigating timing channels often makes the same restriction for the same reason [6, 10, 22, 42]. Similarly, we can mitigate the restrictiveness of our type system by padding out the number of iterations to a
constant value. For example, we could transform the above program to be instead

\[
\begin{array}{l}
p := N; \quad \text{sum} := 0; \\
\text{while } p \geq 0 \text{ do} \\
\quad \text{if } p < \text{slen} \text{ then } \text{sum} := \text{sum} + \text{sarr}[p]; \\
\quad \text{else } \text{sdummy} := \text{sdummy} + \text{sarr}[p]; \\
\quad p := p - 1; \\
\text{done}
\end{array}
\]

Here, \(N\) is some constant and \(\text{sdummy}\) and \(\text{sum}\) are allocated in the same ORAM bank. The loop will always iterate \(N\) times but will compute the same \(\text{sum}\) assuming \(N \geq \text{slen}\).
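The effect of this transformation can be checked with a small simulation. In the sketch below, the function name `padded_sum` and the event names are hypothetical; the trace records only which ORAM bank each secret access touches, with `sum` and `sdummy` sharing the bank `o_sum`, and out-of-bounds reads return 0 as in the oblivious array semantics.

```python
def get(m, n):
    """Oblivious array read: out-of-bounds indexes yield 0."""
    return m[n] if 0 <= n < len(m) else 0


def padded_sum(sarr, slen, N):
    """Simulate the padded loop; return (sum, trace of ORAM bank accesses)."""
    trace = []
    total, dummy = 0, 0
    p = N
    while p >= 0:                     # public guard: depends only on N
        trace.append("o_slen")        # guard of the conditional reads slen
        trace.append("o_sum")         # read sum or sdummy (same bank)
        trace.append("o_sarr")        # read sarr[p]
        value = get(sarr, p)
        if p < slen:
            total += value
        else:
            dummy += value            # dummy work, never observed in the result
        trace.append("o_sum")         # write sum or sdummy (same bank)
        p -= 1
    return total, trace


sarr = [3, 1, 4, 1, 5]
short_sum, short_trace = padded_sum(sarr, slen=2, N=4)
full_sum, full_trace = padded_sum(sarr, slen=5, N=4)
```

The two runs differ in the secret `slen` and in the computed sum, yet their traces coincide because the iteration count depends only on the public constant.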

We also do not allow loops with low guards to appear in a conditional with a high guard. As above, we may be able to transform a program to make it acceptable. For example, for some \(S\), the program \(\text{if } s > 0 \text{ then while } (p > 0) \text{ do } S; \text{ done}\) could be transformed to be \(\text{while } (p > 0) \text{ do } \text{if } s > 0 \text{ then } S; \text{ done}\) (assuming \(s\) is not written in \(S\)). This ensures once again that we do not leak information about the loop guard.

2.5 Compilation

Rather than requiring programmers to write memory-trace oblivious programs directly, we would prefer that programmers could write arbitrary programs and rely on a compiler to transform those programs to be memory trace oblivious. While more fully realizing this goal remains future work, we have developed a compiler
algorithm that automates some of the necessary tasks.

In particular, given a program \( P \) in which the inputs and outputs are labeled as \texttt{secret} or \texttt{public}, our compiler will (a) infer the least labels (\texttt{secret} or \texttt{public}) for the remaining, unannotated variables; (b) allocate all \texttt{secret} variables to distinct ORAM banks; (c) insert padding instructions in conditionals to ensure their traces are equivalent; and finally, (d) allocate instructions appearing in high conditionals to ORAM banks. These steps are sufficient to transform the \texttt{findmax} program in Section 2.2 into its memory-trace oblivious counterpart. We can also transform other interesting algorithms, such as \( k \)-means, Dijkstra’s shortest paths, and matrix multiplication, as we discuss in the next section.

We now sketch the different steps of our algorithm.

### 2.5.1 Type checking source programs

The first step is to perform \textit{label inference} on the source program to make sure that we can compile it. This is the standard, constraint-based approach to local type inference as implemented in languages like Jif \cite{Jif} and FlowCaml \cite{FlowCaml}. We introduce fresh constraint variables for the labels of unannotated program variables, and then generate constraints based on the structure of the program text. This is done by applying a variant of the type rules in Figure 2.6, with three differences. First, we treat labels \( l \) as being either \( L \), representing \texttt{public} variables; \( H \), representing \texttt{secret} variables (we can think of this as the only available ORAM bank); or \( \alpha \), representing constraint variables. Second, premises like \( l_1 \sqsubseteq l_2 \) and \( l_0 \sqcup l_1 \sqsubseteq l_2 \)
that appear in the rules are interpreted as generating constraints that are to be solved later. Third, all parts having to do with trace patterns $T$ are ignored. Most importantly, we ignore the requirement that $T_1 \sim_L T_2$ for conditionals.

Given a set of constraints generated by an application of these rules, we attempt to find the least solution to the variables $\alpha$ that appear in these constraints, using standard techniques [30]. If we can find a solution, the compilation may continue. If we cannot find a solution, then we have no easy way to make the program memory-trace oblivious, and so the program is rejected.

As an example, consider the findmax program in Section 2.2, but assume that variables $i$ and $\max$ are not annotated, i.e., they are missing the secret and public qualifiers. When type inference begins, we assign $i$ the constraint variable $\alpha_i$ and $\max$ the constraint variable $\alpha_m$. In applying the variant type rules (with the PC label $l_0$ set to $L$) to this program (that is, the part from lines 3–5), we will generate the following constraints:

\[
\begin{align*}
(\alpha_i \sqcup L) \sqcup L & \sqsubseteq L \quad \text{line 3} \\
\alpha_i & \sqsubseteq H \quad \text{line 4, for } \text{data}[i] \text{ in guard} \\
l_0 = \alpha_i \sqcup H \sqcup \alpha_m \sqcup L & \quad \text{PC label for checking if branch} \\
\alpha_i & \sqsubseteq H \quad \text{line 5, for } \text{data}[i] \text{ in assignment} \\
l_0 \sqcup (H \sqcup \alpha_i) & \sqsubseteq \alpha_m \quad \text{line 5, assignment} \\
L \sqcup (\alpha_i \sqcup L) & \sqsubseteq \alpha_i \quad \text{line 3}
\end{align*}
\]

(For simplicity we have elided the constraints on location labels.) We can see that the only possible solution to these constraints is for $\alpha_i$ to be $L$ and $\alpha_m$ to be $H$, i.e., the former is public and the latter is secret.
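The constraints above can be solved with a simple fixpoint iteration over the two-point lattice $L \sqsubseteq H$. The sketch below is one straightforward way to compute a least solution and is not necessarily the technique of [30]; the encoding of a constraint as a pair `(lhs_terms, rhs)`, meaning "the join of `lhs_terms` flows to `rhs`", and the names `alpha_i` and `alpha_m` are choices made for this example.

```python
def solve_least(constraints, variables):
    """Return the least solution over {L, H}, or None if none exists."""
    solution = {v: "L" for v in variables}   # start from the bottom of the lattice

    def value(term):
        return solution.get(term, term)      # constants "L"/"H" denote themselves

    changed = True
    while changed:
        changed = False
        for lhs, rhs in constraints:
            lhs_is_high = any(value(t) == "H" for t in lhs)
            if lhs_is_high and value(rhs) == "L":
                if rhs in solution:
                    solution[rhs] = "H"      # raise the variable just enough
                    changed = True
                else:
                    return None              # H ⊑ L can never be satisfied
    return solution


findmax_constraints = [
    (["alpha_i", "L", "L"], "L"),                                    # line 3, loop guard
    (["alpha_i"], "H"),                                              # line 4, index in guard
    (["alpha_i", "H", "alpha_m", "L", "H", "alpha_i"], "alpha_m"),   # line 5, assignment
    (["L", "alpha_i", "L"], "alpha_i"),                              # line 3, increment
]
labels = solve_least(findmax_constraints, ["alpha_i", "alpha_m"])
```

In the third constraint the PC label $l_0$ has been expanded inline. Adding a constraint such as `(["alpha_m"], "L")` makes the system unsatisfiable, which corresponds to rejecting the program.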

Assuming that the programmer minimally labels the source program, only indicating those data that must be secret and leaving all other variables unlabeled, then the main restriction on source programs is the restriction on the use of loops: all loop guards must be public, and no loop may appear in a conditional whose guard is high. As mentioned in the previous section, the programmer may transform such programs into equivalent ones, e.g., by using a constant loop bound, or by hoisting loops out of conditionals. We leave the automation of such transformations to future work.

### 2.5.2 Allocating variables to ORAM banks

Given all variables that were identified as secret in the previous stage, we need to allocate them to one or more ORAM banks. At one extreme, we could put all secret variables in a single ORAM bank. The drawback is that each access to a secret variable could cause significant overhead, since ORAM accesses are polylogarithmic in the size of the ORAM [35] (on top of the encryption/decryption cost). At the other extreme, we could put every secret variable in a separate ORAM bank. This lowers overhead by making each access cheaper but will force the next stage to insert more padding instructions, adding more accesses overall. Finally, we could attempt to choose some middle ground between these extreme methods: put some variables
in one ORAM bank, and some variables in others.

Ultimately, there is no analytic method for resolving this tradeoff, as the “break even” point for choosing padding over increased bank size, or vice versa, depends on the implementation. A profile-guided approach to optimizing might be the best approach. With our limited experience so far we observe that storing each secret variable in a separate ORAM bank generally achieves very good performance. This is because when conditional branches have few instructions, the additional padding adds only a small amount of overhead compared to the asymptotic slowdown of increased bank size. Therefore we adopt this method in our experiments. Nevertheless, more work is needed to find the best tradeoff in a practical setting.

We also need to assign secret statements (i.e., those statements whose location label must be $H$) to ORAM banks. At this stage, we assign all statements under a given conditional to the same ORAM bank, but we make a more fine-grained allocation after the next stage, discussed below.

### 2.5.3 Inserting padding instructions

The next step is to insert padding instructions into conditionals, to ensure the final premise of (T-Cond) is satisfied, so that both branches will generate the same traces.

To do this, we can apply algorithms that solve the *shortest common supersequence* problem [31] when applied to two traces (a.k.a. the 2-scs problem). That is, given the two trace patterns $T_t$ and $T_f$ for the true and false branches of an if (following ORAM bank assignment), let $T_{tf}$ denote the 2-scs of $T_t$ and $T_f$. The differences between $T_{tf}$ and the original traces signal where, and what, padding instructions must be inserted. The standard algorithm builds on the dynamic programming solution to the greatest common subsequence (gcs) problem, also known as the longest common subsequence problem, which runs in time $O(nm)$ where $n$ and $m$ are the respective lengths of the two traces [23]. Using this algorithm to find the gcs reveals which characters must be inserted into the original strings, as illustrated in Figure 2.8.

Figure 2.8: Finding a short padding sequence using the greatest common subsequence algorithm. An example with two abstract traces $T_t = [T_1; T_2; T_3; T_4; T_5]$ and $T_f = [T_1; T_3; T_2; T_4]$. One greatest common subsequence as shown is $[T_1; T_2; T_4]$. A shortest common supersequence of the two traces is $T_{tf} = [T_1; T_3; T_2; T_3; T_4; T_5]$.
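
As a minimal sketch of the 2-scs computation described here, the following Python code builds the gcs dynamic programming table and walks it to produce a shortest common supersequence, together with the symbols each trace is missing. The function names (`lcs_table`, `scs_with_padding`, `is_subsequence`) and the encoding of trace patterns as strings are illustrative assumptions, not part of the compiler. Several supersequences of minimal length can exist, so the result may differ from the one in Figure 2.8 while having the same length.

```python
from typing import List, Sequence, Tuple


def lcs_table(a: Sequence[str], b: Sequence[str]) -> List[List[int]]:
    """dp[i][j] = length of a greatest common subsequence of a[i:] and b[j:]."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def scs_with_padding(
    a: Sequence[str], b: Sequence[str]
) -> Tuple[List[str], List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Return (scs, pad_a, pad_b).

    pad_a lists (position in scs, symbol) pairs that must be inserted into a;
    pad_b does the same for b.
    """
    dp = lcs_table(a, b)
    i = j = 0
    scs: List[str] = []
    pad_a: List[Tuple[int, str]] = []
    pad_b: List[Tuple[int, str]] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Shared symbol: neither trace needs padding here.
            scs.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            # a[i] is unmatched, so b must be padded with it.
            pad_b.append((len(scs), a[i]))
            scs.append(a[i])
            i += 1
        else:
            # b[j] is unmatched, so a must be padded with it.
            pad_a.append((len(scs), b[j]))
            scs.append(b[j])
            j += 1
    for sym in a[i:]:
        pad_b.append((len(scs), sym))
        scs.append(sym)
    for sym in b[j:]:
        pad_a.append((len(scs), sym))
        scs.append(sym)
    return scs, pad_a, pad_b


def is_subsequence(s: Sequence[str], t: Sequence[str]) -> bool:
    """True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    return all(x in it for x in s)


# The two abstract traces from Figure 2.8.
T_t = ["T1", "T2", "T3", "T4", "T5"]
T_f = ["T1", "T3", "T2", "T4"]
scs, pad_t, pad_f = scs_with_padding(T_t, T_f)
print(scs, pad_t, pad_f)
```

The padding lists indicate where dummy statements would be needed: every entry in `pad_t` is an event missing from the true branch, and every entry in `pad_f` is an event missing from the false branch.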

When running 2-scs on traces, we view $T_t$ and $T_f$ as strings of characters which are themselves trace patterns due to single statements. Each statement-level pattern will always consist of zero or more of the following events: $\textbf{Read}$, $o_i$ for ORAM bank $i$. For example, suppose we have the program $\textbf{skip}; x[y] := z$ where, after ORAM bank assignment, the type of $y$ is Nat $o_1$, the type of $z$ is Nat $o_1$, and the type of $x$ is Array $o_2$. This program generates trace pattern $\epsilon @ o_1 @ o_1 @ o_2$. For the purposes of running 2-scs, this trace consists of three characters: $o_1$, $o_1$, and $o_2$, which corresponds to the statement $x[y] := z$.

\footnote{Because of the restrictions imposed by the type system, $T_t$ and $T_f$ will never contain loop patterns, (public) read-array or write patterns, or or-patterns.}

Once we have computed the 2-scs and identified the padding characters needed for each trace, we must generate “dummy” statements to insert in the program that generate the same events. This is straightforward. In essence, we can allocate a “dummy” variable $d_o$ for each ORAM bank $o$ in the program, and then read, write, and compute on that variable as needed to generate the correct event. Suppose we had the program $\text{if}(e, \text{skip}, x[y] := z)$ and thus $T_t = \epsilon$ and $T_f = o_1@o_1@o_2$. Computing the 2-scs we find that $T_t$ can be pre-padded with $o_1@o_1@o_2$ while $T_f$ need not be padded. We can readily generate statements that correspond to both. For the second, we do not need to pad anything. For the first, we can produce $d_{o_2} := d_{o_1} + d_{o_1}$. When we must produce an event corresponding to a public read, or read from an array, we can essentially just insert a read from that variable directly.

Note that this approach will generate more padding instructions than is strictly needed. In the above example, the final program will be

$$\text{if}(e, (d_{o_2} := d_{o_1} + d_{o_1}; \text{skip}), (x[y] := z))$$

Peephole optimizations can be used to eliminate some superfluous instructions. However, a better approach is to use a finer-grained alphabet which in practice is available when using three address code, i.e., as the intermediate representation of an actual compiler. We will demonstrate this in GhostRider.

Once padding has been inserted, both branches have the same number of statements, and thus we can allocate each pair of statements in its own ORAM bank. Assuming we did not drop the $\text{skip}$ statements in the program above, we could allocate them both in ORAM bank \( o_3 \) and allocate the two assignments in ORAM bank \( o_4 \), rather than allocate all instructions in ORAM bank \( o \) as is the case now.

Table 2.1: Programs and parameters used in our simulation.

<table>
<thead>
<tr>
<th>No.</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Dijkstra, \( n = 100 \) nodes</td>
</tr>
<tr>
<td>2</td>
<td>K-means, \( n = 100 \) data points, \( k = 2 \), \( I = 1 \) iteration</td>
</tr>
<tr>
<td>3</td>
<td>Matrix multiplication, \( n \times n \) matrix where \( n = 40 \)</td>
</tr>
<tr>
<td>4</td>
<td>Matrix multiplication, \( n \times n \) matrix where \( n = 25 \)</td>
</tr>
<tr>
<td>5</td>
<td>Find max, \( n = 100 \) elements in the array</td>
</tr>
<tr>
<td>6</td>
<td>Find max, \( n = 10000 \) elements in the array</td>
</tr>
</tbody>
</table>

### 2.6 Evaluation

To demonstrate the efficiency gains achieved by our compiler in comparison with the straightforward approach of placing all secret variables in the same ORAM bank, we choose four example programs: Dijkstra single-source shortest paths, K-means, Matrix multiplication (naïve \( O(n^3) \) implementation), and Find-max.

We will compare three different strategies:

**Strawman:** Place all secret variables in the same ORAM bank.

**Opt 1:** Store each variable in a separate ORAM bank, but store whole arrays in the same bank.

**Opt 2:** Store each variable and each member of an array in a separate ORAM bank.

In all three cases, we insert necessary padding to ensure obliviousness.
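
The three strategies can be pictured as three ways of mapping memory locations to bank identifiers. The sketch below is only illustrative: the function `allocate_banks`, the strategy names, and the representation of a program's secret variables as a name-to-length dictionary are all assumptions introduced here, not part of the simulator.

```python
from typing import Dict, Tuple


def allocate_banks(
    secret_vars: Dict[str, int], strategy: str
) -> Dict[Tuple[str, int], int]:
    """Map each (variable, element index) to a hypothetical ORAM bank id.

    secret_vars maps a variable name to its length (1 for a scalar).
    """
    if strategy not in ("strawman", "opt1", "opt2"):
        raise ValueError(f"unknown strategy: {strategy}")
    alloc: Dict[Tuple[str, int], int] = {}
    next_bank = 0
    for name, length in secret_vars.items():
        for idx in range(length):
            if strategy == "strawman":
                alloc[(name, idx)] = 0          # everything shares one bank
            elif strategy == "opt1":
                alloc[(name, idx)] = next_bank  # whole variable in one bank
            else:
                alloc[(name, idx)] = next_bank  # one bank per array element
                next_bank += 1
        if strategy == "opt1":
            next_bank += 1
    return alloc


def bank_count(alloc: Dict[Tuple[str, int], int]) -> int:
    """Number of distinct banks used by an allocation."""
    return len(set(alloc.values()))


program_vars = {"x": 1, "arr": 3}  # one scalar and one 3-element array
for s in ("strawman", "opt1", "opt2"):
    print(s, bank_count(allocate_banks(program_vars, s)))
```

More banks mean smaller banks and hence cheaper individual accesses, at the price of more padding when branches touch different banks.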

#### 2.6.1 Simulation Results

We performed simulations to measure the performance of the example programs when compiled by our compiler. Table 2.1 shows the parameters we chose for our experiments. We built a simulator in C++ that can measure the number of memory accesses for transformed programs. Implementing a full-fledged compiler that integrates with our ORAM-capable hardware concurrently being built [?] is left as future work.

Simulation results are given in Figure 2.9 for the six setups described in Table 2.1. The ORAM scheme we used in the simulation is due to Shi et al. [80]. The figure shows that Opt 1 is 1.3 to 5 times faster than the strawman scheme, and Opt 2 is 1 to 3 orders of magnitude faster than the strawman for the chosen programs and parameters.

### 2.7 Concluding Remarks

In this chapter, we have briefly explained how to achieve memory trace obliviousness (MTO) in a basic client-server setting. We have demonstrated how MTO is defined under $\mathcal{L}_{\text{basic}}$ syntax and semantics, and how a type system can enforce MTO by keeping track of traces. Throughout the rest of this thesis, we will exploit the same idea in more settings, where hardware (in GhostRider) or algorithmic (in ObliVM) constraints will require reasoning about more leakage. We will also empirically demonstrate the advantages of the MTO approach in these settings.

## Chapter 3: GhostRider: A Compiler-Hardware Approach

### 3.1 Introduction

In Chapter 2, we explained the basic idea of how a type system can ensure that a program is memory trace oblivious. In this chapter, we consider a realistic setting: protecting programs against physical attackers who control everything except the processor. As explained above, data encryption is necessary but not sufficient. Our approach is to build a processor with an ORAM controller [28, 63] so that it can obfuscate the memory accesses to the ORAM. As explained in Chapter 2, using ORAM for all data is not the most efficient approach, since when a program’s memory access pattern does not depend on secret data, encryption is sufficient and ORAM is not necessary. To enable such an optimization, we enhance the existing ORAM processor design to allow splitting memory into regions, consisting of one or more ORAM banks, encrypted memory, and normal DRAM.

On top of such a design, building a compiler from a high-level source language into a low-level assembly language is non-trivial, even given our results from Chapter 2. In particular, cache behavior and the handling of assembly code make designing an MTO type system much more challenging.

In this chapter, we present GhostRider as a hardware-compiler co-design approach to achieve memory trace obliviousness and efficiency at the same time.

This chapter is based on a paper that I co-authored with Austin Harris, Michael Hicks, Martin Maas, Mohit Tiwari, and Elaine Shi [57]. I developed the formalism and the proof with the help of Michael Hicks. I also developed the compiler, which implements both the optimization and the type checker, and emits code that is runnable on an FPGA implementation developed by my co-authors Austin Harris, Martin Maas, and Mohit Tiwari. I conducted experiments to show the compiler’s effectiveness with the help of Austin Harris, Martin Maas, and Mohit Tiwari.

#### 3.1.1 Our Results and Contributions

In this chapter, we make the first endeavor to bring the theory of MTO to practice. We design and build GhostRider, a hardware/software platform for provably secure, memory-trace oblivious program execution. Compiling to a realistic architecture while formally ensuring MTO raises interesting challenges in the compiler and type system design, and ultimately requires a co-operative re-design of the underlying processor architecture. Our contributions are:

**New compiler and type system.** We build the first memory-trace oblivious compiler that emits target code for a realistic ORAM-capable processor architecture. The compiler must explicitly handle low-level resource allocation based on underlying hardware constraints, and while doing so is standard in non-oblivious compilers, achieving it while respecting the MTO property is non-trivial. Standard resource allocation mechanisms would fail to address the MTO property. For example, register allocation spills registers to the stack, thereby introducing memory events. Furthermore, caching serves memory requests from an on-chip cache, which suppresses memory events. If these actions are correlated with secret data, they can leak information. We introduce new techniques for resolving such challenges. In lieu of implicit caches we employ an explicit, on-chip scratchpad. Our compiler implements caching in software when its use does not compromise MTO.

To formally ensure the MTO property, we define a new type system for a RISC-style low-level assembly language. We show that any well-typed program in this assembly language will respect memory-trace obliviousness during execution. When starting from source programs that satisfy a standard information flow type system [26], our compiler generates type-correct, and therefore safe, target code. Specifically, we implement a type checker that can verify the type-correctness of the target code.

**Processor architecture for MTO program execution.** To enable an automated approach for efficient memory-trace oblivious program execution, we need new hardware features that are not readily available in existing ORAM-capable processor architectures [27, 29, 63]. GhostRider builds on the Phantom processor architecture [63] but exposes new features and knobs to the software. In addition to supporting a scratchpad, as mentioned above, the GhostRider architecture complements Phantom’s ORAM support with encrypted RAM (ERAM), which is not oblivious and therefore more efficiently supports variables whose access patterns are not sensitive. Section 3.6 describes additional hardware-level contributions. We prototyped the GhostRider processor on a Convey HC2 platform [21] with programmable FPGA support. The GhostRider processor supports the RISC-V instruction set [93].

**Implementation and Empirical Results.** Our empirical results are obtained through a combination of software emulation and experiments on an FPGA prototype. Our FPGA prototype supports one ERAM bank, one code ORAM bank, and one data ORAM bank. The real processor experiments demonstrate the feasibility of our architecture, while the software simulator allows us to test a range of configurations not limited by the constraints of the current hardware. In particular, the software simulator models multiple ORAM banks at a higher clock rate.

Our experimental results show that compared to the baseline approach of placing everything in a single ORAM bank, our compile-time static analysis achieves up to nearly an order-of-magnitude speedup for many common programs.

### 3.2 Architecture and Approach

This section motivates our approach and presents an overview of GhostRider’s hardware/software co-design.

#### 3.2.1 Motivating example

We wish to support a scenario in which a client asks an untrusted cloud provider to run a computation on the client’s private data. For example, suppose the client wants the provider to run the program shown in Figure 3.1, which is a simple histogram program written in a C-like source language. As input, the program takes an integer array a, and as output it modifies integer array parameter c. We assume both arrays have size 100,000. The function’s code is straightforward, computing the histogram of the absolute values of integers modulo 1000 appearing in the input array. The client’s security goal is data confidentiality: the cloud provider runs the program on input array a, producing output array c, but nevertheless learns nothing about the contents of either a or c. We express this goal by labeling both arrays with the qualifier secret (data labeled public is non-sensitive).

```c
void histogram(secret int a[],    // ERAM
               secret int c[]) {  // ORAM (output)
    public int i;
    secret int t, v;
    for (i = 0; i < 100000; i++)    // 100000 <= len(c)
        c[i] = 0;
    i = 0;
    for (i = 0; i < 100000; i++) {  // 100000 <= len(a)
        v = a[i];
        if (v > 0) t = v % 1000;
        else t = (0 - v) % 1000;
        c[t] = c[t] + 1;
    }
}
```

Figure 3.1: Motivating source program of GhostRider.

#### 3.2.2 Threat model

The adversary has physical access to the machine(s) being used to run client computations. As in prior work that advocates the minimization of the hardware trusted computing base (TCB) [85–87], we assume that trust ends at the boundary of the secure processor. Off-chip components are considered insecure, including memory, system buses, and peripherals. For example, we assume the adversary can observe the contents of memory, and can observe communications on the bus
between the processor and the memory. By contrast, we assume that on-chip components are secure. Specifically, the adversary cannot observe the contents of the cache, the register file, or any on-chip communications. Finally, we assume the adversary can make fine-grained timing measurements, and therefore can learn, for example, the gap between observed events. Analogous side channels such as power consumption are outside the scope of this chapter.

#### 3.2.3 Architectural Overview

As mentioned in the introduction, one way to defend against such an adversary is to place all data in a single (large) ORAM; e.g., for the program in Figure 3.1 we place the arrays a and c in ORAM. Unfortunately this baseline approach is not only expensive, but also leaks information through the total number of ORAM accesses (if the access trace is not padded to a value that is independent of secret data). We
now provide an architectural overview of GhostRider (Figure 3.2) and contrast it with this baseline.

**Joint ORAM-ERAM memory system** In the GhostRider architecture, main memory is split into three types—normal (unencrypted) memory (RAM), encrypted memory (ERAM), and oblivious RAM (ORAM)—with one or more (logical) banks of each type comprising the system’s physical memory. The differentiation of memory into banks allows a compiler to place only arrays with sensitive access patterns inside the more expensive ORAM banks, while keeping the remaining data in the significantly faster RAM or ERAM banks. For example, notice that in the program in Figure 3.1 the array \texttt{a} is always accessed sequentially while access patterns to the array \texttt{c} can depend on secret array contents. Therefore, our GhostRider compiler can place the array \texttt{a} inside an ERAM bank, and place the array \texttt{c} inside an ORAM bank. The program accesses different memory banks at the level of blocks using instructions that specify the bank and a block-offset within the bank (after moving data to on-chip memory as described below). Our hardware prototype fixes block sizes to be 4KB for both ERAM and ORAM banks (which is not an inherent limitation of the hardware design).
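
To make the placement argument concrete, here is a small Python transliteration of the histogram from Figure 3.1 that also records which indices of each array are touched. The function name `histogram_with_accesses` and the index logs are assumptions added for illustration only. The indices into a are the same for any input of a given length, while the indices into c change with the data.

```python
from typing import List, Tuple


def histogram_with_accesses(
    a: List[int], n_buckets: int = 1000
) -> Tuple[List[int], List[int], List[int]]:
    """Compute the histogram and log the indices used for a and for c."""
    c = [0] * n_buckets
    a_indices: List[int] = []
    c_indices: List[int] = []
    for i in range(len(a)):
        a_indices.append(i)  # sequential, independent of the contents of a
        v = a[i]
        if v > 0:
            t = v % n_buckets
        else:
            t = (0 - v) % n_buckets
        c_indices.append(t)  # depends on the secret value v
        c[t] = c[t] + 1
    return c, a_indices, c_indices


c1, a_idx1, c_idx1 = histogram_with_accesses([5, -1005, 2000])
c2, a_idx2, c_idx2 = histogram_with_accesses([7, 7, 7])
print(a_idx1 == a_idx2)  # same index sequence for both inputs
print(c_idx1 == c_idx2)  # index sequences differ with the data
```

Because only the indices into c vary with secret data, encrypting a suffices, whereas c needs the address-hiding guarantee of an ORAM bank.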

**Software-directed scratchpad** As mentioned earlier, cache hit and miss behavior can lead to differences in the observable memory traces. To prevent such cache-channel information leakage, the GhostRider architecture turns off implicit caching, and instead offers software-directed scratchpads for both instructions and data. These scratchpads are mapped into the program’s address space so that the compiler can generate code to access them explicitly, and thereby avoid information leaks.
For example, the indices of array $a$ in Figure 3.1 are deterministic; they do not depend on any secret input. As such, it is safe to use the scratchpad to cache array $a$’s accesses. The compiler generates code to check whether the relevant block is in the scratchpad, and if not loads the block from memory. On the other hand, all accesses to array $c$ depend on the secret input $a$, so a memory request will always be issued independent of whether the requested block is in the scratchpad or not.
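
The following sketch illustrates why software-managed caching is safe for public indices but not for secret-dependent ones. It models a scratchpad holding a single block and counts off-chip block requests. The block size `SIZE_BLK`, the function `offchip_requests`, and the single-block scratchpad are simplifying assumptions made for this example, not a description of the GhostRider hardware.

```python
from typing import Iterable, List, Optional

SIZE_BLK = 4  # hypothetical block size, in words


def offchip_requests(indices: Iterable[int], cache_in_scratchpad: bool) -> List[int]:
    """Return the block addresses requested from off-chip memory.

    With cache_in_scratchpad=True, a request is skipped when the needed block
    is already resident; with False, every access issues a request.
    """
    requests: List[int] = []
    resident: Optional[int] = None  # block currently held in the scratchpad
    for i in indices:
        blk = i // SIZE_BLK
        if cache_in_scratchpad and blk == resident:
            continue  # served on chip, invisible to the adversary
        requests.append(blk)
        resident = blk
    return requests


# Public, sequential indices: the request pattern is fixed by public data.
print(offchip_requests(range(8), cache_in_scratchpad=True))

# Secret-dependent indices: caching makes the number of requests vary.
secret_run_1 = [0, 0, 0, 0]
secret_run_2 = [0, 4, 0, 4]
print(len(offchip_requests(secret_run_1, True)), len(offchip_requests(secret_run_2, True)))

# Always issuing a request removes that difference.
print(len(offchip_requests(secret_run_1, False)), len(offchip_requests(secret_run_2, False)))
```

In the second case, the count of off-chip requests alone distinguishes the two secret inputs, which is why accesses to array c always go to memory.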

**Deterministic Processor Pipeline** To avoid timing-channel leakage, our pipelined processor ensures that instruction timings are deterministic. We do not use dynamic branch prediction and fix variable-duration instructions, such as division, to take the worst-case execution time, and disable concurrent execution of other instructions.

**Initialization** We design the oblivious processor and memory banks as a *co-processor* that runs the application natively (i.e., without an OS) and is connected to a networked host computer that can be accessed remotely by a user. We assume that the secure co-processor has non-volatile memory for storing a long-term public key (certified using PKI), such that the client can securely ship its encrypted code and data to the remote host, and initialize execution on the secure co-processor. Implementing the secure attestation is standard [4], and we leave it to future work.

### 3.3 Formalizing the target language

This section presents a small formalization of GhostRider’s instruction set, which we call $\mathcal{L}_{\text{GhostRider}}$. The next section presents a type system for this language that guarantees security, and the following section describes our compiler from a C-like source language to well-typed \( \mathcal{L}_{\text{GhostRider}} \) programs.

\[
\begin{array}{l}
m, n \in \mathbb{Z} \qquad o_1, \ldots, o_n \in \text{ORAMbanks} \\
k \in \text{Block IDs} \qquad r \in \text{Registers} \\
l \in \text{Labels} = \{D, E\} \cup \text{ORAMbanks}
\end{array}
\]

\[
\begin{array}{rcll}
i & ::= & \text{ldb} \ k \leftarrow l[r] & \text{load block to scratchpad} \\
  & \mid & \text{stb} \ k & \text{store block to memory} \\
  & \mid & r \leftarrow \text{idb} \ k & \text{retrieve the block ID} \\
  & \mid & \text{ldw} \ r_1 \leftarrow k[r_2] & \text{load a scratchpad val. to reg.} \\
  & \mid & \text{stw} \ r_1 \rightarrow k[r_2] & \text{store a reg. val. to scratchpad} \\
  & \mid & r_1 \leftarrow r_2 \ aop \ r_3 & \text{compute an operation} \\
  & \mid & r_1 \leftarrow n & \text{assign a constant to a register} \\
  & \mid & \text{jmp} \ n & \text{(relative) jump} \\
  & \mid & \text{br} \ r_1 \ rop \ r_2 \hookrightarrow n & \text{compare and branch} \\
  & \mid & \text{nop} & \text{empty operation} \\
I & ::= & i \mid I; i & \text{instruction sequence}
\end{array}
\]

Figure 3.3: Syntax for \( \mathcal{L}_{\text{GhostRider}} \) language, comprising (1) \( \text{ldb} \) and \( \text{stb} \) instructions that move data blocks between scratchpad and a specific ERAM or ORAM bank, and (2) scratchpad-to-register moves and standard RISC instructions.

#### 3.3.1 Instruction set

The core instructions of \( \mathcal{L}_{\text{GhostRider}} \) are in the style of RISC-V [93], our prototype’s instruction set, and are formalized in Figure 3.3. We define labels \( l \) that distinguish the three kinds of main memory: \( D \) for normal (D)RAM, \( E \) for ERAM, and \( o_i \) for ORAM. For the last, the \( i \) identifies a particular ORAM bank. We can view each label as defining a distinct address space.

The instruction \( \text{ldb} \ k \leftarrow l[r] \) loads a block from memory into the scratchpad.\(^1\) Here, \( l \) is the address space, \( r \) is a register containing the address of the block to load from within that address space, and \( k \) is the scratchpad block identifier. Our formalism refers to scratchpad blocks by their identifier, treating them similarly to registers. Our architecture remembers the address space and block address within that address space that the scratchpad block was loaded from so that writebacks, via the \texttt{stb} $k$ instruction, will go to the original location. We enforce this one-to-one mapping to avoid information leaks via write-back from the scratchpad (e.g., where a scratchpad block is written to in memory could reveal information about a secret, or the effect of a write could do so if blocks are aliased).

\(^1\)In our hardware prototype the scratchpad is mapped into addressable memory, so this instruction and its counterpart, \texttt{stb}, are implemented as data transfers. In addition, the compiler implements \texttt{idb}. We model them in \( \mathcal{L}_{\text{GhostRider}} \) explicitly for simplicity; see Section 3.6 for implementation details.

To access values from the scratchpad, we have scratchpad-load and scratchpad-store instructions. The scratchpad-load instruction loads a word from a scratchpad block, having the form $\texttt{ldw} \ r_1 \leftarrow k[r_2]$. Assuming register $r_2$ contains $n$, this instruction loads the $n$-th word in block $k$ into register $r_1$ (notice that we use word-oriented addressing, not byte-oriented). The scratchpad-store instruction is similar, but goes in the reverse direction. The instruction $r \leftarrow \texttt{idb} \ k$ retrieves the block offset of a scratchpad block $k$.

We have two kinds of assignment instructions, one in the form of $r_1 \leftarrow r_2 \ aop \ r_3$, and the other in the form of $r \leftarrow n$. In $\mathcal{L}_{\text{GhostRider}}$ we only model integer arithmetic operations, such as addition, subtraction, multiplication, division, and modulus.

Jumps and branches use relative addressing. The jump instruction $\texttt{jmp} \ n$ bumps the program counter by $n$ instructions (where $n$ can be negative). Branches, having the form $\texttt{br} \ r_1 \ rop \ r_2 \hookrightarrow n$, will compare the contents of $r_1$ and $r_2$ using $rop$, and will bump the pc by $n$ if the comparison result is true. An instruction sequence $I$ is defined to be a sequence of instructions concatenated using a logical operation ;.
We overload ; to operate over two instruction sequences such that $I; (I'; i) \triangleq (I; I'); i$ and $I_1; I_2; I_3 \triangleq (I_1; I_2); I_3$.

Note that our formalism does not model the instruction scratchpad; essentially it assumes that all code is loaded on-chip prior to the start of its execution. Section 3.5 discusses how the instruction scratchpad is used in practice.

### 3.3.2 Example

Figure 3.4 shows $\mathcal{L}_{\text{GhostRider}}$ code that corresponds to the body of the second for loop in the source program from Figure 3.1. We write $r_X$ for a register corresponding to variable $X$ in the source program (for simplicity) and write $t_i$ for $i \in \{1, 2, \ldots\}$ for temporary registers. In the explanation we refer to the names of variables in the source program when describing what the target program is computing.

The first four lines load the $i$th element of array $a$ into $v$. Line 1 computes the address of the block in memory that contains the $i$th element of array $a$ and
line 2 computes the offset of the element within that block. Here $size_{blk}$ is the size
of each block, which is an architecture constant. Line 3 then loads the block from
ERAM, and line 4 loads the appropriate value from the loaded block into $v$.
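
The four steps just described amount to splitting an element index into a block address and an in-block offset, then performing a block load followed by a word load. The sketch below mirrors that sequence in Python. The helper names `locate` and `load_element`, the dictionary-based memory, the block size of 4 words, and the assumption that the array starts at block 0 are all illustrative choices, not details taken from Figure 3.4.

```python
from typing import Dict, List, Tuple


def locate(i: int, size_blk: int) -> Tuple[int, int]:
    """Split word index i into (block address, offset within the block)."""
    return i // size_blk, i % size_blk


def load_element(memory: Dict[int, List[int]], i: int, size_blk: int) -> int:
    """Model the four-step load: address, offset, block load, word load."""
    blk_addr, offset = locate(i, size_blk)  # steps 1 and 2
    scratchpad_block = memory[blk_addr]     # step 3: load the block (ldb)
    return scratchpad_block[offset]         # step 4: load the word (ldw)


# Hypothetical memory bank with two blocks of four words each.
bank = {0: [10, 11, 12, 13], 1: [14, 15, 16, 17]}
print(locate(5, 4))
print(load_element(bank, 5, 4))
```

Word-oriented addressing keeps the offset computation a plain remainder, with no scaling by the word size.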

The next five lines implement the conditional. Line 5 jumps three instructions
forward if $v$ is not greater than 0, else it falls through to line 6, which computes $t$.
Line 7 then jumps past the else branch, which begins on line 8, which negates $v$ to
make it positive before computing $t$.

The final seven lines increment $c[t]$. Lines 10–13 are analogous to lines 1–4;
they compute the address of the $t$th element of array $c$ and load it into temporary
$t_3$. Notice that this time the block is loaded from ORAM, not ERAM. Line 14
increments the temporary; line 15 stores it back to the block in the scratchpad; and
line 16 stores the entire block back to ORAM.

### 3.4 Security by typing

This section presents a type system for $\mathcal{L}_{\text{GhostRider}}$ that guarantees programs obey the
strong memory trace obliviousness (MTO) security property.

#### 3.4.1 Memory Trace Obliviousness

Memory trace obliviousness is a noninterference property that also considers
the address trace, rather than just the initial and final contents of memory [79].
MTO’s definition relies on the notion of low equivalence which relates memories
whose RAM contents are identical. We formally define this notion below, using the
following formal notation:

\[
\begin{aligned}
M &\in \text{Addresses} \to \text{Blocks} \\
a &\in \text{Addresses} = \text{Labels} \times \text{Nat} \\
b &\in \text{Blocks} = \text{Nat} \to \mathbb{Z}
\end{aligned}
\]

We model a memory \( M \) as a map from addresses to blocks, where an address is a pair consisting of a label \( l \) (corresponding to an ORAM, ERAM, or RAM bank, as per Figure 3.3) and an address \( n \) in that bank. A block is modeled as a map from an address \( n \) to an (integer) value. Here is the definition of memory low-equivalence:

**Definition 4** (Memory low equivalence). *Two memories \( M_1, M_2 \) are low equivalent, written \( M_1 \sim_L M_2 \), if and only if for all \( n \) such that \( 0 \leq n < \text{size}(D) \) we have \( M_1(D,n) = M_2(D,n) \).*

The definition states that memories \( M_1 \) and \( M_2 \) are low equivalent when the values in their RAM banks are the same; all other values may differ.
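
A direct reading of Definition 4 can be sketched as follows. The representation of a memory as a dictionary keyed by (label, address) pairs, the use of tuples for blocks, and the name `low_equivalent` are assumptions made for this illustration.

```python
from typing import Dict, Tuple

Address = Tuple[str, int]   # (label, block address); "D" denotes the RAM bank
Block = Tuple[int, ...]     # a block as a fixed sequence of integer words
Memory = Dict[Address, Block]


def low_equivalent(m1: Memory, m2: Memory, size_d: int) -> bool:
    """True if m1 and m2 agree on every RAM block with 0 <= n < size_d."""
    return all(m1[("D", n)] == m2[("D", n)] for n in range(size_d))


m1: Memory = {("D", 0): (1, 2), ("E", 0): (3, 4), ("o1", 0): (5, 6)}
m2: Memory = {("D", 0): (1, 2), ("E", 0): (9, 9), ("o1", 0): (0, 0)}
m3: Memory = {("D", 0): (1, 7), ("E", 0): (3, 4), ("o1", 0): (5, 6)}

print(low_equivalent(m1, m2, size_d=1))  # ERAM and ORAM contents are ignored
print(low_equivalent(m1, m3, size_d=1))  # RAM contents differ
```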

Intuitively, memory trace obliviousness says two things given two low-equivalent memories. First, if the program will terminate under one memory, then it will terminate under the other. Second, if the program will terminate and lead to a trace \( t \) under one memory, then it will do so under the other memory as well while also finishing with low-equivalent memories.

To state this intuition precisely, we need a formal definition of a \( \mathcal{L}_{\text{GhostRider}} \) execution, which we give as an operational semantics. The semantics is largely standard, and can be found in the Appendix. The key judgment has the form $I \vdash (R, S, M, pc) \rightarrow_t (R', S', M', pc')$, which states that program $I$, with a register file $R$, a (data) scratchpad $S$, a memory $M$, and a program counter $pc$, executes some number of steps, producing memory trace $t$ and resulting in a possibly modified register file $R'$, scratchpad $S'$, memory $M'$, and program counter $pc'$.

**Definition 5** (Memory trace obliviousness). A program $I$ is memory trace oblivious if and only if for all memories $M_1 \sim_L M_2$, whenever $I \vdash (R_0, S_0, M_1, 0) \rightarrow_{t_1} (R'_1, S'_1, M'_1, pc_1)$, $I \vdash (R_0, S_0, M_2, 0) \rightarrow_{t_2} (R'_2, S'_2, M'_2, pc_2)$, and $|t_1| = |t_2|$, we have $t_1 \equiv t_2$ and $M'_1 \sim_L M'_2$.

Here $R_0$ is a mapping that maps every register to 0, and $S_0$ maps every address to an all-0 block. Traces $t$ consist of reads/writes to RAM (both address and value) and ERAM (just the address), accesses to ORAM (just the bank), and instruction fetches. For the last we only model that a fetch happened, not what instruction it is, as we assume code will be stored in a scratchpad on chip. We write $t_1 \equiv t_2$ to say that traces $t_1$ and $t_2$ are indistinguishable to the attacker; i.e., they consist of the same events in the same order. Our formalism models every instruction as taking unit time to execute; thus the trace event also models the time taken to execute the instruction. On the real GhostRider architecture, each instruction takes deterministic but non-uniform time; as this difference is conceptually easy to handle (by accounting for instruction execution times in the compiler), we do not model it formally, for simplicity (see Section 3.5).
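
The description of what each trace event exposes can be sketched as a projection from a concrete memory access to the event the adversary sees. The tuple encoding of events and the names `observe`, `FETCH`, and `indistinguishable` are assumptions introduced for this example; the formal trace syntax is the one given in the Appendix semantics.

```python
from typing import List, Tuple

Event = Tuple  # an observable trace event, encoded as a plain tuple

FETCH: Event = ("F",)  # an instruction fetch: only its occurrence is visible


def observe(label: str, op: str, addr: int, value: int) -> Event:
    """Project a concrete access onto what the adversary can observe.

    label is "D" (RAM), "E" (ERAM), or an ORAM bank name such as "o1";
    op is "read" or "write".
    """
    if label == "D":
        return (op, "D", addr, value)  # RAM: address and value are visible
    if label == "E":
        return (op, "E", addr)         # ERAM: the address only
    return (label,)                    # ORAM: just the bank


def indistinguishable(t1: List[Event], t2: List[Event]) -> bool:
    """Same events in the same order."""
    return list(t1) == list(t2)


run1 = [FETCH, observe("o1", "read", 3, 10), observe("E", "write", 5, 1)]
run2 = [FETCH, observe("o1", "write", 7, 99), observe("E", "write", 5, 2)]
print(indistinguishable(run1, run2))
```

The two runs differ in the ORAM address, the ORAM operation, and the ERAM value, none of which survive the projection, so the traces compare equal.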

\[
\begin{array}{ll}
\text{Sym. vals.} & sv \in \text{SymVals} = n \mid\ ? \mid sv_1 \ aop \ sv_2 \mid M_l[k, sv] \\
\text{Sym. store} & Sym \in \text{Registers} \cup \text{Block IDs} \rightarrow \text{SymVals} \\
\text{Sec. labels} & \ell \in \text{SecLabels} = L \mid H \\
\text{Label map} & \Upsilon \in (\text{Registers} \rightarrow \text{SecLabels}) \cup (\text{Block IDs} \rightarrow \text{Labels})
\end{array}
\]

\[
\boxed{sv_1 \equiv sv_2} \qquad
\frac{sv_1 = sv_2 \quad \vdash_{\text{safe}} sv_1 \quad \vdash_{\text{safe}} sv_2}{sv_1 \equiv sv_2}
\qquad
\frac{\vdash_{\text{safe}} sv_1 \quad \vdash_{\text{safe}} sv_2}{\vdash_{\text{safe}} sv_1 \ aop \ sv_2}
\]

\[
\frac{\forall r. \vdash_{\text{const}} Sym(r) \quad \forall k. \vdash_{\text{const}} Sym(k)}{\vdash_{\text{const}} Sym}
\]

Auxiliary Functions

\[
select(l, a, b, c) = \begin{cases}
    a & \text{if } l = D \\
    b & \text{if } l = E \\
    c & \text{otherwise}
\end{cases}
\]

\[
slab(l) = select(l, L, H, H)
\]

\[
ite(x, a, b) = \begin{cases}
    a & \text{if } x \text{ is true} \\
    b & \text{otherwise}
\end{cases}
\]

Figure 3.5: Symbolic values, labels, auxiliary judgments and functions

#### 3.4.2 Typing: Preliminaries

Now we give a type system for \( \mathcal{L}_{\text{GhostRider}} \) programs and prove that type-correct programs are MTO.

**Symbolic values** To ensure that the execution of a program cannot leak information via its address trace, we statically approximate what events a program can produce. An important determination made by the type system is when a secret
variable can be stored in ERAM—because the address trace will leak no information about it—and when it must be accessed in ORAM. As an example, suppose we had the source program

```plaintext
if(s) then x[i]=1 else x[i]=2;
```

If \( x \) is secret but stored in RAM, then the value in \( x[i] \) after running this program will leak the contents of secret variable \( s \). We could store \( x \) in ORAM to avoid this problem, but this is unnecessary: both branches will modify the same element of \( x \), so encrypting the content of \( x \) is enough to prevent the address trace from leaking information about \( s \). The type system can identify this situation by symbolically tracking the contents of the registers, blocks, etc.

To do this, the type rules maintain a *symbolic store* \( \text{Sym} \), which is a map from register and block IDs to symbolic values. Figure 3.5 defines symbolic values \( sv \), which consist of constants \( n \), (symbolic) arithmetic expressions, values loaded from memory \( M_l[k,sv] \), and unknowns ?. Most interesting are memory values, which represent the address of a loaded value: \( l \) indicates the memory bank it was loaded from, \( sv \) corresponds to the offset (i.e., the block number) within that bank, and \( k \) is the scratchpad block into which the memory block is loaded.\(^2\)

The type rules also make use of a *label map* \( \Upsilon \) mapping registers to security labels and block IDs to (memory) labels; the latter tracks the memory bank from which a scratchpad block was loaded.

The figure defines several judgments; the form of each judgment is boxed. The first defines when two symbolic values can be deemed equivalent, written \( sv_1 \equiv sv_2 \): they must be syntactically identical and safe static approximations. The latter is defined by the judgment \( \vdash_{\text{safe}} sv \), which accepts constants, memory accesses to RAM involving safe indexes, and arithmetic expressions involving safe values. Judgment \( \vdash_{\text{const}} sv \) says that symbolic value \( sv \) is not a memory value. That is, \( sv \) is either a constant, a \( ? \), or a binary expression not involving memory values. Further, for a symbolic store \( Sym \), if all the symbolic values that it maps to can be accepted by \( \vdash_{\text{const}} sv \), then we have \( \vdash_{\text{const}} Sym \). The latter judgment is needed when checking conditionals.

\(^2\)In actual traces \( t \), the block number \( k \) is not visible; we track it symbolically to model the scratchpad’s contents, in particular to ensure that the same memory block is not loaded into two different scratchpad blocks.
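
The three judgments on symbolic values translate almost directly into recursive predicates. The sketch below follows the prose description above; the tuple encoding of symbolic values (`("aop", op, sv1, sv2)` and `("mem", l, k, sv)`), the string `"?"` for unknowns, and the function names are assumptions chosen for this illustration.

```python
from typing import Union

# Symbolic values: an int constant, the unknown "?", an arithmetic
# expression ("aop", op, sv1, sv2), or a memory value ("mem", l, k, sv).
SymVal = Union[int, str, tuple]


def is_safe(sv: SymVal) -> bool:
    """Constants, RAM accesses with safe indexes, arithmetic on safe values."""
    if isinstance(sv, int):
        return True
    if isinstance(sv, tuple) and sv[0] == "mem":
        _, label, _k, index = sv
        return label == "D" and is_safe(index)
    if isinstance(sv, tuple) and sv[0] == "aop":
        return is_safe(sv[2]) and is_safe(sv[3])
    return False  # the unknown "?" is never a safe approximation


def is_const(sv: SymVal) -> bool:
    """A constant, a "?", or an expression not involving memory values."""
    if isinstance(sv, int) or sv == "?":
        return True
    if isinstance(sv, tuple) and sv[0] == "aop":
        return is_const(sv[2]) and is_const(sv[3])
    return False


def equivalent(sv1: SymVal, sv2: SymVal) -> bool:
    """Syntactically identical and both safe."""
    return sv1 == sv2 and is_safe(sv1) and is_safe(sv2)


addr = ("aop", "+", 1, ("mem", "D", 0, 2))
print(equivalent(addr, addr))  # identical and safe
print(equivalent("?", "?"))    # identical, but unknowns are not safe
print(is_const(("aop", "+", "?", 3)))
```

Note that two unknowns are never considered equivalent even when written identically, since nothing guarantees they denote the same run-time value.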

Finally, we give three auxiliary functions used in the type system. Based on whether \( l \) is \( D \), \( E \), or an ORAM bank, function \( \text{select}(l, a, b, c) \) returns \( a \), \( b \), or \( c \) respectively. Function \( \text{slab}(\cdot) \) maps a normal label \( l \) to a security label \( \ell \), which is either \( L \) or \( H \). The label \( H \) classifies encrypted memory—any ORAM bank and ERAM—while label \( L \) classifies RAM. These two labels form the two-point lattice with \( L \sqsubseteq H \). Note that \( L \) is equivalent to the public label used in Figure 3.1, and \( H \) is equivalent to secret. Finally, function \( \text{ite}(x, a, b) \) returns \( a \) if \( x \) is true, and returns \( b \) if \( x \) is false.
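As a minimal sketch, the three auxiliary functions might look as follows. The string encodings (`"D"` for RAM, `"E"` for ERAM, any other label treated as an ORAM bank, and `"L"`/`"H"` for security labels) and the helper `flows_to` are assumptions introduced for illustration only.

```python
def select(l, a, b, c):
    """Return a for RAM (D), b for ERAM (E), and c for an ORAM bank."""
    if l == "D":
        return a
    if l == "E":
        return b
    return c  # assumption: every other label names an ORAM bank


def slab(l):
    """Map a memory label to a security label: RAM is L, ERAM and ORAM are H."""
    return "L" if l == "D" else "H"


def flows_to(l1, l2):
    """Two-point lattice ordering: L is below H (hypothetical helper)."""
    return l1 == "L" or l2 == "H"


def ite(x, a, b):
    """Return a if x is true and b otherwise."""
    return a if x else b
```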

**Trace patterns** Figure 3.6 defines trace patterns \( T \), which approximate traces \( t \) and are largely similar to those for \( \mathcal{L}_{\text{basic}} \). The first line in the definition of \( T \) defines single events. The first two indicate reads and writes to RAM or ERAM; they reference the memory bank, the block identifier in the scratchpad, and a symbolic value corresponding to the block address (not the actual value) read or written. Pattern \( F \) corresponds to a non-memory-accessing instruction. The next pattern indicates a read or write from ORAM bank \( o \): this bank is the trace event itself because the adversary cannot determine whether an access is a read or a write, or which block within the ORAM is accessed. Trace pattern \( T_1@T_2 \) is the pattern resulting from the concatenation of patterns \( T_1 \) and \( T_2 \). Pattern \( T_1 + T_2 \) represents either \( T_1 \) or \( T_2 \), and is used to type conditionals. Finally, pattern \( \text{loop}(T_1, T_2) \) represents zero or more loop iterations where the guard’s trace is \( T_1 \) and the body’s trace is \( T_2 \).

\[
\text{Trace Pats.} \quad T ::= \text{read}(l, k, sv) \mid \text{write}(l, k, sv) \mid F \mid o \mid T_1@T_2 \mid T_1 + T_2 \mid \text{loop}(T_1, T_2)
\]

\[
\frac{sv_1 \equiv sv_2}{\text{read}(l, k, sv_1) \equiv \text{read}(l, k, sv_2)}
\qquad
\frac{sv_1 \equiv sv_2}{\text{write}(l, k, sv_1) \equiv \text{write}(l, k, sv_2)}
\qquad
\frac{T_1 \equiv T_1' \quad T_2 \equiv T_2'}{T_1@T_2 \equiv T_1'@T_2'}
\]

Figure 3.6: Trace patterns and their equivalence in \( \mathcal{L}_{\text{GhostRider}} \)

Trace pattern equivalence $T_1 \equiv T_2$ is defined in Figure 3.6. In this definition, reads are equivalent to other reads accessing exactly the same location; the same goes for writes. Two ORAM accesses to the same ORAM bank are obviously treated as equivalent. Sum patterns specify possibly different trace patterns, and loop patterns do not specify the number of iterations; as such we cannot determine their equivalence statically. The concatenation operator @ is associative with respect to equivalence.
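The equivalence check described above can be sketched by flattening concatenations (which handles associativity) and comparing events pairwise. The tuple encoding of patterns, the string `"F"`, and the treatment of `F` as equivalent to itself are assumptions of this sketch; sum and loop patterns are conservatively rejected, matching the statement that their equivalence cannot be determined statically.

```python
def is_safe(sv):
    """Safe static approximation: ints, RAM ("D") reads, arithmetic thereof."""
    if isinstance(sv, int):
        return True
    if sv == "?":
        return False
    if sv[0] == "bop":
        return is_safe(sv[2]) and is_safe(sv[3])
    if sv[0] == "mem":
        return sv[1] == "D" and is_safe(sv[3])
    return False


def flatten(pattern):
    """Turn nested ("cat", T1, T2) patterns into a flat event list."""
    if isinstance(pattern, tuple) and pattern[0] == "cat":
        return flatten(pattern[1]) + flatten(pattern[2])
    return [pattern]


def event_equiv(a, b):
    """Compare two non-concatenation patterns."""
    if a == "F" or b == "F":
        return a == b
    if a[0] != b[0]:
        return False
    if a[0] == "oram":  # same ORAM bank
        return a[1] == b[1]
    if a[0] in ("read", "write"):  # same bank, block, and equivalent address
        return a[1:3] == b[1:3] and a[3] == b[3] and is_safe(a[3])
    return False  # "sum" and "loop" are never judged equivalent


def trace_equiv(t1, t2):
    e1, e2 = flatten(t1), flatten(t2)
    return len(e1) == len(e2) and all(event_equiv(a, b) for a, b in zip(e1, e2))
```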
T-LOAD

\[
\frac{
\begin{array}{c}
l \notin \text{ORAMbanks} \Rightarrow \Upsilon(r) = L \\
\Upsilon' = \Upsilon[k \mapsto l] \quad \text{Sym}' = \text{Sym}[k \mapsto \text{Sym}(r)] \\
T_0 = \text{read}(l, k, \text{Sym}(r)) \quad T = \text{select}(l, T_0, T_0, l)
\end{array}
}{
\ell \vdash \text{ldb} \; k \leftarrow l[r] : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T
}
\]

T-STORE

\[
\frac{
\begin{array}{c}
\text{Sym}(k) = sv \quad \Upsilon(k) = l \\
T_0 = \text{write}(l, k, sv) \quad T = \text{select}(l, T_0, T_0, l)
\end{array}
}{
\ell \vdash \text{stb} \; k : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon, \text{Sym} \rangle; T
}
\]

T-LOADW

\[
\frac{
\begin{array}{c}
l = \Upsilon(k) \quad \Upsilon(r_2) \sqsubseteq \text{slab}(l) \\
\Upsilon' = \Upsilon[r_1 \mapsto \text{slab}(l)] \quad \text{Sym}' = \text{Sym}[r_1 \mapsto M_l[k, \text{Sym}(r_2)]]
\end{array}
}{
\ell \vdash \text{ldw} \; r_1 \leftarrow k[r_2] : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; F
}
\]

T-STOREW

\[
\frac{
\ell \sqcup \Upsilon(r_1) \sqcup \Upsilon(r_2) \sqsubseteq \text{slab}(\Upsilon(k))
}{
\ell \vdash \text{stw} \; r_1 \rightarrow k[r_2] : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon, \text{Sym} \rangle; F
}
\]

T-IDB

\[
\frac{
\begin{array}{c}
\text{Sym}(k) = sv \quad \Upsilon(k) = l \\
\Upsilon' = \Upsilon[r \mapsto \text{select}(l, L, L, H)] \quad \text{Sym}' = \text{Sym}[r \mapsto sv]
\end{array}
}{
\ell \vdash r \leftarrow \text{idb} \; k : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; F
}
\]

T-BOP

\[
\frac{
\begin{array}{c}
\ell' = \Upsilon(r_2) \sqcup \Upsilon(r_3) \quad \Upsilon' = \Upsilon[r_1 \mapsto \ell'] \\
\text{Sym}' = \text{Sym}[r_1 \mapsto \text{Sym}(r_2) \text{ aop } \text{Sym}(r_3)]
\end{array}
}{
\ell \vdash r_1 \leftarrow r_2 \text{ aop } r_3 : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; F
}
\]

T-ASSIGN

\[
\frac{
\Upsilon' = \Upsilon[r \mapsto L] \quad \text{Sym}' = \text{Sym}[r \mapsto n]
}{
\ell \vdash r \leftarrow n : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; F
}
\]

T-NOP

\[
\frac{}{
\ell \vdash \text{nop} : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon, \text{Sym} \rangle; F
}
\]

T-SEQ

\[
\frac{
\begin{array}{c}
\ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T_1 \\
\ell \vdash \iota : \langle \Upsilon', \text{Sym}' \rangle \rightarrow \langle \Upsilon'', \text{Sym}'' \rangle; T_2
\end{array}
}{
\ell \vdash I ; \iota : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon'', \text{Sym}'' \rangle; T_1 @ T_2
}
\]

Figure 3.7: Security Type System for \( \mathcal{L}_{\text{GhostRider}} \) (Part 1)
3.4.3 Type rules

Figures 3.7, 3.8, and 3.9 define the security type system for \( \mathcal{L}_{\text{GhostRider}} \). The figures are divided into three parts.

**Instructions** Figure 3.7 presents judgment \( \ell \vdash \iota : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle ; T \), which is used to type instructions \( \iota \). Here, \( \ell \) is the security context, used in the standard way to prevent implicit flows. The rules are flow sensitive: The judgement says that instruction \( \iota \) has a type \( \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle \), and generates trace pattern \( T \). Informally, we can say by executing \( \iota \), a state corresponding to security type \( \langle \Upsilon, Sym \rangle \) will be changed to have type \( \langle \Upsilon', Sym' \rangle \).

Rule T-LOAD types load instructions. The first premise ensures that the contents of register \( r \), the indexing register, are not leaked by the operation. In particular, the loaded memory bank \( l \) must either be ORAM, or else the register \( r \) may only contain public data (from RAM). In the latter case, there is no issue with leaking \( r \), and in the former case \( r \) will not be leaked indirectly by the address of the loaded memory since it is stored in ORAM. The final two premises determine the final trace pattern. When the memory bank \( l \) is \( D \) or \( E \), then the trace pattern indicates a read event from the appropriate block and address. When reading from an ORAM bank the event is just that bank itself. The other premises in the rule update \( \Upsilon \) to map the loaded block \( k \) to the label of the memory bank, and update \( Sym \) to track the address of the block.
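To make the reading of T-LOAD concrete, here is a small sketch of how a checker could apply it. The dictionary representation of \( \Upsilon \) and \( Sym \), the function name `type_load`, and the use of a Python `TypeError` to signal an ill-typed instruction are all assumptions of this illustration.

```python
def type_load(upsilon, sym, k, l, r, oram_banks):
    """Sketch of T-LOAD for `ldb k <- l[r]`.

    upsilon: label map (registers -> "L"/"H", blocks -> memory bank)
    sym:     symbolic store
    Returns the updated label map, symbolic store, and trace pattern.
    """
    # First premise: outside ORAM, the index register must be public.
    if l not in oram_banks and upsilon[r] != "L":
        raise TypeError("secret index into a non-ORAM bank leaks the address")

    new_upsilon = dict(upsilon)
    new_upsilon[k] = l            # block k now holds data from bank l
    new_sym = dict(sym)
    new_sym[k] = sym[r]           # track the (symbolic) block address

    if l in oram_banks:
        trace = l                 # the ORAM bank itself is the event
    else:
        trace = ("read", l, k, sym[r])
    return new_upsilon, new_sym, trace
```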

We defer discussion of rule T-STORE for the moment, and look at the next three rules, T-LOADW, T-STOREW, and T-IDB, which are used to load and store values related to blocks in the scratchpad. The first two rules resemble standard information flow rules. The second premise of T-LOADW is similar to the first premise of T-LOAD in preventing an indirect leak of index register \( r_2 \), which would occur if the label of \( r_2 \) was \( H \) but the label of \( k \) was \( L \). Likewise, the premise of T-STOREW prevents leaking the contents of \( r_1 \) and \( r_2 \) into the stored block, and also prevents an implicit flow from \( \ell \) (the security context). As such, these two rules ensure that a block \( k \) with label \( \ell \) never contains information from memory labeled \( \ell' \) such that \( \ell \sqsubset \ell' \). The remaining premises of rule T-LOADW flow-sensitively track the label and symbolic value of the loaded register. In particular, they set the label of \( r_1 \) to be that of the block loaded, and the symbolic value of \( r_1 \) to be the address of the loaded value in memory. T-STOREW changes neither \( \Upsilon \) nor \( \text{Sym} \): even though the content of the scratchpad has changed, its memory label and its address in memory have not. Both rules emit trace pattern \( F \) as the operations are purely on-chip. We emit this event to account for the time taken to execute an instruction; assuming uniform times for instructions and memory accesses, MTO executions will also be free of timing channels.

Returning to rule T-STORE, we can see that the store takes place unconditionally: no constraints on the labels of the memory or block must be satisfied. This is because the other type rules ensure that no block \( k \) ever contains information higher than its security label \( \ell \), and thus the block can be written straight to memory having the same security label. That said, information could be leaked through the memory trace, so the emitted trace pattern will differ depending on the label of the block: if the label is \( D \) or \( E \) then the trace pattern will be a write event, and otherwise it will be the appropriate ORAM event. Leaks via the memory trace are then prevented by T-IF and T-LOOP, discussed shortly.

Branching

T-IF

\[
\frac{
\begin{array}{c}
I = \iota_1; I_t; \iota_2; I_f \quad \ell' = \ell \sqcup \Upsilon(r_1) \sqcup \Upsilon(r_2) \\
\ell' \vdash I_t : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle; T_1 \\
\ell' \vdash I_f : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle; T_2 \\
\ell' = H \Rightarrow \big( T_1 @ F \equiv T_2 \;\land\; (\ell = L \Rightarrow \vdash_{\text{const}} Sym) \;\land\; \forall r.\ \Upsilon'(r) = L \Rightarrow \vdash_{\text{safe}} Sym'(r) \big) \\
T = \text{ite}(\ell' = H,\; F @ T_1 @ F,\; F @ ((T_1 @ F) + T_2))
\end{array}
}{
\ell \vdash I : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle; T
}
\]

Figure 3.8: Security Type System for $\mathcal{L}_{\text{GhostRider}}$ (Part 2)

Rule T-IDB is similar to rule T-LOADW. For the third premise, if $l$ is either $D$ or $E$, the block $k$ has a public address, and thus the value assigned to register $r$ is public; otherwise, when $l$ is an ORAM bank, the register $r$ is secret.

Rule T-BOP types binary operations, updating the security label of the target register to be the join of labels of the source registers. Rule T-ASSIGN gives the target register label $L$ as constants are not secret. Rules T-NOP is always safe and has no effect on the symbolic store or label environment. All of these operations occur on-chip, and so have pattern $F$. Finally, rule T-SEQ types instruction sequences by composing the symbolic maps, label environments, and traces in the obvious way.
**Branching** Figure 3.8 presents rules T-IF and T-LOOP for structured control flow. Rule T-IF deals with instruction sequences of the form $I = \iota_1; I_t; \iota_2; I_f$, where $\iota_1$ is a branching instruction deciding which branch is taken, $\iota_2$ is a jump instruction jumping over the false branch, and $I_t$ and $I_f$ are the true and false branches respectively; the relative offsets $n_1$ and $n_2$ are based on the length of these code sequences. We require both branches to have the same type, i.e. $\langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle$, which is also the type of the sequence $I$ itself.

When the security context is high, i.e. $\ell = H$, or when the if-condition is private, i.e. $\Upsilon(r_1) \sqcup \Upsilon(r_2) = H$, then $\ell'$ will be $H$ and we impose three restrictions. First, the two blocks $I_t$ and $I_f$ must have equivalent trace patterns. (The trace of the true branch is $T_1@F$ where $T_1$ covers $I_t$ and $F$ covers the jump instruction $\iota_2$.) Second, if the security context is public, i.e. $\ell = L$, then we require $\vdash_{const} Sym$, which ensures that $Sym$ maps nothing to a memory value. The reason is that in a public context, two equivalent symbolic memory values may refer to two different concrete values, since the memory region $D$ can be modified. Third, for any register $r$, its value after taking either branch must be the same, or the register $r$ must have a high security label (i.e. $\Upsilon'(r) = H$). So if $\Upsilon'(r) = L$, the type system enforces that its symbolic values on the two paths are equivalent. Since both branches end with the same symbolic store, this amounts to $Sym'(r) \equiv Sym'(r)$, which only requires $\vdash_{safe} Sym'(r)$.

The final premise for rule T-IF states that the sequence’s trace pattern $T$ is either $F@T_1@F$ when both branches’ patterns must be equal, or else is an or-pattern involving the trace $T_2$ from the else branch.

Rule T-LOOP imposes structural requirements on $I$ similar to T-IF. The
Subtyping

T-SUB

\[
\frac{
\begin{array}{c}
\ell \vdash \iota : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle; T \\
\Upsilon' \preceq \Upsilon'' \quad Sym' \preceq Sym''
\end{array}
}{
\ell \vdash \iota : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon'', Sym'' \rangle; T
}
\]

Figure 3.9: Security Type System for \( \mathcal{L}_{\text{GhostRider}} \) (Part 3)

**Subtyping** Finally, Figure 3.9 presents subtyping rules. Rule T-SUB supports subtyping on the symbolic store and the label map. For the first, a symbolic store \( Sym \) can be approximated by a store \( Sym' \) that either agrees on the symbolic values mapped to by \( Sym \) or maps them to \( ? \). For the second, a register’s security label can be approximated by one higher in the lattice; block labels may not change. Subtyping is important for typing join points after branches or loops. For example, if a conditional assigned a register \( r \) the value 1 in the true branch but assigned \( r \) the value 2 in the false branch, we would use subtyping to map \( r \) to \( ? \) to ensure that the symbolic store at the end of both branches agrees, as required by T-IF.
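One way to picture the use of subtyping at a join point is as a merge of the two branch results. The sketch below is an assumption-laden illustration (the function names, dictionary representation, and `TypeError` signalling are hypothetical): disagreeing symbolic values are weakened to `?`, register labels are raised to their join, and block labels must already agree.

```python
def join_stores(sym_true, sym_false):
    """Weaken two symbolic stores to a common approximation."""
    return {
        name: sv if sv == sym_false[name] else "?"
        for name, sv in sym_true.items()
    }


def join_labels(ups_true, ups_false, registers):
    """Raise register labels to their join; block labels may not change."""
    joined = {}
    for name, label in ups_true.items():
        other = ups_false[name]
        if name in registers:
            joined[name] = "H" if "H" in (label, other) else "L"
        elif label != other:
            raise TypeError(f"block {name} has different labels at the join")
        else:
            joined[name] = label
    return joined
```

With the example from the text, a register assigned 1 in one branch and 2 in the other is mapped to `?` by `join_stores`.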

3.4.4 Security theorem

All well-typed programs are memory-trace oblivious:

**Theorem 2.** Given \( I, \Upsilon, \) and \( \text{Sym} \), if there exists some \( \Upsilon', \text{Sym}' \) and \( T \) such that \( L \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T \), where \( \forall r. \text{Sym}(r) = ? \) and \( \Upsilon(r) = L \) and \( \forall k. \text{Sym}(k) = ? \) and \( \Upsilon(k) = D \) then program \( I \) is memory-trace oblivious.

The proof can be found in Appendix B.

3.5 Compilation

We have developed a compiler from an imperative, C-like source language, which we call \( \mathcal{L}_S \), to \( \mathcal{L}_{\text{GhostRider}} \). Our compiler is implemented in about 7600 lines of Java, with roughly 400 LoC dedicated to the parser, 700 LoC to the type checker, 3500 LoC to the compiler/optimizer, 950 LoC to the code generator, and the remainder to utility functions. This section informally describes our compilation approach.

3.5.1 Source Language

**Syntax** An \( \mathcal{L}_S \) program is a collection of (possibly mutually recursive) functions and a collection of (possibly mutually recursive) type definitions. A type definition is simply a mapping of a type name to a type where types are either natural numbers, arrays, or pointers to records (i.e., C-style **structs**). Each type is annotated with a security label which is either secret or public, indicating whether the data should be visible/inferrable by the adversary or not.

A function consists of a sequence of statements \( s \) which are either no-ops, variable assignments, array assignments, conditionals, while loops, or returns. As usual, conditional branches and loop bodies may consist of one or more statements. Expressions \( e \) appearing in statements (e.g., in assignments) consist of variables \( x \), arithmetic ops \( e_1 \ aop \ e_2 \), array reads \( e[e] \), and numeric constants \( n \). Variables may hold any data other than functions (i.e., there are no function pointers). Guards in conditionals and while loops consist of predicates involving relational operators.

**Typing** \( \mathcal{L}_S \) programs are type checked before they are compiled. We do this using an information flow-style type system (cf. the survey of Sabelfeld and Myers [79]). As is standard, the type system prevents explicit flows and implicit flows. In particular, it disallows assignments like \( p = s \) where \( p \) is a public variable and \( s \) is a secret variable, and disallows conditionals like \( \text{if} \ (s == 0) \ \text{then} \ p = 0 \ \text{else} \ p = 1 \), which leak information about \( s \) since after the conditional the adversary knows \( p == 0 \) implies \( s == 0 \). It also disallows array writes like \( p[s] = 5 \) since the adversary can learn the value of \( s \) by seeing which element of the public array has changed. Note that accessing \( s[p] \) is safe because, despite knowing the index, an adversary cannot learn the value being accessed.
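The three rejected examples and the accepted one can be reproduced with a very small label checker. This is a generic information-flow sketch, not the compiler's type checker; the label strings and function names are assumptions, and `pc` denotes the label of the enclosing control-flow context.

```python
def join(a, b):
    """Least upper bound of two labels in the public/secret lattice."""
    return "secret" if "secret" in (a, b) else "public"


def flows(src, dst):
    """Information may flow from src to dst unless it is secret -> public."""
    return src == "public" or dst == "secret"


def check_assign(pc, rhs_label, lhs_label):
    """x = e: both the context and the right-hand side flow into x."""
    return flows(join(pc, rhs_label), lhs_label)


def check_array_write(pc, index_label, rhs_label, array_label):
    """a[i] = e: the index is also revealed by which element changes."""
    return flows(join(pc, join(index_label, rhs_label)), array_label)


def array_read_label(index_label, array_label):
    """Label of the value produced by a[i]."""
    return join(index_label, array_label)
```

For instance, `check_assign("public", "secret", "public")` models `p = s` and is rejected, while `array_read_label("public", "secret")` models `s[p]` and simply yields a secret value.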

To prevent the length of a memory trace from revealing information, we require that loop guard expressions only involve public values (which is a standard restriction [79]). One can work around this problem by “padding out” loop iterations, e.g., by converting a loop like
\[
\text{while} \ (\text{slen} > 0) \ \{ \ \text{sarr[slen--]}++; \ \}
\]
to be
\[
\text{plen} = N; \ \text{while} \ (\text{plen} > 0) \ \{ \ \text{if} \ (\text{plen <= slen}) \ \text{sarr[plen]}++; \ \text{plen--}; \ \}
\]

where \( N \) is a large, fixed constant. For similar reasons we also require that whether a function is called or returned from, and which function it is, may not depend on secret information (e.g., the call or return may not occur in a conditional whose guard involves secret information).
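The effect of padding out loop iterations can be checked with a small model. The sketch below places the decrement outside the conditional so that the loop always runs exactly \( N \) times; the function names and the iteration counter are assumptions added for illustration. It models only the iteration count: the Python `if` is not itself oblivious, and in the compiled code the secret conditional would additionally need its branches padded.

```python
def original(sarr, slen):
    """Loop whose iteration count depends on the secret slen."""
    sarr = list(sarr)
    iterations = 0
    while slen > 0:
        sarr[slen] += 1
        slen -= 1
        iterations += 1
    return sarr, iterations


def padded(sarr, slen, n):
    """Padded loop: always n iterations, assuming slen <= n < len(sarr)."""
    sarr = list(sarr)
    plen = n
    iterations = 0
    while plen > 0:
        if plen <= slen:          # secret guard, public loop bound
            sarr[plen] += 1
        plen -= 1
        iterations += 1
    return sarr, iterations
```

Both versions compute the same array, but only the first reveals `slen` through the number of iterations.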

**Compilation overview** After source-language type checking, compilation proceeds in four stages—memory layout, translation, padding, and register allocation—after which the result is type checked using the \( \mathcal{L}_{\text{GhostRider}} \) type system, to confirm that it is memory-trace oblivious.\(^3\)

### 3.5.2 Memory bank allocation

The first stage of compilation allocates global variables to memory banks. Public variables are always stored in RAM, while secret variables will be allocated either to ERAM or ORAM. Two blocks in the scratchpad are reserved for secret and public variables, respectively, that will fit entirely within the block; these are essentially those that contain numbers, (pointers to) records, and small arrays. Such variables will be loaded into the scratchpad at the start of executing a program, and written back to memory at the end. The remaining scratchpad blocks are used for handling (large) arrays; the compiler will always use the same block for the same array. Public arrays are allocated in RAM; secret arrays are allocated in ERAM if they are always indexed by public values, and in ORAM otherwise. The compiler initially assigns a distinct logical ORAM bank for each secret array, and allocates logical banks up to the hardware limit.

\(^3\)This is essentially a kind of *translation validation* [73], which removes the compiler from the trusted computing base. We believe that well typed \( \mathcal{L}_S \) programs yield well typed \( \mathcal{L}_{\text{GhostRider}} \) programs, but leave a proof as future work.
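The array placement policy of the memory bank allocation stage can be summarized in a few lines. This sketch covers only the choice among RAM, ERAM, and ORAM and the initial one-bank-per-array assignment; the function names, the input tuple format, and the `ORAM0`, `ORAM1`, ... naming are assumptions, and the hardware limit on the number of banks is deliberately not modeled.

```python
def choose_bank(is_secret, indexed_only_by_public):
    """Pick the kind of memory bank for an array."""
    if not is_secret:
        return "RAM"
    return "ERAM" if indexed_only_by_public else "ORAM"


def assign_banks(arrays):
    """arrays: list of (name, is_secret, indexed_only_by_public).

    Every array that needs ORAM initially gets its own logical bank.
    """
    banks = {}
    next_oram = 0
    for name, is_secret, public_index in arrays:
        bank = choose_bank(is_secret, public_index)
        if bank == "ORAM":
            bank = f"ORAM{next_oram}"
            next_oram += 1
        banks[name] = bank
    return banks
```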

3.5.3 Basic compilation

The next stage is basic compilation (translation). Expressions are compiled by loading relevant variables/data into registers, performing the computation, and then storing back the result. Statements are compiled idiomatically to match the structure expected by the type rules in Figure 3.7, Figure 3.8, and Figure 3.9 (with some work deferred to the padding stage).

Perhaps the most interesting part is handling variable accesses. Variables permanently resident in the scratchpad are loaded at the start of the program, and stored back at the end. Each read/write results in a \texttt{ldw}, to load a variable into a temporary register, and a \texttt{stw} to store back the result. Accesses to data (i.e., arrays) not permanently stored in the scratchpad will also require a \texttt{ldb} to load the relevant block into the scratchpad first and likewise a \texttt{stb} to store it back. A standard software cache, rather than a scratchpad, could eliminate repeated loads and stores of blocks from memory but could violate MTO. This is because a non-present block will induce memory traffic while a present block will not, and the presence/absence of traffic could be correlated with secret information. To avoid this, we have the compiler emit instructions that perform caching explicitly, using the scratchpad, with caching only enabled when in a public context, i.e., in a portion of code whose control flow does not depend on secret data. To support software-based caching, the compiler statically maps memory-resident data to particular scratchpad blocks, always loading the same data to the same block. Prior to doing so, and when safe, the compiler uses the \texttt{idb} instruction to check whether the relevant scratchpad block contains the memory block we want and loads directly from it, if so.

Supporting functions requires handling calling contexts and local variables. We do this with two stacks, one in RAM and one in ERAM. Function calls are only permitted in a public context, which means that normal stack allocation and deallocation reveal no information, so no ORAM stack is needed. When a function is called, the current scratchpad variable blocks are pushed on the relevant stacks. At the start of a function, we load the blocks that hold the local variables. Local variables implementing ORAM arrays are stored by reference, with the variable pointing to the actual array stored in ORAM. This array is deallocated when its variable is popped from the stack, i.e., when the function returns (returns, like calls, are allowed only in a public context).

The compiler is also responsible for emitting instructions that load code into the instruction scratchpad, as implicit instruction fetches could reveal information. (To bootstrap, the first code block is loaded automatically.) At the moment, our compiler emits code that loads the entire program into the scratchpad at the start; we leave to future work support for on-the-fly instruction scratchpad use.
3.5.4 Padding and register allocation

Both branches of a secret conditional must produce the same trace. We ensure they do so by inserting extra instructions in one or both branches according to the solution to the shortest common supersequence problem [31]. When matching the two branches, we must account for the memory trace and instruction execution times. Only \texttt{ldb} and \texttt{stb} emit memory events; we discuss these shortly. While our formalism assumes each instruction takes unit time, the reality is different (cf. Table 3.2): times are deterministic, but non-uniform. For single-cycle operations (e.g., 64b ALU ops), we pad with \texttt{nops}. For two-cycle \texttt{ldw} and \texttt{stw} instructions, we pad with two \texttt{nops}. For multiply and divide instructions, which take 70 cycles each, we could pad with 70 \texttt{nops} but this results in a large space overhead. As such, we pad both with the instruction \( r0 \leftarrow r0 \times r0 \), where \( r0 \) is always 0. For conditionals, we pad the not-taken branch with two \texttt{nops}, to account for the hardware-induced delay on the taken branch.
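To illustrate the role of the shortest common supersequence, the sketch below computes one for two event sequences using the textbook dynamic program over longest-common-subsequence lengths. It is not the compiler's padding pass: the function names and the string event encoding are assumptions, and real padding must also account for per-instruction timing as described above. Every event in the result that a branch lacks corresponds to an instruction that would be inserted into that branch.

```python
def shortest_common_supersequence(a, b):
    """Return a shortest sequence containing both a and b as subsequences."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:                      # shared event, emit once
            out.append(a[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:  # event only in a
            out.append(a[i])
            i += 1
        else:                                 # event only in b
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def is_subsequence(sub, seq):
    """True if sub can be obtained from seq by deleting elements."""
    it = iter(seq)
    return all(x in it for x in sub)
```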

Padding for \texttt{stb} and \texttt{ldb} requires instructions that generate matching trace events. An access to ORAM is the simplest to pad, since the adversary cannot distinguish a read from a write. We can load any block (e.g., the first block of the ORAM) into a dedicated “dummy” scratchpad block, i.e. this block is used for loading and saving dummy memory blocks only.

For RAM and ERAM, the address being accessed is visible, so we need to make sure that the equivalent padding accesses the same address. To do this, the compiler should insert further instructions to compute the address. These instructions can be derived from the symbolic value: (1) if the symbolic value is a constant, then insert an assign instruction; (2) if the symbolic value is a binary operation of two symbolic values, then insert instructions to compute the two symbolic values respectively, and then another instruction to compute the binary operation; and (3) if the symbolic value is a memory value, then insert instructions to compute the offset first, and then insert a \texttt{ldw} instruction.
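The three cases translate directly into a recursive code generator. The following is a sketch under assumptions: symbolic values are encoded as ints, `("bop", op, sv1, sv2)`, or `("mem", bank, block, offset)`; temporaries are named `t0`, `t1`, ...; and instructions are emitted as strings in an ad hoc syntax rather than the compiler's actual output format. It also assumes the scratchpad block named in a memory value is already loaded.

```python
import itertools


def emit(sv, code, fresh):
    """Append instructions computing sv to code; return the result register."""
    if isinstance(sv, int):                       # case (1): constant
        reg = next(fresh)
        code.append(f"{reg} <- {sv}")
        return reg
    if sv == "?":
        raise ValueError("an unknown symbolic value cannot be recomputed")
    if sv[0] == "bop":                            # case (2): binary operation
        _, op, left, right = sv
        r_left = emit(left, code, fresh)
        r_right = emit(right, code, fresh)
        reg = next(fresh)
        code.append(f"{reg} <- {r_left} {op} {r_right}")
        return reg
    if sv[0] == "mem":                            # case (3): memory value
        _, _bank, block, offset = sv
        r_off = emit(offset, code, fresh)
        reg = next(fresh)
        code.append(f"ldw {reg} <- {block}[{r_off}]")
        return reg
    raise ValueError(f"unrecognized symbolic value: {sv!r}")


def address_code(sv):
    """Return (register holding the address, list of emitted instructions)."""
    code = []
    fresh = (f"t{i}" for i in itertools.count())
    reg = emit(sv, code, fresh)
    return reg, code
```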

With instructions inserted to compute the address, we must emit either a load or a store depending on the instruction we are trying to match. For RAM, this instruction will always be a load because we perform padding in the H context, and the type system prevents writing to RAM. To mimic the \texttt{read}(l, k, sv) trace pattern, we first compute \(sv\) and then insert a \texttt{ldb} \(k \leftarrow l[r]\) instruction where \(r\) stores the value for \(sv\). Handling ERAM writes is challenging because we want the write to be a no-op but not appear to be so. To do this, we require the compiler to \textit{always} follow an ERAM \texttt{ldb} with a \texttt{stb} back to the same address. In doing so, the compiler also prevents the padded instruction from overwriting a dirty scratchpad block.

At the conclusion of the padding stage we perform standard register allocation to fill in actual registers for the temporaries we have used to this point.

### 3.6 Hardware Implementation

We implement our deterministic processor by modifying Rocket, a single-issue, in-order, 6-stage pipelined CPU developed at UC Berkeley [77]. Rocket implements the RISC-V instruction set [93] and is comparable to an ARM Cortex A5 CPU.
We modified the baseline processor to remove branch prediction logic (so that conditional branches are always not-taken) and to make each instruction execute in a fixed number of cycles. We describe the remaining changes below.

**Instruction-set Extension.** We customize RISC-V to add a single data transfer instruction that implements `ldb` and `stb` from the formalism. We do this using a Data Transfer accelerator (Figure 3.2) that attaches to the processor’s accelerator interface [88]. We also interface the Data Transfer accelerator with the x86-Linux host through Rocket’s control register file so that it can load an ELF-formatted binary into GhostRider’s memory and reset its processor. Once this is done, the host performs processor control register writes to initiate transfers from the co-processor memory to the code ORAM for the code and data sections of the binary. The first code block of a program is loaded into the instruction scratchpad to begin execution; if subsequent instruction blocks are needed they must be loaded explicitly.

**Scratchpads.** GhostRider has two scratchpads, one for code and one for data, each of which can hold eight 4KB blocks. The instruction scratchpad is implemented similarly to an 8-way set-associative cache, where each way contains one block. The accelerator transfers one block at a time to a specified way in the instruction scratchpad. Once a block has been written, the valid and tag bits for that block are updated. The architecture does not implement the `idb` instruction from the formalism; instead, the compiler uses the first 8 bytes of every block to remember its address.

**ORAM controller.** We implement ORAM by building on the Phantom ORAM controller [63] and implement an ORAM tree 13 levels deep (i.e., $2^{12}$ leaf buckets), with 4 blocks per bucket and an effective capacity of 64MB. ORAM controllers include an on-chip \textit{stash} to temporarily buffer ORAM blocks before they are written out to memory. We set this stash to be 128 blocks. The Phantom design (and likewise, Ascend’s [27–29]) treats the stash as a cache for ORAM lookups, which is safe when handling timing channels by controlling the memory access rate. GhostRider mitigates timing channels by having the compiler enforce MTO while assuming that events take the same time. As such, we modify Phantom’s design to generate an access to a random leaf in case the requested block is found in the stash, to ensure uniform access times.
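As a sanity check on the stated tree parameters, the arithmetic below relates the tree shape to its capacity. It assumes a complete binary tree, 4KB blocks (the scratchpad block size), and that "64MB" means \( 64 \cdot 2^{20} \) bytes; the resulting ratio of roughly one half between effective and raw capacity is an observation from this calculation, not a figure stated in the text.

```python
LEVELS = 13
BLOCKS_PER_BUCKET = 4
BLOCK_BYTES = 4 * 1024            # assumption: 4KB ORAM blocks

leaf_buckets = 2 ** (LEVELS - 1)  # buckets on the lowest level
total_buckets = 2 ** LEVELS - 1   # buckets in a complete binary tree
raw_blocks = total_buckets * BLOCKS_PER_BUCKET
raw_mib = raw_blocks * BLOCK_BYTES / 2 ** 20

effective_blocks = 64 * 2 ** 20 // BLOCK_BYTES  # blocks in 64MB
utilization = effective_blocks / raw_blocks

print(f"{leaf_buckets} leaf buckets, {total_buckets} buckets in total")
print(f"raw capacity: {raw_mib:.2f} MiB for {raw_blocks} block slots")
print(f"effective/raw ratio: {utilization:.3f}")
```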

\textbf{FPGA Implementation.} GhostRider is implemented on one of Convey HC-2ex’s [21] four Xilinx Virtex-6 LX760 FPGAs. We measure hardware design size in terms of FPGA \textit{slices} for logic and \textit{Block RAMs} for on-chip memory. A slice comprises four 6-input, 2-output lookup tables (implementing configurable logic) and eight flip-flops (as storage elements) in addition to multiplexers, while each BRAM on Virtex-6 is either an 18Kb or 36Kb SRAM with up to two configurable read-write ports. The GhostRider prototype uses 47,357 such slices (39% of total) to implement both the CPU and the ORAM controller, and requires 685 of 1440 18Kb BRAMs (47.5%). Table 3.1 shows how these resources are broken up between the Rocket CPU and the ORAM controller, with the remaining resources being used by Convey HC-2ex’s boilerplate logic to interface with the x86 core and DRAM. Note that this breakdown is a synthesis estimate before place and route.

Our prototype currently supports one data ORAM bank, one code ORAM bank, and one ERAM bank. We do not implement encryption (it is a small, fixed cost and uninteresting in terms of performance trends), and do not have separate DRAM; all public data is stored in ERAM when running on the hardware.

The Convey machine requires the hardware design to be run at 150 MHz while our ORAM controller prototype currently synthesizes to a maximum operating frequency of 140MHz. Pending further optimization to meet 150 MHz timing, we run both the CPU and the ORAM controller in a 75 MHz clock domain, and use asynchronous FIFOs to connect the ORAM controller to the DDR DRAM controllers.

**GhostRider simulator timing model.** In addition to demonstrating feasibility with our hardware prototype, we study the effect of GhostRider’s compiler on alternate, more efficient ORAM configurations, e.g., Phantom at 150MHz [63] with two ORAM banks and a distinct (non-encrypting) DRAM bank. Hence we generate a timing model for both the modified processor and ORAM banks based on Phantom’s hardware implementation [63], and incorporate the timing model into an ISA-level emulator for the RISC-V architecture; the model is shown in Table 3.2.
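The latencies reported in Table 3.2 lend themselves to a back-of-the-envelope cycle estimate for a given mix of events. The sketch below is not the ISA-level emulator described above: the event names and the `estimate_cycles` helper are assumptions, and the model simply sums per-event latencies with no overlap between events.

```python
# Latencies in cycles, taken from Table 3.2.
LATENCY = {
    "alu": 1,
    "jump_taken": 3,
    "jump_not_taken": 1,
    "mul": 70,
    "div": 70,
    "scratchpad_ldst": 2,
    "dram_block": 634,
    "eram_block": 662,
    "oram_block": 4262,
}


def estimate_cycles(counts):
    """Sum latency * count over a dict mapping event name -> occurrences."""
    return sum(LATENCY[event] * n for event, n in counts.items())


# Same on-chip work, but the block comes from ORAM versus ERAM.
with_oram = estimate_cycles({"alu": 10, "oram_block": 1})
with_eram = estimate_cycles({"alu": 10, "eram_block": 1})
print(with_oram, with_eram)
```

Under this simple model a single ORAM block access costs roughly 6.4 times an ERAM block access, which is why placing arrays in ERAM where possible matters.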

<table>
<thead>
<tr>
<th></th>
<th>Slices</th>
<th>BRAMs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rocket</td>
<td>9287 (8.8%)</td>
<td>36 (10.5%)</td>
</tr>
<tr>
<td>ORAM</td>
<td>12845 (12.2%)</td>
<td>211 (61.5%)</td>
</tr>
</tbody>
</table>

Table 3.1: FPGA synthesis results on Convey HC-2ex.

### 3.7 Empirical Evaluation

**Programs.** Table 3.3 lists all the programs we use in our evaluation. These programs range from standard algorithms to data structures and include predictable, partially predictable, and predominantly irregular (data-driven) memory access patterns.

<table>
<thead>
<tr>
<th>Feature</th>
<th>Latency (# cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>64b ALU</td>
<td>1</td>
</tr>
<tr>
<td>Jump taken/not taken</td>
<td>3/1</td>
</tr>
<tr>
<td>64b Multiply/Divide</td>
<td>70/70</td>
</tr>
<tr>
<td>Load/Store from Scratchpad</td>
<td>2</td>
</tr>
<tr>
<td>DRAM (4KB access)</td>
<td>634</td>
</tr>
<tr>
<td>Encrypted RAM (4KB access)</td>
<td>662</td>
</tr>
<tr>
<td>ORAM 13 levels (4KB block)</td>
<td>4262</td>
</tr>
</tbody>
</table>

Table 3.2: Timing model for GhostRider simulator.

Figure 3.10: Simulator-based execution time results of GhostRider.

Execution time results. We present measurements both for the simulator and for the actual FPGA hardware, starting with the former because the simulator allows us to evaluate the benefits from splitting memory into ERAM and ORAM banks.
<table>
<thead>
<tr>
<th>Name</th>
<th>Brief Description</th>
<th>Input Size (KB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>sum</td>
<td>Summing up all positive elements in an array</td>
<td>$10^3$</td>
</tr>
<tr>
<td>findmax</td>
<td>Find the max element in an array</td>
<td>$10^3$</td>
</tr>
<tr>
<td>heappush</td>
<td>Insert an element into a min-heap</td>
<td>$10^3$</td>
</tr>
<tr>
<td>perm</td>
<td>Computing a permutation executing $a[b[i]] = i$ for all $i$</td>
<td>$10^3$</td>
</tr>
<tr>
<td>histogram</td>
<td>Compute the number of occurrences of each last digit</td>
<td>$10^3$</td>
</tr>
<tr>
<td>dijkstra</td>
<td>Single-source shortest path</td>
<td>$10^3$</td>
</tr>
<tr>
<td>search</td>
<td>Binary search algorithm</td>
<td>$1.7 \times 10^4$</td>
</tr>
<tr>
<td>heappop</td>
<td>Pop the minimal element from a min-heap</td>
<td>$1.7 \times 10^4$</td>
</tr>
</tbody>
</table>

Table 3.3: Benchmark programs for GhostRider, organized into programs with predictable, partially predictable, and data-dependent memory access patterns (in order from top).

Non-secure
- Non-secure program: all variables in ERAM, no padding, and uses scratchpad.

Baseline
- Secure baseline: all secret variables in a single ORAM, no scratchpad.

Split ORAM
- Variables can be split across multiple ORAM banks, or placed in ERAM. Performs padding. No scratchpad.

Final
- Scratchpad on top of Split ORAM.

Figure 3.11: Legends of Figure 3.10

v. additionally using a scratchpad. We also discuss the execution time results by categorizing them based on the regularity in the programs’ access patterns.

**Simulator-based results.** Figure 3.10 depicts the slowdown of various configurations relative to a non-secure configuration that simply stores data in ERAM and employs the scratchpad. The legends are explained in Figure 3.11. Our non-secure baseline uses a scratchpad instead of a hardware cache in order to isolate the cost of MTO/ORAM. The secure Baseline configuration places all secret variables in a single ORAM, while Split ORAM employs the GhostRider optimization of using ERAM and multiple ORAM banks, and Final further adds the (secure) use of a scratchpad.

Three out of eight programs (sum, findmax, and heappush) have a predictable access pattern, and the secure program generated by GhostRider relies mainly on ERAM. Hence, each MTO program (Final) incurs anywhere from almost no slowdown to a $3.08 \times$ slowdown in comparison to its non-secure counterpart (Non-secure), and is correspondingly faster than Baseline by $5.85 \times$ to $9.03 \times$.

For perm, histogram, and dijkstra, which have partially predictable and partially sensitive memory access patterns, our compiler attempts to place sensitive arrays inside both ERAM and ORAM and also favors splitting into several smaller ORAM banks without breaking MTO. As shown in Figure 3.10, for such programs, Final can achieve a $1.30 \times$ to $1.85 \times$ speedup over Baseline (with $7.56 \times$ to $10.68 \times$ slowdown compared to Non-secure, respectively).

For search and heappop, which have predominantly sensitive memory access patterns, the speedup of Final over Baseline is not as significant, i.e. $1.07 \times$ and $1.12 \times$ respectively, and is due mostly to the usage of two ORAMs to store arrays instead of a single ORAM.

Examining the impact of the scratchpad on the results, we can see that for the first six programs, Final reduces execution time compared to Split ORAM by a factor from $1.05 \times$ up to $2.23 \times$. For search and heappop, the scratchpad provides no benefit because for these programs all data is allocated in ORAM, as array indices are secret (so the access pattern is sensitive), and our type system disallows caching of ORAM blocks. The reason is that the presence of the data in the cache could reveal something about the secret indices. A more sophisticated type system, or a relaxation of MTO, could do better; we plan to explore such improvements in future work.

Figure 3.12: FPGA based execution time results: Slowdown of **Baseline** and **Final** versions compared to non-secure version of the program. Note that unlike Figure 3.10, **Final** uses only a single ORAM bank and conflates ERAM and DRAM (cf. Section 3.6).

**FPGA-based results.** For the FPGA we run the same set of programs as in Table 3.3, but restrict the input size to be around 100 KB, due to limitations of our prototype. Speedups of **Final** over the secure **Baseline** follow a trend similar to the simulator, as shown in Figure 3.12. Regular programs have speedups in the range of $4.33 \times$ (for **heappush**) to $8.94 \times$ (for **findmax**). Partially regular programs like **perm** and **histogram** get a speedup of $1.46 \times$ and $1.3 \times$ respectively. Finally, irregular programs such as **search** and **heappop** see very little improvement ($1.08 \times$ and $1.02 \times$ respectively).

Differences between the simulator and hardware numbers can be attributed to multiple factors. First, the simulator imperfectly models the Convey memory system’s latency, always assuming the worst case, and thus slowdowns compared to the non-secure baseline are often worse on the simulator (cf. `heappop` and `heappush`).

Second, the timing of certain hardware operations is different on the prototype and the simulator (where we consider the latter to be aspirational, per the end of Section 3.6). In particular, per Table 3.2, the simulator models access latency for ORAM as 4262 cycles and ERAM as 662 cycles, accounting for both reading data blocks from DRAM and moving the chosen 4KB block into the scratchpad BRAMs on the FPGA. On the hardware, ORAM and ERAM latencies are 5991 and 1312 cycles, respectively, measured using performance counters in the hardware design. The higher ERAM and ORAM access times reduce the slowdown on the simulator by amplifying the benefit of the scratchpad, which is used by the non-secure baseline, but not by the secure baseline (cf. `findmax` and `sum`).
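As a rough sanity check on these latencies, the Python sketch below computes the ORAM-to-ERAM access-latency ratio for the simulator and for the prototype, using only the cycle counts quoted above. The dictionary and function names are illustrative assumptions and are not part of GhostRider.

```python
# Access latencies in cycles, as quoted in the text (Table 3.2 for the
# simulator, performance-counter measurements for the hardware prototype).
SIMULATOR = {"oram": 4262, "eram": 662}
HARDWARE = {"oram": 5991, "eram": 1312}

def oram_to_eram_ratio(latency):
    """Return how many times slower one ORAM access is than one ERAM access."""
    return latency["oram"] / latency["eram"]

sim_ratio = oram_to_eram_ratio(SIMULATOR)
hw_ratio = oram_to_eram_ratio(HARDWARE)
print(f"simulator: {sim_ratio:.2f}x, hardware: {hw_ratio:.2f}x")
```

With these numbers, a single ORAM access costs roughly 6.4 ERAM accesses in the simulator model and roughly 4.6 on the prototype, so the relative payoff of moving an access from ORAM to ERAM is smaller on the hardware.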

Third, the benefit of using the scratchpad can differ depending on the input size. This effect is particularly pronounced for Dijkstra, where the ratio of secure to non-secure baseline execution is smaller for the hardware than for the simulator. The reason is that the hardware experiment uses a smaller input that fills only about 1/5 of a scratchpad block. Hence, in the non-secure baseline, the block is reloaded after relatively fewer accesses, resulting in a relatively greater number of block loads and thus bringing the performance of the non-secure program closer to that of the secure baseline.

Finally, note that the simulator’s use of multiple ORAM banks, and DRAM with different timings, is a source of differences, but this effect is dwarfed by the other effects.

### 3.8 Conclusion

We have presented the first complete memory trace oblivious system, GhostRider, comprising a novel compiler, type system, and hardware architecture. The compiled programs not only provably satisfy memory trace obliviousness, but also exhibit up to nearly an order-of-magnitude performance gain in comparison with placing all variables in a single ORAM bank. By enabling compiler analyses to target a joint ERAM-ORAM memory system, and by employing a compiler-controlled scratchpad, this work opens up several performance optimization opportunities in tuning bank configurations (size and access granularity) and, on a broader level, in co-designing data structures and algorithms for a heterogeneous yet oblivious memory hierarchy.

Chapter 4: RAM-model Secure Computation

Secure computation is a cryptographic technique allowing mutually distrusting parties to make collaborative use of their local data without harming privacy of their individual inputs. Since the first system for general-purpose secure two-party computation was built in 2004 [65], efficiency has improved substantially [11, 46].

Almost all previous implementations of general-purpose secure computation assume the underlying computation is represented as a circuit. While theoretical developments using circuits are sensible (and common), typical programs are mostly expressed in the von Neumann-style Random Access Machine (RAM) model. Compiling a RAM-model program into an efficient circuit-based representation can be challenging, especially when handling dynamic memory accesses to an array in which the memory location being read or written depends on secret inputs. Existing program-to-circuit compilers typically make an entire copy of the array upon every dynamic memory access, thus resulting in a huge circuit when the data size is large.
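To see why secret-dependent indexing is expensive in the circuit model, consider the following Python sketch of a linear-scan read. Because a circuit cannot branch on a secret index, the read touches every element and selects the wanted one arithmetically. This is a plain-Python illustration of the cost only, not the code of any particular compiler; `linear_scan_read` is a hypothetical name.

```python
def linear_scan_read(array, secret_index):
    """Read array[secret_index] while touching every element exactly once.

    The positions visited are 0, 1, ..., n-1 regardless of secret_index,
    which mirrors the O(n) circuit cost per dynamic memory access.
    """
    result = 0
    touched = []
    for position, value in enumerate(array):
        touched.append(position)
        is_match = int(position == secret_index)  # 1 for the wanted slot, else 0
        result = is_match * value + (1 - is_match) * result  # arithmetic select
    return result, touched

value, touched = linear_scan_read([10, 20, 30, 40], 2)
```

The visited positions are identical for every index, so nothing leaks through the access pattern, but each access costs work proportional to the array length.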

To address this limitation, recent research [38] shows that secure computation ORAM can be used to compile a dynamic memory access into a circuit with poly-logarithmic size while preventing information leakage through memory-access patterns. We refer to such an approach as a RAM-model secure computation (RAM-SC) approach, and Gordon et al. [38] observed that RAM-SC exhibits a significant advantage in the setting of repeated sublinear-time queries (e.g., binary search) on a large database, where an initial setup cost can be amortized over subsequent repeated queries.

**Our Contributions.** We continue work on secure computation in the RAM model, with the goal of providing a complete system that takes a program written in a high-level language and compiles it to a protocol for secure two-party computation of that program.\(^1\) In particular, we

- Define an intermediate representation (which we call SCVM) suitable for efficient two-party RAM-model secure computation;

- Develop a *type system* ensuring that any well-typed program will generate a RAM-SC protocol secure in the semi-honest model, if all subroutines are implemented with a protocol secure in the semi-honest model.

- Build an *automated compiler* that transforms programs written in a high-level language into a secure two-party computation protocol, and integrate compile-time optimizations crucial for improving performance.

We use our compiler to compile several programs including Dijkstra’s shortest-path algorithm, KMP string matching, binary search, and more. For moderate data sizes (up to the order of a million elements), our evaluation shows a speedup of 1–2 orders of magnitude as compared to standard circuit-based approaches for securely computing these programs. We expect the speedup to be even greater for larger data sizes.

\(^1\)Note that Gordon et al. [38] do not provide such a compiler; they only implement RAM-model secure computation for the particular case of binary search.

**SCVM** is our first attempt to demonstrate the feasibility of optimizing secure computation in the RAM model using ORAMs. In the next chapter, we extend **SCVM** to build **ObliVM**, which focuses more on richer expressive power and ease of programming while achieving state-of-the-art performance for secure computation.

This chapter is based on a paper that I co-authored with Michael Hicks, Yan Huang, Jonathan Katz, and Elaine Shi [59]. I developed the formalism and the proof with the help of Michael Hicks, Elaine Shi, and Jonathan Katz. I developed the compiler, which implements the optimization and the type checker, and emits code that is runnable over a secure computation backend. I conducted experiments to show the compiler’s effectiveness with the help of Yan Huang.

### 4.1 Technical Highlights

As explained in Sections 4.2 and 4.3, the standard implementation of RAM-SC
entails placing all data and instructions inside a single Oblivious RAM. The secure
evaluation of one instruction then requires *i*) fetching instruction and data from
ORAM; and *ii*) securely executing the instruction using a universal next-instruction
circuit (similar to a machine’s ALU). This approach is costly since each step must
be done using a secure-computation sub-protocol.

**An efficient representation for RAM-SC.** Our type system and **SCVM** intermediate representation are capable of expressing RAM-SC tasks more efficiently by avoiding expensive next-instruction circuits and minimizing ORAM operations when there is no risk to security. These language-level capabilities allow our compiler to apply compile-time optimizations that would otherwise not be possible. Thus, we not only obtain better efficiency than circuit-based approaches, but we also achieve order-of-magnitude performance improvements in comparison with straightforward implementations of RAM-SC (see Section 4.2).

**Program-trace simulatability.** A well-typed program in our language is guaranteed to be both *instruction-trace oblivious* and *memory-trace oblivious*. Instruction-trace obliviousness ensures that the values of the program counter during execution of the protocol do not leak information about secret inputs other than what is revealed by the output of the program. As such, the parties can avoid securely evaluating a universal next-instruction circuit, but can instead simply evaluate a circuit corresponding to the current instruction. Memory-trace obliviousness ensures that memory accesses observed by one party during the protocol’s execution similarly do not leak information about secret inputs other than what is revealed by the output. In particular, if access to some array does not depend on secret information (e.g., it is part of a linear scan of the array), then the array need not be placed into ORAM.

We formally define the security property ensured by our type system as *program-trace simulatability*. We define a mechanism for compiling programs to protocols that rely on certain ideal functionalities. We prove that if every such ideal functionality is instantiated with a semi-honest secure protocol computing that functionality, then any well-typed program compiles to a semi-honest secure protocol computing that program.

**Additional language features.** SCVM supports several other useful features. First, it permits reactive computations by allowing output not only at the end of the program’s execution, but also while it is in progress. Our notion of program-trace simulatability also fits this reactive model of computation.

SCVM also integrates state-of-the-art optimization techniques that have been suggested previously in the literature. For example, we support public, local, and secure modes of computation, a technique recently explored (in the circuit model) by Kerschbaum [52] and Rastogi et al. [76]. Our compiler can identify and encode portions of computation that can be safely performed in the clear or locally by one of the parties, without incurring the cost of a secure-computation sub-protocol.

Our SCVM intermediate representation generalizes circuit-model approaches. For programs that do not rely on ORAM, our compiler effectively generates an efficient circuit-model secure-computation protocol. This chapter focuses on the design of the intermediate representation language and type system for RAM-model secure computation, as well as the compile-time optimization techniques we apply. Our work is complementary to several independent, ongoing efforts focused on improving the cryptographic back end.

### 4.2 Background: RAM-Model Secure Computation

In this section, we review some background for RAM-model secure computation. Our treatment is adapted from that of Gordon et al. [38], with notation adjusted for our purposes.

A key underlying building block of RAM-model secure computation is Oblivious RAM (ORAM). ORAM is a cryptographic primitive that hides memory-access patterns by randomly reshuffling data in memory. With ORAM, each memory read or write operation incurs poly log \( n \) actual memory accesses.

Existing RAM-model secure computation, which we refer to as straightforward RAM-SC, employs the following scheme. The entire memory, denoted \( \text{mem} \) and containing both program instructions and data, is placed in ORAM, and the ORAM is secret-shared between the two participating parties, e.g., using a simple XOR-based secret-sharing scheme. With ORAM, a memory access thus requires each party to access the elements of their respective arrays at pseudorandom locations (the addresses are dictated by the ORAM algorithm), and the value stored at each position is then obtained by XORing the values read by each of the parties. Alternatively, the server can hold an encryption of the ORAM array, and the client holds the key. The latter was done by Gordon et al. to ensure that one party holds only \( O(1) \) state. All CPU states are also secret-shared between the two parties.
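The XOR-based sharing mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: values fit in 32 bits, and the names `share` and `reconstruct` are hypothetical. It covers only the sharing of stored values, not the ORAM address shuffling.

```python
import secrets

def share(value, bits=32):
    """Split value into two XOR shares; each share alone looks uniformly random."""
    alice_share = secrets.randbits(bits)
    bob_share = alice_share ^ value
    return alice_share, bob_share

def reconstruct(alice_share, bob_share):
    """Recover the value by XORing the two shares."""
    return alice_share ^ bob_share

memory = [7, 42, 1000]
alice_mem, bob_mem = zip(*(share(v) for v in memory))  # one array per party
recovered = [reconstruct(a, b) for a, b in zip(alice_mem, bob_mem)]
```

Each party stores one of the two arrays; a stored value is recovered only when the two reads at the same position are combined.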

Straightforward RAM-SC proceeds as follows. Each step of the computation must be done using some secure computation subprotocol. In particular, SC-U is a secure computation protocol that securely evaluates the universal next-instruction circuit, and SC-ORAM is a secure computation protocol that securely evaluates the ORAM algorithm. For ORAM.Read, each party supplies a secret share of the \( \text{raddr} \), and during the course of the protocol, the ORAM.Read protocol will emit obfuscated

| Scenario | Potential benefits of RAM-model secure computation |
| --- | --- |
| 1. Repeated sublinear queries over a large dataset (e.g., binary search, range query, shortest path query) | Amortize preprocessing cost over multiple queries; achieve *sublinear* amortized cost per query |
| 2. One-time computation over a large dataset | Avoid paying $O(n)$ cost per dynamic memory access |

Table 4.1: Two main scenarios and advantages of RAM-model secure computation

**Deployment scenarios and threat model for RAM-model secure computation.** SCVM presently supports a two-party semi-honest protocol. We consider the following primary deployment scenarios:

1. Two parties, Alice and Bob, each come with their own private data and engage in a two-party protocol. For example, Goldman Sachs and Bridgewater would like to perform joint computation over their private market research data to learn market trends.

2. One or more users break their private data (e.g., genomic data) into secret shares, and split the shares among two non-colluding cloud providers. The shares at each cloud provider are completely random and reveal no information. To perform computation over the secret-shared data, the two cloud providers engage in a secure 2-party computation protocol.

3. Similar to the above, but the two servers are within the same cloud or under the same administration. This can serve to mitigate Advanced Persistent Threats or insider threats, since compromise of a single machine will no longer lead to the breach of private data. Similar architectures have been explored in commercial products such as RSA’s distributed credential protection [3].

In the first scenario, Alice and Bob should not learn anything about each other’s data besides the outcome of the computation. In the second and third scenarios, the two servers should learn nothing about the users’ data other than the outcome of the computation – note that the outcome of the computation can also be easily hidden simply by XORing the outcome with a secret random mask (like a one-time pad). We assume that the program text (i.e., code) is public.

With respect to the types of applications, while Gordon et al. describe RAM-model secure computation mainly for the amortized setting, where repeated computations are carried out starting from a single initial dataset, we note that RAM-model secure computation can also be meaningful for one-time computation on large datasets, since a straightforward RAM-to-circuit compiler would incur linear (in the size of the dataset) overhead for every dynamic memory access whose address depends on sensitive inputs. Table 4.1 summarizes the two main scenarios for RAM-model secure computation, and potential advantages of using the RAM model in these cases.

### 4.3 Technical Overview: Compiling for RAM-Model Secure Computation

This section describes our approach to optimizing RAM-model secure computation. Our key idea is to use static analysis during compilation to minimize the use of heavyweight cryptographic primitives such as garbled circuits and ORAM.

### 4.3.1 Instruction-Trace Obliviousness

The standard RAM-model secure computation protocol described in Section 4.2 is relatively inefficient because it requires a secure-computation sub-protocol to compute the universal next-instruction circuit $U$. This circuit has large size, since it must interpret every possible instruction. In our solution, we will avoid relying on a universal next-instruction circuit, and will instead arrange things so that we can securely evaluate instruction-specific circuits.

Note that it is not secure, in general, to reveal what instruction is being carried out at each step in the execution of some program. As a simple example, consider a branch over a secret value $s$:

$$\text{if}(s) \: x[i] := a+b; \ \text{else} \: x[i] := a-b$$

Depending on the value of $s$, a different instruction (i.e., add or subtract) will be executed. To mitigate such an implicit information leak, our compiler transforms a program to an instruction-trace oblivious counterpart, i.e., a program whose program-counter value (which determines which instruction will be executed next) does not depend on secret information. The key idea is to use a **mux** operation to rewrite a secret if-statement. For example, the above code can be refactored to the following:

```plaintext
t1 := s;
t2 := a+b;
t3 := a-b;
t4 := mux(t1, t2, t3);
x[i] := t4
```

At every point during the above computation, the instruction being executed is pre-determined, and so does not leak information about sensitive data. Instruction-trace obliviousness is similar to *program-counter security* proposed by Molnar et al. [67] (for a different application).
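The rewrite can be checked with a small Python sketch that runs the branching form and the mux form side by side. The arithmetic definition of `mux` below is an assumption made for illustration: it selects its second argument when the condition is 1, matching the if/else rewrite above. In the actual protocol the selection is a circuit evaluated under secure computation.

```python
def mux(cond, if_true, if_false):
    """Select if_true when cond is 1 and if_false when cond is 0, without branching."""
    return cond * if_true + (1 - cond) * if_false

def branching(s, a, b):
    # Original form: which instruction runs (add or subtract) depends on s.
    return a + b if s else a - b

def oblivious(s, a, b):
    # Rewritten form: the same steps run in the same order for every value of s.
    t1 = s
    t2 = a + b
    t3 = a - b
    t4 = mux(t1, t2, t3)
    return t4

results = [(branching(s, 5, 3), oblivious(s, 5, 3)) for s in (0, 1)]
```

Both forms compute the same value for either choice of `s`; only the second has an instruction sequence that is independent of `s`, at the price of always computing both `a+b` and `a-b`.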

### 4.3.2 Memory-Trace Obliviousness

Using ORAM for memory accesses is also a heavyweight operation in RAM-model secure computation. The standard approach is to place *all* memory in a single ORAM, thus incurring $O(\text{poly log } n)$ cost per data operation, where $n$ is a bound on the size of the memory.

We have demonstrated in the context of securing remote execution against physical attacks (Chapters 2 and 3) that not all access patterns of a program are sensitive. For example, a **findmax** program that sequentially scans through an array to find the maximum element has predictable access patterns that do not depend on sensitive inputs. We propose to apply a similar idea to the context of RAM-model secure computation. Our compiler performs static analysis to detect safe memory accesses that do not depend on secret inputs. In this way, we can avoid using ORAM when the access pattern is independent of sensitive inputs. It is also possible to store various subsets of memory (e.g., different arrays) in different ORAMs, when information about which portion of memory (e.g., which array) is being accessed does not depend on sensitive information.
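The distinction between safe and sensitive accesses can be made concrete by recording which indices a program reads. The Python sketch below is an illustration only: the helper names are hypothetical and the recorded list stands in for the memory trace an observer would see.

```python
def findmax_trace(array):
    """Scan sequentially; return the maximum and the list of indices read."""
    trace, best = [], array[0]
    for i in range(len(array)):
        trace.append(i)
        best = max(best, array[i])
    return best, trace

def lookup_trace(array, secret_index):
    """Read one secret-dependent location; return the value and indices read."""
    return array[secret_index], [secret_index]

_, trace_a = findmax_trace([3, 9, 4])
_, trace_b = findmax_trace([8, 1, 2])
_, trace_c = lookup_trace([3, 9, 4], 0)
_, trace_d = lookup_trace([3, 9, 4], 2)
same_scan_trace = trace_a == trace_b      # the scan reveals nothing about the data
same_lookup_trace = trace_c == trace_d    # the lookup reveals the secret index
```

The scan produces the same trace for any array of a given length, so its array need not live in ORAM, whereas the lookup's trace differs with the secret index and therefore needs ORAM protection.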

### 4.3.3 Mixed-Mode Execution

We also use static analysis to partition a program into code blocks, and then for each code block use either a public, local, or secure mode of execution (described next). Computation in public or local modes avoids heavyweight secure computation. In the intermediate language, each statement is labeled with its mode of execution.

**Public mode.** Statements computing on publicly-known variables or variables that have been declassified in the middle of program execution can be performed by both parties independently, without having to resort to a secure-computation protocol. Such statements are labeled P. For example, the loop iterators (in lines 1, 3, 10) in Dijkstra’s algorithm (see Figure 4.2) do not depend on secret data, and so each party can independently compute them.

**Local mode.** For statements computing over Alice’s variables, public variables, or previously declassified variables, Alice can perform the computation independently without interacting with Bob (and vice versa). Here we crucially rely on the fact that we assume semi-honest behavior. Alice-local statements are labeled A, and Bob-local statements are labeled B.

    for(int i = 0; i < n; ++i) {
        int bestj = -1, bestdis = -1;
        for(int j = 0; j < n; ++j) {
            if(!vis[j] && (bestj < 0 || dis[j] < bestdis)) {
                bestj = j;
                bestdis = dis[j];
            }
        }
        vis[bestj] = 1;
        for(int j = 0; j < n; ++j) {
            if(!vis[j] && (bestdis + e[bestj][j] < dis[j]))
                dis[j] = bestdis + e[bestj][j];
        }
    }

Figure 4.1: Dijkstra’s shortest distance algorithm in source (Part)

**Secure mode.** All other statements that depend on variables that must be kept secret from both Alice and Bob will be computed using secure computation, making ORAM accesses along the way if necessary. Such statements are labeled O (for “oblivious”).
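The three modes above can be approximated by a simple rule over the labels of the variables a statement touches. The Python sketch below is a simplification made for illustration: `execution_mode` is a hypothetical helper, and the compiler's actual analysis may be more refined than this rule.

```python
def execution_mode(labels):
    """Pick P, A, B, or O for a statement from the labels of its variables.

    labels is an iterable of the strings "P", "A", "B", or "O".
    """
    present = set(labels)
    if present <= {"P"}:
        return "P"  # only public data: both parties compute it independently
    if present <= {"P", "A"}:
        return "A"  # only Alice's and public data: Alice computes locally
    if present <= {"P", "B"}:
        return "B"  # only Bob's and public data: Bob computes locally
    return "O"      # anything else requires secure computation

cases = (["P", "P"], ["A", "P"], ["B"], ["A", "B"], ["O", "P"])
modes = [execution_mode(case) for case in cases]
```

A statement mixing Alice's and Bob's data, or touching any secret-shared variable, falls through to secure mode, which is the only mode that pays for a secure-computation sub-protocol.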

### 4.3.4 Example: Dijkstra’s Algorithm

In Figure 4.2, we present a complete compilation example for part of Dijkstra’s algorithm in Figure 4.1. Here one party, Alice, has a private graph represented by a pairwise edge-weight array e[ ][ ] and the other party, Bob, has a private source/destination pair. Bob wishes to compute the shortest path between his source and destination in Alice’s graph. The figure shows the code that computes shortest paths (Bob’s inputs are elided).

Figure 4.2: Compilation example: Part of Dijkstra’s shortest-path algorithm. The code on the left is compiled to the annotated code on the right. Array variable $e$ is Alice’s local input array containing the graph’s edge weights; Bob’s input, a source/destination pair, is not used in this part of the algorithm. Array variables $vis$ and $orame$ are placed in ORAMs. Array variable $dis$ is placed in non-oblivious (but secret-shared) memory. (Prior to the shown code, $vis$ is initialized to all zeroes except that $vis[\text{source}]$, where $\text{source}$ is Bob’s input, is initialized to 1, and $dis[i]$ is initialized to $e[\text{source}][i]$.) Variables $n$, $i$, $j$ and others boxed in white background are public variables. All other variables are secret-shared between the two parties.

Our specific implementation of Dijkstra’s algorithm uses three arrays: a dis array, which keeps track of the current shortest distance from the source to any other node; an edge-weight array orame, which is initialized from Alice’s local array e; and an indicator array vis, denoting whether each node has been visited. In this case, our compiler places arrays vis and orame in separate ORAMs, but does not place array dis in ORAM since access to dis always follows a sequential pattern.

Note that parts of the algorithm can be computed publicly. For example, all the loop iterators are public values; therefore, loop iterators need not be secret-shared, and each party can independently compute the current loop iteration. The remaining parts of the program all require ORAM accesses; therefore, our compiler annotates these instructions to be run in secure mode, and generates equivalent instruction- and memory-trace oblivious target code.
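As a rough illustration of what an instruction-trace oblivious version of the first inner loop of Figure 4.1 could look like, the Python sketch below replaces the secret branch with mux selections while keeping the loop iterator public. It is a hand-written sketch under assumptions, not the compiler's output: `mux` and `select_next` are hypothetical names, and in the real protocol the condition and selections are evaluated under secure computation.

```python
def mux(cond, if_true, if_false):
    """Select if_true when cond is 1 and if_false when cond is 0."""
    return cond * if_true + (1 - cond) * if_false

def select_next(vis, dis):
    """Find the unvisited node with the smallest distance.

    The loop bound and iterator j are public, so dis[j] and vis[j] are read
    in a fixed sequential order; both updates run on every iteration.
    """
    bestj, bestdis = -1, -1
    for j in range(len(dis)):
        take = int((not vis[j]) and (bestj < 0 or dis[j] < bestdis))
        bestj = mux(take, j, bestj)
        bestdis = mux(take, dis[j], bestdis)
    return bestj, bestdis

bestj, bestdis = select_next([1, 0, 0, 0], [0, 5, 2, 9])
```

The subsequent update `vis[bestj] = 1` indexes `vis` with the secret value `bestj`, which is consistent with the text placing `vis` in ORAM while `dis`, read only at public positions here, stays outside it.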

### 4.4 SCVM Language

This section presents SCVM, our language for RAM-model secure computation, and presents our formal results.

In Section 4.4.1, we present SCVM’s formal syntax. In Section 4.4.2, we give a formal, ideal-world semantics for SCVM that forms the basis of our security theorem. Informally, each party provides their inputs to an ideal functionality \( \mathcal{F} \) that computes the result and returns to each party its result and a trace of events it is allowed to see; these events include instruction fetches, memory accesses, and declassification events, which are results computed from both parties’ data. Section 4.4.3 formally defines our security property, \( \Gamma\text{-simulatability} \). Informally, a program is secure if each party, starting with its own inputs, memory, the program code, and its trace of declassification events, can simulate (in polynomial time) its observed instruction traces and memory traces without knowing the other party’s data. We present a type system for SCVM programs in Section 4.4.4, and in Theorem 3 prove that well-typed programs are \( \Gamma\text{-simulatable} \). Theorem 4 additionally shows that well-typed programs will not get stuck, e.g., because one party tries to access memory unavailable to it. Finally, in Section 4.4.5 we define a hybrid-world functionality that more closely models SCVM’s implemented semantics using ORAM, garbled circuits, etc., and prove that for \( \Gamma\text{-simulatable} \) programs, the hybrid-world protocol securely implements the ideal functionality. The formal results are summarized in Figure 4.3.

### 4.4.1 Syntax

The syntax of SCVM is given in Figure 4.4. In SCVM, each variable and statement has a security label from the lattice \( \{P, A, B, O\} \), where \( \sqsubseteq \) is defined to be the smallest partial order such that \( P \sqsubseteq l \sqsubseteq O \) for \( l \in \{A, B\} \). The label of each variable indicates whether its memory location should be public, known to either Alice or Bob (only), or secret. For readability, we do not distinguish between oblivious secret arrays and non-oblivious secret arrays at this point, and simply assume that all secret arrays are oblivious. Support for non-oblivious, secret arrays will be added in Section 4.5.

An information-flow control type system, which we discuss in Section 4.4.4, enforces that information can only flow from low (i.e., lower in the partial order) security variables to high security variables. For example, for a statement \( x := y \) to be secure, \( y \)'s security label should be less than or equal to \( x \)'s security label. An exception is the declassification statement \( x := \text{declass}_l(y) \) which may declassify a variable \( y \) labeled \( O \) to a variable \( x \) with lower security label \( l \).
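The flow rule for a plain assignment can be sketched directly from the partial order defined above. In the Python sketch below, the table `AT_OR_ABOVE` and the helper names are illustrative assumptions; the sketch covers only $x := y$ and deliberately omits the declassification exception and the rest of the type system.

```python
# For each label, the set of labels at or above it in the partial order:
# P is below A and B, which are both below O; A and B are incomparable.
AT_OR_ABOVE = {
    "P": {"P", "A", "B", "O"},
    "A": {"A", "O"},
    "B": {"B", "O"},
    "O": {"O"},
}

def flows_to(source_label, target_label):
    """True if data labeled source_label may flow to target_label."""
    return target_label in AT_OR_ABOVE[source_label]

def assignment_ok(labels, target, source):
    """Check the assignment target := source for the given variable labels."""
    return flows_to(labels[source], labels[target])

labels = {"n": "P", "a": "A", "b": "B", "s": "O"}
checks = {
    "s := a": assignment_ok(labels, "s", "a"),  # A to O: allowed
    "a := n": assignment_ok(labels, "a", "n"),  # P to A: allowed
    "a := s": assignment_ok(labels, "a", "s"),  # O to A: rejected
    "a := b": assignment_ok(labels, "a", "b"),  # B to A: rejected
}
```

Flows between A and B are rejected in both directions because the two labels are incomparable; moving data downward requires the explicit declassification statement.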

The label of each statement indicates the statement’s mode of execution. A statement with the label \( P \) is executed in public mode, where both Alice and Bob can see its execution. A statement with the label \( A \) or \( B \) is executed in local mode, and is visible to only Alice or Bob, respectively. A statement with the label \( O \) is executed securely, so both Alice and Bob know the statement was executed but do not learn the underlying values that were used.

Most SCVM language features are standard. We highlight the statement $x := \text{oram}(y)$, by which variable $x$ is assigned to an ORAM initialized with array $y$’s contents, and the expression $\text{mux}(x_0, x_1, x_2)$, which evaluates to either $x_1$ or $x_2$, depending on whether $x_0$ is 0 or 1.

$$
\begin{array}{lrcl}
\text{Variables} & x, y, z & \in & \mathbf{Vars} \\
\text{Security Labels} & l & \in & \mathbf{SecLabels} = \{P, A, B, O\} \\
\text{Numbers} & n & \in & \mathbf{Nat} \\
\text{Operation} & op & ::= & + \mid - \mid \dots \\
\text{Expressions} & e & ::= & x \mid n \mid x \; op \; x \mid x[x] \mid \text{mux}(x, x, x) \\
\text{Statements} & s & ::= & \text{skip} \mid x := e \mid x[x] := x \mid \text{if } (x) \text{ then } S \text{ else } S \mid \dots \\
\text{Labeled Statements} & S & ::= & l : s \mid S ; S
\end{array}
$$

Figure 4.4: Syntax of SCVM

### 4.4.2 Semantics

We define a formal semantics for SCVM programs, which we think of as defining a computation carried out, on Alice and Bob’s behalf, by an ideal functionality $\mathcal{F}$. However, as we foreshadow throughout, the semantics is endowed with sufficient structure that it can be interpreted as using the mechanisms (like ORAM and garbled circuits) described in Section 4.3. We discuss such a hybrid-world interpretation more carefully in Section 4.4.5 and prove it also satisfies our security properties.

**Memories and types.** Before we begin, we consider a few auxiliary definitions given in Figure 4.5. A memory $M$ is a partial map from variables to value-label pairs. The value is either a natural number $n$ or an array $m$, which is a partial map from naturals to naturals. The security labels $l \in \{P, A, B, O\}$ indicate the conceptual visibility of the value as described earlier. Note that in a real-world implementation, data labeled O is stored in ORAM and secret-shared between Alice and Bob, while other data is stored locally by Alice or Bob. We sometimes find it convenient to project memories whose values are visible at particular labels:

**Definition 6** (L-projection). Given memory $M$ and a set of security labels $L$, 
we write $M[L]$ as $M$’s $L$-projection, which is itself a memory such that for all 
$x$, $M[L](x) = (v, l)$ if and only if $M(x) = (v, l)$ and $l \in L$.

We define types $\text{Nat } l$ and $\text{Array } l$, for numbers and arrays, respectively, 
where $l$ is a security label. A *type environment* $\Gamma$ associates variables with types, 
and we interpret it as a partial map. We sometimes consider when a memory is 
consistent with a type environment $\Gamma$:

**Definition 7** (Γ-compatibility). We say a memory $M$ is $\Gamma$-compatible if and only 
if for all $x$, when $M(x) = (v, l)$, then $v \in \text{Nat } \Leftrightarrow \Gamma(x) = \text{Nat } l$ and $v \in \text{Array } \Leftrightarrow \Gamma(x) = \text{Array } l$.
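
Definition 7 can be checked mechanically. The sketch below assumes memories are dictionaries of (value, label) pairs and that $\Gamma$ is a dictionary mapping each variable to a `("Nat", label)` or `("Array", label)` pair; both encodings and the name `is_compatible` are hypothetical.

```python
def is_compatible(memory, gamma):
    """Check Gamma-compatibility for every variable bound in memory.

    Assumed encoding: an int value is a Nat, a dict value is an Array.
    """
    for x, (v, l) in memory.items():
        kind = "Array" if isinstance(v, dict) else "Nat"
        # The declared type must match both the kind of value and its label.
        if gamma.get(x) != (kind, l):
            return False
    return True


gamma = {"n": ("Nat", "P"), "e": ("Array", "A")}
ok = is_compatible({"n": (4, "P"), "e": ({0: 7}, "A")}, gamma)
bad = is_compatible({"n": (4, "A")}, gamma)  # label A disagrees with Nat P
```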

**Ideal functionality.** Once Alice and Bob have agreed on a program $S$, we imagine an ideal functionality $\mathcal{F}$ that executes $S$. Alice and Bob send to $\mathcal{F}$ memories $M_A$ and $M_B$, respectively. Alice’s memory contains data labeled $A$ and $P$, while Bob’s memory contains data labeled $B$ and $P$. (Data labeled $O$ is only constructed during execution.) $\mathcal{F}$ then proceeds as follows:

1. It checks that $M_A$ and $M_B$ agree on P-labeled values, i.e., that $M_A[\{P\}] = M_B[\{P\}]$. It also checks that they do not share any A/B-labeled values, i.e., that the domain of $M_A[\{A\}]$ and the domain of $M_B[\{B\}]$ do not intersect. If either of these conditions fails, $\mathcal{F}$ notifies both parties and aborts the execution. Otherwise, it constructs memory $M$ from $M_A$ and $M_B$:

$$M = \{x \mapsto (v, l) \mid M_A[\{A, P\}](x) = (v, l) \lor M_B[\{B\}](x) = (v, l)\}$$

2. $\mathcal{F}$ executes $S$ according to semantics rules having the form $\langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', S' \rangle : D$. This judgment states that starting in memory $M$, statement $S$ runs, producing a new memory $M'$ and a new statement $S'$ (representing the partially executed program) along with instruction traces $i_a$ and $i_b$, memory traces $t_a$ and $t_b$, and declassification event $D$. We discuss these traces/events shortly. The ideal execution will produce one of three outcomes (or fail to terminate):

- $\langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', S' \rangle : D$, where $D = (d_a, d_b)$. In this case, $\mathcal{F}$ outputs $d_a$ to Alice, and $d_b$ to Bob. Then $\mathcal{F}$ sets $M$ to $M'$ and $S$ to $S'$ and restarts step 2.

- $\langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', l : \text{skip} \rangle : \epsilon$. In this case, $\mathcal{F}$ notifies both parties that computation finished successfully.

- $\langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', S' \rangle : \epsilon$, where $S' \neq l : \text{skip}$, and no rules further reduce $\langle M', S' \rangle$. In this case, $\mathcal{F}$ aborts and notifies both parties.
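
Step 1 of the ideal functionality (input validation and the construction of $M$) can be sketched as follows. This is an illustration under an assumed dictionary encoding of memories; the function name `combine` and the use of `ValueError` to model an abort are hypothetical choices.

```python
def combine(m_alice, m_bob):
    """Validate both parties' memories, then merge them into the memory M."""
    def proj(m, labels):
        # L-projection: keep only entries whose label is in labels.
        return {x: vl for x, vl in m.items() if vl[1] in labels}

    # Both parties must agree on every P-labeled value.
    if proj(m_alice, {"P"}) != proj(m_bob, {"P"}):
        raise ValueError("abort: public values disagree")
    # A-labeled and B-labeled variables must have disjoint names.
    if proj(m_alice, {"A"}).keys() & proj(m_bob, {"B"}).keys():
        raise ValueError("abort: A/B variable names overlap")
    # M takes Alice's A- and P-labeled entries plus Bob's B-labeled entries.
    merged = proj(m_alice, {"A", "P"})
    merged.update(proj(m_bob, {"B"}))
    return merged


M = combine({"n": (3, "P"), "e": (5, "A")}, {"n": (3, "P"), "s": (1, "B")})
```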

Notice that the only communications between $\mathcal{F}$ and each party about the computation are declassifications $d_a$ and $d_b$ (to Alice and Bob, respectively) and notification of termination. This is because we assume that secure programs will always explicitly declassify their final output (and perhaps intermediate outputs, e.g., when processing multiple queries), while all other variables in memory are not of consequence. The memory and instruction traces, though not explicitly communicated by $\mathcal{F}$, will be visible in a real implementation (described later), but we prove that they provide no additional information beyond that provided by the declassification events.

Figure 4.5: Auxiliary syntax and functions for SCVM semantics

**Traces and events.** The formal semantics incorporate the concept of traces to define information leakage. There are three types of traces, all given in Figure 4.5.

The first is an instruction trace $i$. The instruction trace generated by an assignment statement is the statement itself (e.g., $x := e$); the instruction trace generated by a branching statement is denoted $\text{if}(x)$ or $\text{while}(x)$. Declassification and ORAM initialization will generate instruction traces $\text{declass}(x,y)$ and $\text{init}(x,y)$, respectively. The trace $\epsilon$ indicates an unobservable statement execution (e.g., Bob cannot observe Alice executing her local code). Trace equivalence (i.e. $t_1 \equiv t_2$) is defined in Figure 4.5.

The second sort of trace is a memory trace $t$, which captures reads or writes of variables visible to one or the other party. Here are the different memory trace events:

- **P**: Operations on public arrays generate memory event $\text{readarr}(x,n,v)$ or $\text{writearr}(x,n,v)$ visible to both parties, including the variable name $x$, the index $n$, and the value $v$ read or written. Operations on public variables generate memory event $\text{read}(x,v)$ or $\text{write}(x,v)$. Initializing an ORAM from a public array accesses each item in the array, so a sequence of events $\text{readarr}(x,i,m(i))$ for $i = 0, \ldots, |m| - 1$ is visible to both Alice and Bob. We use $\text{arr}(x,m)$ to indicate such a sequence of memory events.

- **A/B**: Operations on Alice’s secret arrays generate memory event $\text{readarr}(x,n,v)$ or $\text{writearr}(x,n,v)$ visible to Alice only. Operations on Alice’s secret variables generate memory event $\text{read}(x,v)$ or $\text{write}(x,v)$ visible to Alice only. Initializing an ORAM from Alice’s secret array generates memory events $\text{arr}(x, m)$ visible to Alice only. Operations on Bob’s secret arrays/variables are handled similarly.

- **O**: Operations on a secret array generate memory event $x$ visible to both Alice and Bob, containing only the variable name, but not the index or the value.

A special case is the initialization of ORAM bank \( x \) with \( y \)’s value: a memory trace \( y \), but not its content, is observed.

Memory-trace equivalence is defined similarly to instruction-trace equivalence.

Finally, each declassification executed by the program produces a declassification event \( (d_a, d_b) \), where Alice learns the declassification \( d_a \) and Bob learns \( d_b \).

There is also an empty declassification event \( \epsilon \), which is used for non-declassification statements. Given a declassification event \( D = (d_a, d_b) \), we write \( D[A] \) to denote Alice’s declassification \( d_a \) and \( D[B] \) to denote Bob’s declassification \( d_b \).

**Semantics rules.** Now we turn to the semantics, which consists of two judgments. Figure 4.6 defines rules for the judgment $l \vdash (M, e) \Downarrow_{(t_a, t_b)} v$, which states that in mode $l$, under memory $M$, expression $e$ evaluates to $v$. This evaluation produces memory trace $t_a$ (resp., $t_b$) for Alice (resp., Bob). Which memory trace event to emit is chosen using the function $\text{select}$, which is defined in Figure 4.5. The security label $l$ is passed in by the corresponding assignment statement (i.e., $l : x := e$ or $l : y[x_1] := x_2$). If $l$ is $A$ or $B$, then the accesses to public variables are not observable to the other party, whereas if $l$ is $O$ then both parties know that an access took place; the $l^*$ label defined in E-Var and E-Array ensures the proper visibility of such events. Note the E-Array rule uses the $\text{get}()$ function to retrieve an element from an array; this function will return a default value 0 if the index is out of bounds. Most elements of the rules are otherwise straightforward.

Figure 4.6: Operational semantics for expressions in SCVM

Figures 4.7 and 4.8 define rules for the judgment $\langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', S' \rangle : D$, which says that under memory $M$, the statement $S$ reduces to memory $M'$ and statement $S'$, while producing instruction trace $i_a$ (resp., $i_b$) and memory trace $t_a$ (resp., $t_b$) for Alice (resp., Bob), and generating declassification $D$. Most rules are standard, except for handling memory traces and instruction traces. Instruction traces are handled using function $\text{inst}$ defined in Figure 4.5. This function is defined such that if the label $l$ of a statement is $A$ or $B$, then the other party cannot observe the statement; otherwise, both parties observe the statement.
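
The visibility rules behind $\text{inst}$ and $\text{select}$ can be sketched as below. Figure 4.5 is not reproduced in this text, so the definitions here are inferred from the prose description of the trace events and from how the functions are applied in the rules; the argument order, the use of `None` for the empty trace $\epsilon$, and the tuple encoding of events are all assumptions.

```python
EPS = None  # stands for the empty (unobservable) trace


def inst(label, instruction):
    """Instruction traces (i_a, i_b) for a statement with the given label."""
    if label == "A":
        return (instruction, EPS)      # Bob cannot observe Alice's local code
    if label == "B":
        return (EPS, instruction)      # Alice cannot observe Bob's local code
    return (instruction, instruction)  # P or O: both parties observe it


def select(label, full_event, name_only):
    """Memory traces (t_a, t_b) for one access to data with the given label."""
    if label == "P":
        return (full_event, full_event)
    if label == "A":
        return (full_event, EPS)
    if label == "B":
        return (EPS, full_event)
    return (name_only, name_only)      # O: only the variable name is revealed


event = ("read", "x", 5)
traces_for_secret = select("O", event, "x")
```

For an O-labeled access both parties learn only that `x` was touched, which is the property the ORAM is meant to provide.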

A skip statement generates empty instruction traces and memory traces for both parties regardless of its label. An assignment statement first evaluates the expression to assign, and its trace and the write event constitute the memory trace for this statement. Note that the expression is evaluated using the label $l$ of the assignment statement as per the discussion of E-Var and E-Array above.

Declassification $x := \text{declass}(y)$ declassifies a secret variable $y$ (labeled $O$) to a non-secret variable $x$ (not labeled $O$). Both Alice and Bob will observe that $y$ is accessed (as defined by $t_a$ and $t_b$), whereas the label $l$ of variable $x$ determines who sees the declassified value as indicated by the declassification event $D$.

\[
\text{S-ArrAss} \quad \frac{\begin{array}{c} M(y) = (m, l) \quad l \vdash (M, x_i) \Downarrow_{(t_{ia}, t_{ib})} v_i \quad i = 1, 2 \\ m' = \text{set}(m, v_1, v_2) \quad M' = M[y \mapsto (m', l)] \\ (t'_a, t'_b) = \text{select}(l, \text{writearr}(y, v_1, v_2), y) \\ t_a = t_{1a} @ t_{2a} @ t'_a \quad t_b = t_{1b} @ t_{2b} @ t'_b \\ (i_a, i_b) = \text{inst}(l, y[x_1] := x_2) \end{array}}{\langle M, l : y[x_1] := x_2 \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', l : \text{skip} \rangle : \epsilon}
\]

\[
\frac{\begin{array}{c} M(x) = (0, l) \quad (i_a, i_b) = \text{inst}(l, \text{while}(x)) \\ (t_a, t_b) = \text{select}(l, \text{read}(x, 0), x) \\ S = l : \text{while}(x) \text{ do } S' \end{array}}{\langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M, l : \text{skip} \rangle : \epsilon}
\]

\[
\frac{\begin{array}{c} M(x) = (v, l) \quad v \neq 0 \quad (i_a, i_b) = \text{inst}(l, \text{while}(x)) \\ (t_a, t_b) = \text{select}(l, \text{read}(x, v), x) \\ S = l : \text{while}(x) \text{ do } S' \end{array}}{\langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M, S' ; S \rangle : \epsilon}
\]

\[
\text{S-Seq} \quad \frac{\langle M, S_1 \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', S'_1 \rangle : D}{\langle M, S_1 ; S_2 \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', S'_1 ; S_2 \rangle : D}
\]

\[
\text{S-Concat} \quad \frac{\langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', S' \rangle : \epsilon \qquad \langle M', S' \rangle \xrightarrow{(i'_a, t'_a, i'_b, t'_b)} \langle M'', S'' \rangle : D}{\langle M, S \rangle \xrightarrow{(i_a @ i'_a,\; t_a @ t'_a,\; i_b @ i'_b,\; t_b @ t'_b)} \langle M'', S'' \rangle : D}
\]

Figure 4.8: Operational semantics for statements in SCVM (Part 2)

ORAM initialization produces a shared, secret array $x$ from an array $y$ provided by one party. Thus, the security label of $x$ must be $O$, and the security label of $y$ must not be $O$. This rule implies that the party who holds $y$ will observe memory events $\text{arr}(y, m)$, and then both parties can observe accesses to $x$.

Rule S-ArrAss handles an array assignment. Similar to rule E-Array, out-of-bounds indices are ignored (cf. the $set()$ function in Figure 4.5). For if-statements and while-statements, no memory traces are observed other than those observed from evaluating the guard $x$.
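
The out-of-bounds behavior of $\text{get}()$ and $\text{set}()$ can be illustrated with a small sketch. It assumes an array is a dictionary whose keys are its valid indices, and it assumes an argument order of (array, index, value); the name `set_` (chosen to avoid shadowing Python's built-in `set`) is hypothetical.

```python
def get(m, i):
    """Read m[i]; an out-of-bounds index yields the default value 0."""
    return m.get(i, 0)


def set_(m, i, v):
    """Return a copy of m with m[i] = v; an out-of-bounds write is ignored."""
    updated = dict(m)
    if i in m:
        updated[i] = v
    return updated


arr = {0: 10, 1: 20}
in_bounds = set_(arr, 0, 99)
out_of_bounds = set_(arr, 5, 99)
```

Because neither function raises an error, an access with a bad index produces the same kind of trace event as a valid one, so a stuck state cannot reveal the index.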

Rule S-Seq sequences execution of two statements in the obvious way. Finally, rule S-Concat says that a transition $\langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M'', S'' \rangle : D$ may be composed of one or more small-step transitions that generate no declassification, followed by a transition that generates $D$.

4.4.3 Security

The ideal functionality $\mathcal{F}$ defines the baseline of security, emulating a trusted third party that runs the program using Alice and Bob’s data, directly revealing to them only the explicitly declassified values. In a real implementation run directly by Alice and Bob, however, each party will see additional events of interest, in particular an instruction trace and a memory trace (as defined by the semantics). Importantly, we want to show that these traces provide no additional information about the opposite party’s data beyond what each party could learn from observing $\mathcal{F}$. We do this by proving that in fact these traces can be simulated by Alice and Bob using their local data and the list of declassification events provided by $\mathcal{F}$. As such, revealing the instruction and memory traces (as in a real implementation) provides no additional useful information.

We call our security property \( \Gamma \)-simulatability. To state this property formally, we first define a multi-step version of our statement semantics:

\[
\langle M, P \rangle \xrightarrow{\Gamma,(i_a,t_a,i_b,t_b)} \ast \langle M_n, P_n \rangle : D_1, ..., D_n \\
\langle M_n, P_n \rangle \xrightarrow{(i'_a,t'_a,i'_b,t'_b)} \langle M', P' \rangle : D'
\]

This allows programs to make multiple declassifications, accumulating them as a trace, while remembering only the most recent instruction and memory traces and ensuring that intermediate memories are \( \Gamma \)-compatible.

**Definition 8** ($\Gamma$-simulatability). Let $\Gamma$ be a type environment, and $P$ a program. We say $P$ is $\Gamma$-simulatable if there exist simulators $\text{sim}_A$ and $\text{sim}_B$, which run in time polynomial in the data size, such that for all $M, i_a, t_a, i_b, t_b, M', P', D_1, ..., D_n$, if

\[
\langle M, P \rangle \xrightarrow{\Gamma,(i_a,t_a,i_b,t_b)} \ast \langle M', P' \rangle : D_1, ..., D_n, \text{ then } \text{sim}_A(M[\{P,A\}], D_1[A], ..., D_{n-1}[A]) \equiv (i_a, t_a) \text{ and } \text{sim}_B(M[\{P,B\}], D_1[B], ..., D_{n-1}[B]) \equiv (i_b, t_b).
\]

Intuitively, if $P$ is $\Gamma$-simulatable there exists a simulator $\text{sim}_A$ that, given public data $M[\{P\}]$, Alice’s secret data $M[\{A\}]$, and all outputs $D_1[A], ..., D_{n-1}[A]$ declassified to Alice so far, can compute the instruction traces $i_a$ and memory traces $t_a$ produced by the ideal semantics up until the next declassification event $D_n$, regardless of the values of Bob’s secret data.

Note that $\Gamma$-simulatability is *termination insensitive*, and information may be leaked based upon whether a program terminates or not [9]. However, as long as all runs of a program are guaranteed to terminate (as is typical for programs run in secure-computation scenarios), no information leakage occurs.

4.4.4 Type System

This section presents our type system, which we prove ensures $\Gamma$-simulatability. There are two judgments, both defined in Figure 4.9. The first, written $\Gamma \vdash e : \tau$, states that under environment $\Gamma$, expression $e$ has type $\tau$. The second judgment, written $\Gamma, pc \vdash S$, states that under environment $\Gamma$ and a *label context* $pc$, a labeled statement $S$ is type-correct. Here, $pc$ is a label that describes the ambient control context; $pc$ is set according to the guards of enclosing conditionals or loops. Note that since a program cannot execute an if-statement or a while-statement whose guard is secret, $pc$ can be one of $P$, $A$, or $B$, but not $O$. Intuitively, if $pc$ is $A$ (resp., $B$), then the statement is part of Alice’s (resp., Bob’s) local code. In general, for a labeled statement $S = l : s$ we enforce the invariant $pc \sqsubseteq l$, and if $pc \neq P$, then $pc = l$. In so doing, we ensure that if the security label of a statement is $A$ (including if-statements and while-statements), then all nested statements also have security label $A$, thus ensuring they are only visible to Alice. On the other hand, under a public context, the statement label is unrestricted.

Now we consider some interesting aspects of the rules. Rule T-Assign requires $pc \sqcup l' \sqsubseteq l$, as is standard: $pc \sqsubseteq l$ prevents implicit flows, and $l' \sqsubseteq l$ prevents explicit ones. We further restrict that $\Gamma(x) = \text{Nat } l$, i.e., the assigned variable should have the same security label as the instruction label. Rule T-ArrAss and rule T-Array require that for an array expression $y[x]$, the security label of $x$ should be no higher than the security label of $y$. For example, if $x$ is Alice’s secret variable, then $y$ should be either Alice's local array, or an ORAM shared between Alice and Bob. If $y$ is Bob’s secret variable, or a public variable, then Bob can observe which indices are accessed, and then infer the value of $x$. In the example from Figure 4.2, the array access $\text{vis}[\text{bestj}]$ on line 9 requires that $\text{vis}$ be an ORAM variable since $\text{bestj}$ is.

\[
\text{T-Var} \quad \frac{\Gamma(x) = \text{Nat } l}{\Gamma \vdash x : \text{Nat } l}
\qquad
\text{T-Const} \quad \frac{}{\Gamma \vdash n : \text{Nat } P}
\]

\[
\text{T-Op} \quad \frac{\Gamma(x_1) = \text{Nat } l_1 \quad \Gamma(x_2) = \text{Nat } l_2}{\Gamma \vdash x_1 \; op \; x_2 : \text{Nat } l_1 \sqcup l_2}
\qquad
\text{T-Array} \quad \frac{\Gamma(y) = \text{Array } l_1 \quad \Gamma(x) = \text{Nat } l_2 \quad l_2 \sqsubseteq l_1}{\Gamma \vdash y[x] : \text{Nat } l_1}
\]

\[
\text{T-Mux} \quad \frac{\Gamma(x_i) = \text{Nat } l_i \quad i = 1, 2, 3 \quad l = l_1 \sqcup l_2 \sqcup l_3}{\Gamma \vdash \text{mux}(x_1, x_2, x_3) : \text{Nat } l}
\]

\[
\text{T-Skip} \quad \frac{pc \sqsubseteq l \quad pc \neq P \Rightarrow pc = l}{\Gamma, pc \vdash l : \text{skip}}
\]

\[
\text{T-Assign} \quad \frac{\Gamma(x) = \text{Nat } l \quad \Gamma \vdash e : \text{Nat } l' \quad pc \sqcup l' \sqsubseteq l \quad pc \neq P \Rightarrow pc = l}{\Gamma, pc \vdash l : x := e}
\]

\[
\text{T-ArrAss} \quad \frac{\Gamma(y) = \text{Array } l \quad \Gamma(x_1) = \text{Nat } l_1 \quad \Gamma(x_2) = \text{Nat } l_2 \quad pc \sqcup l_1 \sqcup l_2 \sqsubseteq l \quad pc \neq P \Rightarrow pc = l}{\Gamma, pc \vdash l : y[x_1] := x_2}
\]

\[
\text{T-ORAM} \quad \frac{pc = P \quad \Gamma(x) = \text{Array } O \quad \Gamma(y) = \text{Array } l \quad l \neq O}{\Gamma, pc \vdash O : x := \text{oram}(y)}
\]

\[
\text{T-Declass} \quad \frac{pc = P \quad \Gamma(y) = \text{Nat } O \quad \Gamma(x) = \text{Nat } l \quad l \neq O}{\Gamma, pc \vdash O : x := \text{declass}(y)}
\]

\[
\text{T-Cond} \quad \frac{\Gamma(x) = \text{Nat } l \quad pc \sqsubseteq l \quad l \neq O \quad pc \neq P \Rightarrow l = pc \quad \Gamma, l \vdash S_i \quad i = 1, 2}{\Gamma, pc \vdash l : \text{if}(x) \text{ then } S_1 \text{ else } S_2}
\]

\[
\text{T-While} \quad \frac{\Gamma(x) = \text{Nat } l \quad pc \sqsubseteq l \quad l \neq O \quad pc \neq P \Rightarrow l = pc \quad \Gamma, l \vdash S}{\Gamma, pc \vdash l : \text{while}(x) \text{ do } S}
\]

Figure 4.9: Type System for SCVM
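
The label ordering and the T-Array side condition discussed in this section can be sketched as follows. The lattice is assumed from the examples in the text ($P$ below $A$ and $B$, both below $O$, with $A$ and $B$ incomparable); the function names and the tuple encoding of types are hypothetical.

```python
def leq(l1, l2):
    """l1 ⊑ l2 in the assumed lattice: P ⊑ A ⊑ O and P ⊑ B ⊑ O."""
    return l1 == l2 or l1 == "P" or l2 == "O"


def join(l1, l2):
    """Least upper bound l1 ⊔ l2."""
    if leq(l1, l2):
        return l2
    if leq(l2, l1):
        return l1
    return "O"  # A and B are incomparable; their join is O


def type_array_read(gamma, y, x):
    """Type of y[x] per T-Array, or None if the access is ill-typed."""
    kind_y, l1 = gamma[y]
    kind_x, l2 = gamma[x]
    # The index label must not exceed the array label.
    if kind_y == "Array" and kind_x == "Nat" and leq(l2, l1):
        return ("Nat", l1)
    return None


gamma = {
    "vis": ("Array", "O"), "bestj": ("Nat", "O"),
    "pub": ("Array", "P"), "a": ("Nat", "A"),
    "ea": ("Array", "A"), "eb": ("Array", "B"),
}
secret_index_into_oram = type_array_read(gamma, "vis", "bestj")
secret_index_into_public = type_array_read(gamma, "pub", "bestj")
```

Indexing the ORAM `vis` with the secret `bestj` type-checks, while indexing a public array with it is rejected, matching the example from Figure 4.2.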

For rules T-Declass and T-ORAM, since declassification and ORAM initialization statements both require secure computation, we restrict the statement label to be $O$. Since these two statements cannot be executed in Alice’s or Bob’s local mode, we restrict that $pc = P$.

Rule T-Cond deals with if-statements; T-While handles while loops similarly. First of all, we restrict $pc \sqsubseteq l$ and $\Gamma(x) = \text{Nat } l$ for the same reason as above. Further, the rule forbids $l$ to be equal to $O$ to avoid an implicit flow revealed by the program’s control flow. An alternative way to achieve instruction- and memory-trace obliviousness is through padding [58]. However, in the setting of secure computation, padding achieves the same performance as rewriting a secret-branching statement into a $\text{mux}$ (or a sequence of them). Moreover, using padding would require reasoning about trace patterns, a complication our type system avoids.

A well-typed program is $\Gamma$-simulatable:

**Theorem 3.** If $\Gamma, P \vdash S$, then $S$ is $\Gamma$-simulatable.

Notice that some rules allow a program to get stuck. For example, in rule S-ORAM, if the statement is $l : x := \text{oram}(y)$ but $l \neq O$, then the program will not progress. We define a property called $\Gamma$-progress that formalizes the notion of a program that never gets stuck.

**Definition 9** ($\Gamma$-progress). Let $\Gamma$ be a type environment, and let $P = P_0$ be a program. We say $P$ enjoys $\Gamma$-progress if for any $\Gamma$-compatible memories $M_0, \ldots, M_n$ for which $\langle M_j, P_j \rangle \xrightarrow{(i_a^j, t_a^j, i_b^j, t_b^j)} \langle M_{j+1}, P_{j+1} \rangle : D^j$ for $j = 0, \ldots, n - 1$, either $P_n = l : \text{skip}$, or there exist $i'_a, t'_a, i'_b, t'_b, M', P'$ such that $\langle M_n, P_n \rangle \xrightarrow{(i'_a, t'_a, i'_b, t'_b)} \langle M', P' \rangle : D'$.

$\Gamma$-progress means, in particular, that the third bullet in step (2) of the ideal functionality (Section 4.4.2) does not occur for type-correct programs.

A well-typed program never gets stuck:

**Theorem 4.** If $\Gamma, P \vdash S$, then $S$ enjoys $\Gamma$-progress.

Proofs of both theorems above can be found in Appendix C.

4.4.5 From SCVM Programs to Secure Protocols

Let $P$ be a program, and let $\mathcal{F}$ be the ideal functionality based on this program as described earlier. Here we define a hybrid-world protocol $\pi^{\mathcal{G}}$ based on $P$, where $\mathcal{G} = (\mathcal{F}_{\text{op}}, \mathcal{F}_{\text{mux}}, \mathcal{F}_{\text{oram}}, \mathcal{F}_{\text{declass}})$ is a fixed set of ideal functionalities that implement simple binary operations ($\mathcal{F}_{\text{op}}$), a MUX operation ($\mathcal{F}_{\text{mux}}$), ORAM access ($\mathcal{F}_{\text{oram}}$), and declassification ($\mathcal{F}_{\text{declass}}$). Inputs to each of these ideal functionalities can be Alice’s or Bob’s local inputs, public inputs, and/or the shares of secret inputs (each share supplied by Alice and Bob, respectively). Each ideal functionality is explicitly parameterized by the types of the inputs. Further, except for $\mathcal{F}_{\text{declass}}$, which performs an explicit declassification, all other ideal functionalities return shares of the computation or memory fetch result to Alice and Bob, respectively. Further details of the ideal functionalities are given in Appendix D, along with formal definitions of the simulator and hybrid-world semantics.

Informally, the hybrid-world protocol $\pi^{\mathcal{G}}$ runs as follows:

1. Alice and Bob first agree on public values, ensuring that $M_A[\{P\}] = M_B[\{P\}]$.
   During the protocol each maintains a declassification list, for keeping track of previously declassified values, and a secret memory that contains shares of secret (non-ORAM) variables. To start, both the lists and memories are empty, i.e., $\text{decls}_A := \text{decls}_B := \epsilon$ and $M^S_A = M^S_B = \emptyset$.

2. Alice runs her simulator (locally) on her initial memory to obtain $(i_a, t_a) = \text{sim}_A(M_A, \text{decls}_A)$, where $i_a$ and $t_a$ cover the portion of the execution starting from just after the last provided declassification (i.e., the final $d_a$ in the list $\text{decls}_A$) up to the next declassification instruction or the terminating skip statement. Bob does likewise to get $(i_b, t_b) = \text{sim}_B(M_B, \text{decls}_B)$.

3. Alice executes the instructions in $i_a$ using the hybrid-world semantics, which reads (and writes) secret shares from (to) $M^S_A$ and obtains the values of other reads from events observed in $t_a$. Bob does similarly with $i_b$, $M^S_B$ and $t_b$. The semantics covers three cases:

   - If an instruction in $i_a$ is labeled $P$, then so is the corresponding instruction in $i_b$. Both parties execute the instruction.
   - If an instruction in $i_a$ is labeled $A$, then Alice executes this instruction locally. Bob does similarly for instructions labeled $B$.
   - If an instruction in $i_a$ is labeled $O$, then so is the corresponding instruction in $i_b$. Alice and Bob call the appropriate ideal-world functionality from $\mathcal{G}$ to execute this instruction. If the instruction is a declassification, then $\mathcal{F}_{\text{declass}}$ will generate an event $(d_a, d_b)$.

4. If the last instruction executed in step 3 is a declassification, then Alice appends her declassification to her local declassification list (i.e., $\text{decls}_A := \text{decls}_A +\!\!+\, [d_a]$), and Bob does likewise; then both repeat step 2. Otherwise, the protocol completes.

We have proved that if $P$ is $\Gamma$-simulatable, then $\pi^{\mathcal{G}}$ securely implements $\mathcal{F}$ against semi-honest adversaries.

**Theorem 5.** (Informally) Let $P$ be a program, $\mathcal{F}$ the ideal functionality corresponding to $P$, and $\pi^{\mathcal{G}}$ the protocol corresponding to $P$ as described above. If $P$ is $\Gamma$-simulatable, then $\pi^{\mathcal{G}}$ securely implements $\mathcal{F}$ against semi-honest adversaries in the $\mathcal{G}$-hybrid model.

Using standard composition results for cryptographic protocols, we obtain as a corollary that if all ideal functionalities in $\mathcal{G}$ are implemented by semi-honest secure protocols, the resulting (real-world) protocol securely implements $\mathcal{F}$ against semi-honest adversaries.

A formal definition of $\pi^{\mathcal{G}}$, a formal theorem statement, and a proof of the theorem can be found in Appendix D.

4.5 Compilation

We informally discuss how to compile an annotated C-like source language into an SCVM program. An example of our source language is:

    int sum(alice int x, bob int y) {
        return x<y ? 1 : 0;
    }

The program’s two input variables, $x$ and $y$, are annotated as Alice’s and Bob’s data, respectively, while the unannotated return type $\text{int}$ indicates the result will be known to both Alice and Bob. Programmers need not annotate any local variables. To compile such a program into an SCVM program, the compiler takes the following steps.

**Typing the source language.** As just mentioned, source level types and initial security label annotations are assumed given. With these, the type checker infers security labels for local variables using a standard security type system [79] using our lattice (Section 4.4.4). If no such labeling is possible without violating security (e.g., due to a conflict in the initial annotation), the program is rejected.

**Labeling statements.** The second task is to assign a security label to each statement. For assignment statements and array assignment statements, the label is the least upper bound of all security labels of the variables occurring in the statement. For an if-statement or a while-statement, the label is the least upper bound of all security labels of the guard variables, and all security labels in the branches or loop body.

**On secret branching.** The type system defined in Section 4.4.4 will reject an if-statement whose guard has security label $O$. As such, if the program branches on secret data, we must compile it into *if-free* SCVM code, using $\text{mux}$ instructions. The idea is to execute both branches, and use $\text{mux}$ to activate the relevant effects, based on the guard. To do this, we convert the code into Static-Single-Assignment form (SSA) [7], and then replace occurrences of the $\phi$-operator with a $\text{mux}$. The following example demonstrates this process:

$\text{if}(s)\ \text{then}\ x := 1;\ \text{else}\ x := 2;$

The SSA form of the above code is

$\text{if}(s)\ \text{then}\ x_1 := 1;\ \text{else}\ x_2 := 2;\ x := \phi(x_1, x_2);$

Then we eliminate the if-structure and substitute the $\phi$-operator to achieve the final code:

$x_1 := 1;\ x_2 := 2;\ x := \text{mux}(s, x_1, x_2)$

(Note that, for simplicity, we have omitted the security labels on the statements in the example.)
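
The effect of this rewriting can be checked with a small sketch that runs in the clear. It assumes the guard is a bit (0 or 1) and that $\text{mux}(c, a, b)$ selects $a$ when $c$ is 1, which is the convention the compilation examples suggest; the arithmetic encoding of `mux` and the function names are illustrative choices, not SCVM's implementation.

```python
def mux(c, a, b):
    """Branch-free selection for a bit c: returns a if c == 1, b if c == 0."""
    return b + c * (a - b)


def branching(s):
    # Original code: control flow depends on the secret guard s.
    if s:
        x = 1
    else:
        x = 2
    return x


def if_free(s):
    # Compiled code: both branches execute, then mux picks the live result.
    x1 = 1                 # effect of the then-branch
    x2 = 2                 # effect of the else-branch
    return mux(s, x1, x2)  # takes the place of the phi-node


results = [(branching(s), if_free(s)) for s in (0, 1)]
```

Both versions compute the same value of `x`, but the if-free version executes the same instructions for either value of `s`.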

**On secret while loops.** The type system requires that while loop guards only reference public data, so that the number of iterations does not leak information. A programmer can work around this restriction by imposing a constant bound on the loop; e.g., manually translating $\text{while}\ (s)\ \text{do}\ S$ to $\text{while}\ (p)\ \text{do}\ \text{if}\ (s)\ S\ \text{else}\ \text{skip}$, where $p$ defines an upper bound on the number of iterations.
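
A minimal sketch of the loop-bounding workaround, again executed in the clear: the loop runs a fixed, public number of times and the original secret guard only controls which updates take effect, with the inner if-statement already rewritten into `mux` operations. The example computation (counting the bits of a secret number), the bound of 32, and all names are assumptions made for illustration.

```python
def mux(c, a, b):
    """Branch-free selection for a bit c: returns a if c == 1, b if c == 0."""
    return b + c * (a - b)


def bit_length_leaky(x):
    count = 0
    while x != 0:            # iteration count depends on the secret x
        x //= 2
        count += 1
    return count


def bit_length_bounded(x, bound=32):
    count = 0
    for _ in range(bound):   # public guard: always exactly `bound` iterations
        s = int(x != 0)      # secret guard of the original loop, as a bit
        x = mux(s, x // 2, x)
        count = mux(s, count + 1, count)
    return count


samples = [0, 1, 5, 255, 2**31]
agree = all(bit_length_leaky(x) == bit_length_bounded(x) for x in samples)
```

The bounded version is only correct when the public bound is at least the true iteration count, which is why the programmer must choose it.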

**Declassification.** The compiler will emit a declassification statement for each return statement in the source program. To avoid declassifying in the middle of local code, the type checker in the first phase will check for this possibility and relabel statements accordingly.

**Extension for non-oblivious secret RAM.** The discussion so far supports only secret ORAMs. To support non-oblivious secret RAM in SCVM, we add an additional security label $N$ such that $P \sqsubseteq N \sqsubseteq O$. To incorporate such a change, the memory trace for the semantics should include two more kinds of trace event, $\text{nread}(x, i)$ and $\text{nwrite}(x, i)$, which represent that only the index of an access is leaked, but not the content. Since label $N$ only applies to arrays, we allow types $\text{Array } N$ but not types $\text{Nat } N$. The rules T-Array and T-ArrAss should be revised to deal with the non-oblivious RAM. For example, for rule T-ArrAss, where $l$ is the security label for the array, $l_1$ is the security label of the index variable and $l_2$ is the security label of the value variable, the type system should still restrict $l_1 \sqsubseteq l$, but if $l = N$, the type system accepts $l_2 = O$, but requires $l_1 = P$.

**Correctness.** We do not prove the correctness of our compiler; instead, an SCVM type checker (with the above extension) can be run on the generated SCVM code to ensure that it is $\Gamma$-simulatable. Ensuring the correctness of compilers is orthogonal and outside the scope of this work, and existing techniques [20] can potentially be adapted to our setting.

**Compiling Dijkstra’s algorithm.** We explain how compilation works for Dijkstra’s algorithm, previously shown in Figure 4.2. First, the type checker for the source program determines how memory should be labeled. It determines that the security labels for $\text{bestj}$ and $\text{bestdis}$ should be $O$, and the arrays $\text{dis}$ and $\text{vis}$ should be secret-shared between Alice and Bob, since their values depend on both Alice’s input (i.e., the graph’s edge weights) and Bob’s input (i.e., the source). Then, since on line 9 array $\text{vis}$ is indexed with $\text{bestj}$, variable $\text{vis}$ should also be put in an ORAM. Similarly, on line 12, array $\text{e}$ is indexed by $\text{bestj}$ so it must also be secret; as such we must promote $\text{e}$, owned by Alice, to be in ORAM, which we do by initializing a new ORAM-allocated variable $\text{orame}$ to $\text{e}$ at the start of the program.

The type checker then uses the variable labeling to determine the statement labeling. Statements on lines 4–7, 9, and 11–13, require secure computation and thus are labeled as O. Loop control-flow statements are computed publicly, so they are labeled as P.

The two if-statements both branch on ORAM-allocated data, so they must be converted to mux operations. Lines 4–7 are transformed (in source-level syntax) as follows

\[
\begin{align*}
\text{cond3} &:= \neg\text{vis}[j] \land (\text{bestj}<0 \lor \text{dis}[j]<\text{bestdis}); \\
\text{bestj} &:= \text{mux}(\text{cond3}, j, \text{bestj}); \\
\text{bestdis} &:= \text{mux}(\text{cond3}, \text{dis}[j], \text{bestdis});
\end{align*}
\]

Lines 11-13 are similarly transformed

\[
\begin{align*}
\text{tmp} &:= \text{bestdis} + \text{orame}[\text{bestj}\times n + j]; \\
\text{cond4} &:= \neg\text{vis}[j] \land (\text{tmp}<\text{dis}[j]); \\
\text{dis}[j] &:= \text{mux}(\text{cond4}, \text{tmp}, \text{dis}[j]);
\end{align*}
\]
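The effect of the if-to-mux conversion can be illustrated in plaintext. This is a minimal sketch: the branching form of lines 4–7 is an assumed reconstruction of the source program in Figure 4.2, and Python's conditional expression merely stands in for a circuit multiplexer (in a real circuit both inputs, and both sides of every Boolean operator, are always evaluated).

```python
def mux(cond: bool, a, b):
    """Return a when cond holds, otherwise b (stand-in for a circuit mux)."""
    return a if cond else b


def select_branching(vis, dis, j, bestj, bestdis):
    # Assumed shape of source lines 4-7: a branch on secret data.
    if (not vis[j]) and (bestj < 0 or dis[j] < bestdis):
        bestj, bestdis = j, dis[j]
    return bestj, bestdis


def select_mux(vis, dis, j, bestj, bestdis):
    # Transformed form: the same three assignments run whatever cond3 is,
    # so the instruction trace does not depend on secret data.
    cond3 = (not vis[j]) and (bestj < 0 or dis[j] < bestdis)
    new_bestj = mux(cond3, j, bestj)
    new_bestdis = mux(cond3, dis[j], bestdis)
    return new_bestj, new_bestdis
```

Both functions compute the same result for every input; only the second has a control flow that is independent of `vis`, `dis`, `bestj`, and `bestdis`.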

Finally, the code is translated into SCVM’s three-address code style syntax.

4.6 Evaluation

Programs. We have built several secure two-party computation applications. As run-once tasks, we implemented both the Knuth-Morris-Pratt (KMP) string-matching algorithm as well as Dijkstra’s shortest-path algorithm. For repeated sublinear-time database queries, we considered binary search and the heap data
In comparison, some earlier circuit-model compilers involve copying datasets into circuits, and therefore the compile-time can be large \cite{54,65} (e.g., Kreuter et al. \cite{54} report a compilation time of roughly 1000 seconds for an implementation of an algorithm to compute graph isomorphism on 16-node graphs).

In our experiments, we manually checked the correctness of compiled programs (we have not yet implemented a type checker for SCVM, though doing so should be straightforward).

### 4.6.1 Evaluation Methodology

Although our techniques are compatible with any cryptographic back-end secure in the semi-honest model by the definition of Canetti \cite{15}, we use the garbled circuit approach in our evaluation \cite{46}.

We measure the computational cost by calculating the number of encryptions required by the party running as the circuit generator (the party running as the evaluator does less work). For every non-XOR binary gate, the generator makes 3 block-cipher calls; for every oblivious transfer (OT), 2 block-cipher operations are required since we rely on OT extension [47]. For the run-once applications (i.e., Dijkstra shortest distance, KMP-matching, aggregation, inverse permutation), we include the ORAM initialization cost when comparing to the automated circuit approach (which doesn’t require RAM initialization). The ORAM initialization can be done using a Waksman shuffling network [89]. For the applications expecting multiple executions, we do not count the ORAM initialization cost since this one-time overhead will be amortized to (nearly) 0 over many executions.

We implemented the binary tree-based ORAM of Shi et al. [80] using garbled circuits, so that array accesses reveal nothing about either the (logical) addresses or the outcomes. Throughout the experiments, we set the ORAM bucket size to 32 (i.e., each tree node can store up to 32 blocks), which corresponds to roughly 25 bits of statistical security (according to the simulation of ORAM failures). Following Gordon et al.’s ORAM encryption technique [38], every block is XOR-shared (i.e., the client stores secret key $k$ while the server stores $(r, f_k(r) \oplus m)$ where $f$ is a family of pseudorandom permutations and $m$ the data block). This adds one additional cipher operation per block (when the length of an ORAM block is less than the width of the cipher). We note specific choices of the ORAM parameters in the related discussion of each application.
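The XOR-sharing of a block, $(r, f_k(r) \oplus m)$, can be sketched in plaintext as follows. This is an illustration only: HMAC-SHA256 is swapped in as a pseudorandom function in place of the pseudorandom permutation family $f$, the function names are hypothetical, and in the actual system the operation is carried out inside the secure computation rather than in the clear.

```python
import hashlib
import hmac
import secrets

BLOCK_BYTES = 32  # HMAC-SHA256 output size; this sketch handles blocks up to this length


def prf(k: bytes, r: bytes) -> bytes:
    """Stand-in for f_k(r): HMAC-SHA256 keyed with k (a PRF, not a permutation)."""
    return hmac.new(k, r, hashlib.sha256).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_block(k: bytes, m: bytes) -> tuple[bytes, bytes]:
    """Produce the server-side pair (r, f_k(r) XOR m); the client keeps only k."""
    if len(m) > BLOCK_BYTES:
        raise ValueError("block longer than one PRF output")
    r = secrets.token_bytes(16)          # fresh randomness for every stored block
    pad = prf(k, r)[: len(m)]
    return r, xor_bytes(pad, m)


def decrypt_block(k: bytes, ciphertext: tuple[bytes, bytes]) -> bytes:
    """Recover m by recomputing the pad f_k(r) and XORing it away."""
    r, c = ciphertext
    return xor_bytes(prf(k, r)[: len(c)], c)
```

One evaluation of `prf` per block corresponds to the single additional cipher operation mentioned above.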

**Metrics.** We use the number of block-cipher evaluations as our performance metric. Measuring the performance by the number of symmetric encryptions (instead of wall clock times) makes it easier to compare with other systems since the numbers can
be independent of the underlying hardware and ciphering algorithms. Additionally, in our experiments these numbers represent bandwidth consumption since every encryption is sent over the network. Therefore, we do not report separately the bandwidth used. Modern processors with AES support can compute $10^8$ AES-128 operations per second.
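The cost model of this section can be written down directly. The sketch below uses only the constants stated in the text (3 block-cipher calls per non-XOR gate, 2 per OT, and roughly $10^8$ AES-128 operations per second); the function names are hypothetical, and the time conversion is a rough estimate that ignores network latency.

```python
CALLS_PER_NON_XOR_GATE = 3   # generator cost per non-XOR binary gate; XOR gates are not counted
CALLS_PER_OT = 2             # per oblivious transfer, assuming OT extension


def generator_cipher_calls(non_xor_gates: int, oblivious_transfers: int) -> int:
    """Block-cipher calls made by the circuit generator under the stated model."""
    return CALLS_PER_NON_XOR_GATE * non_xor_gates + CALLS_PER_OT * oblivious_transfers


def estimated_seconds(cipher_calls: int, calls_per_second: float = 1e8) -> float:
    """Rough compute time at the quoted AES throughput (an estimate, not a measurement)."""
    return cipher_calls / calls_per_second
```

For instance, a circuit with 1000 non-XOR gates and 10 OTs costs `generator_cipher_calls(1000, 10)` = 3020 block-cipher calls under this model.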

### 4.6.2 Comparison with Automated Circuits

Presently, automated secure computation implementations largely focus on the circuit-model of computation, handling array accesses by linearly scanning the entire array with a circuit every time an array lookup happens; this incurs prohibitive overhead when the dataset is large. In this section, we compare our approach with the existing compiled circuits, and demonstrate that our approach scales much better with respect to dataset size.
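The linear-scan treatment of array lookups in the circuit model can be sketched in plaintext as follows. This is an illustrative sketch with a hypothetical function name: every element is touched in a fixed order and the result is selected arithmetically, which mirrors how a circuit must process the whole array for a single secret-indexed read.

```python
def linear_scan_read(array: list[int], secret_index: int) -> int:
    """Read array[secret_index] while touching every element in a fixed order."""
    result = 0
    for i, value in enumerate(array):
        hit = int(i == secret_index)                 # 1 only at the requested position
        result = hit * value + (1 - hit) * result    # arithmetic mux, no branch on hit
    return result
```

Each lookup costs one comparison and one mux per array element, so a program that performs many lookups on an array of size $n$ pays a factor of $n$ per access; this is the overhead that the ORAM-based approach avoids.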
4.6.2.1 Repeated sublinear-time queries

In this scenario, ORAM initialization is a one-time operation whose cost can be amortized over multiple subsequent queries, achieving sublinear amortized cost per query.

**Binary search.** One example application we tested is binary search, where one party owns a confidential (sorted) array of size $n$, and the other party searches for (secret) values stored in that array.

In our experiments, we set the ORAM bucket size to 32. For binary search, we aligned our experimental settings with those of Gordon et al. [38], namely, assuming the size of each data item is 512 bits. We set the recursion factor to 8 (i.e., each block can store up to 8 indices for the data in the upper level recursion tree) and the recursion cut-off threshold to 1000 (namely, no more recursion once fewer than 1000 units are to be stored). Compared to a circuit-model implementation, which uses a circuit of size $O(n \log n)$ that implements binary search, our approach is faster for all RAM sizes tested (see Figure 4.10). For $n = 2^{20}$, our approach achieves a $100 \times$ speedup.

Note it is also possible to use a smaller circuit of size $O(n)$ that just performs a linear scan over the data. However, such a circuit would have to be “hand-crafted,” and would not be output by automated compilation of a binary-search program. Our approach runs faster for large $n$ even when compared to such an implementation (see Figure 4.11). On data of size $n = 2^{20}$, our approach achieves a $5 \times$ speedup even when compared to this “hand-crafted” circuit-based solution.

**Heap.** Besides binary search, we also implemented an oblivious heap data structure (with 32-bit payload, i.e., size of each item). The costs of insertion and extraction for various heap sizes are given in Figures 4.12 and 4.13, respectively. The basic shapes of the performance curves are very similar to those for binary search (except that heap extraction is twice as slow as insertion because two comparisons are needed per level). We observe an $18 \times$ speedup for both heap insertion and heap extraction when the heap size is $2^{20}$.

The speedup of our heap implementation over automated circuits is even greater when the payload is larger. At 512-bit payload, we have a $100 \times$ speedup for data size $2^{20}$. This is due to the extra work incurred from realizing the ORAM mechanism, which grows (in poly-logarithmic scale) with the size of the RAM but is independent of the size of each data item.
4.6.2.2 Faster one-time executions

We present two applications: the Knuth-Morris-Pratt string-matching algorithm (representative of linear-time RAM programs) and Dijkstra’s shortest-path algorithm (representative of super-linear time RAM programs). We compare our approach with a naive program-to-circuit compiler which copies the entire array for every dynamic memory access.

The Knuth-Morris-Pratt algorithm. Alice has a secret string $T$ (of length $n$) while Bob has a secret pattern $P$ (of length $m$) and wants to scan through Alice’s string looking for this pattern. The original KMP algorithm runs in $O(n + m)$ time when $T$ and $P$ are in plaintext. Our compiler compiles an implementation of KMP into a secure string matching protocol preserving its linear efficiency up to a polylogarithmic factor (due to the ORAM technique).

We assume the string $T$ and the pattern $P$ both consist of 16-bit characters.
The recursion factor of the ORAM is set to 16. Figure 4.14 and 4.15 show our results compared to those when a circuit-model compiler is used. From Figure 4.14, we can observe that our approach is slower than the circuit-based approach on small datasets, since the overhead of the ORAM protocol dominates in such cases. However, the circuit-based approach’s running time increases more quickly as the dataset’s size increases. When \( m = 50 \) and \( n = 2 \times 10^6 \), our program runs \( 21 \times \) faster.
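For reference, a plaintext version of KMP matching is sketched below. It is a textbook formulation with a hypothetical function name, not the program compiled in the experiments (whose exact variant is not given here); its purpose is to show the data-dependent accesses into the failure table that a RAM-model protocol must hide.

```python
def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # failure[q] is the length of the longest proper prefix of pattern[:q + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = failure[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        failure[q] = k
    # Scan the text; k is the number of pattern characters currently matched.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and pattern[k] != ch:
            k = failure[k - 1]      # index depends on both secret inputs
        if pattern[k] == ch:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

The indices used for `pattern[k]` and `failure[k - 1]` depend on the contents of both strings, which is why a circuit-model compiler must scan the whole array on each such access.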

**Dijkstra’s algorithm.** Here Alice has a secret graph while Bob has a secret source/destination pair and wishes to compute the shortest distance between them. Compiling from a standard Dijkstra shortest-path algorithm, we obtain an \( O(n^2 \log^3 n) \)-overhead RAM-model protocol.

In our experiment, Alice’s graph is represented by an \( n \times n \) adjacency matrix (of 32-bit integers) where \( n \) is the number of vertices in the graph. The distances associated with the edges are denoted by 32-bit integers. We set the ORAM recursion factor to 8. The results (Figure 4.16) show that our scheme runs faster for all sizes of graphs tested. As the performance of our protocol is barely noticeable in Figure 4.16, the performance gap between the two protocols for various $n$ is explicitly plotted in Figure 4.17. Note that the shape of the speedup curve is roughly quadratic.

**Aggregation over sliding windows.** Alice has a key-value table, and Bob has a (size-$n$) array of keys. The secure computation task is the following: for every size-$k$ window on the key array, look up the $k$ values corresponding to Bob’s $k$ keys within the window, and output the minimum value. Our compiler outputs an $O(n \log^3 n)$ protocol to accomplish the task. The optimized protocol performs significantly better, as shown in Figures 4.18 and 4.19 (we fixed the window size $k$ to 1000 and set the recursion factor to 8, while varying the dataset from 1 to 6 million pairs).
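The functionality computed in this task can be stated in plaintext as follows. This sketch specifies only the input/output behaviour, with a hypothetical function name; it is a naive $O(nk)$ formulation and does not reflect the structure of the compiled protocol. It also assumes every key of Bob's appears in Alice's table.

```python
def sliding_window_min(table: dict[int, int], keys: list[int], k: int) -> list[int]:
    """For each size-k window of keys, return the minimum of the looked-up values."""
    # In the secure version this lookup is a secret-indexed access into Alice's table.
    values = [table[key] for key in keys]
    return [min(values[start:start + k]) for start in range(len(keys) - k + 1)]
```

With `table = {1: 10, 2: 5, 3: 7}`, `keys = [1, 2, 3, 1]`, and `k = 2`, the looked-up values are `[10, 5, 7, 10]` and the result is `[5, 5, 7]`.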
4.6.3 Comparison with RAM-SC Baselines

**Benefits of instruction-trace obliviousness.** The RAM-SC technique of Gordon et al. [38], described in Section 4.2, uses a universal next-instruction circuit to hide the program counter and the instructions executed. Each instruction involves ORAM operations for instruction and data fetches, and the next-instruction circuit must effectively execute all possible instructions and use an $n$-to-1 multiplexer to select the right outcome. Despite the lack of concrete implementation for their general approach, we show through back-of-the-envelope calculations that our approach should be orders-of-magnitude faster.

Consider the problem of binary search over a 1-million item dataset: in each iteration, there are roughly 10 instructions to run, hence 200 instructions in total to complete the search. To run every instruction, a universal-circuit-based implementation has to execute every possible instruction defined in its instruction set. Even
if we conservatively assume a RISC-style instruction set, we would require over 9 million (non-free) binary gates to execute just a memory read/write over a 512M bit RAM. Plus, an extra ORAM read is required to obliviously fetch every instruction. Thus, at least a total of 3600 million binary gates are needed, which is more than 20 times slower than our result exploiting instruction trace obliviousness. Furthermore, notice that binary search is merely a case where the program traces are very short (with only logarithmic length). Due to the overwhelming cost of ORAM read/write
instructions, we stress that the performance gap will be much greater for programs that have relatively fewer memory read/write instructions (in binary search, by comparison, 1 out of every 10 instructions is a memory read instruction).
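The back-of-the-envelope estimate above can be reproduced with a few lines of arithmetic. This is one reading that matches the stated totals: the function name is hypothetical, and charging the instruction fetch the same 9 million gates as a data access is an assumption made here to arrive at the 3600 million figure.

```python
import math


def universal_circuit_gates(items: int, instr_per_iter: int, gates_per_oram_access: int) -> int:
    """Estimated non-free gates for binary search under a universal next-instruction circuit."""
    iterations = math.ceil(math.log2(items))   # binary search halves the range each step
    instructions = iterations * instr_per_iter
    # Each instruction pays for one ORAM data access inside the universal circuit
    # plus one ORAM read to fetch the instruction itself (assumed equal cost).
    return instructions * 2 * gates_per_oram_access


total = universal_circuit_gates(items=1_000_000, instr_per_iter=10, gates_per_oram_access=9_000_000)
```

Here `iterations` is 20, giving 200 instructions and a total of 3,600,000,000 gates, in line with the figures quoted in the text.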

**Benefits of memory-trace obliviousness.** In addition to avoiding the overhead of a next-instruction circuit, SCVM avoids the overhead of storing all arrays in a single, large ORAM. Instead, SCVM can store some arrays as non-oblivious secret-shared memory, and others in separate ORAM banks, rather than one large ORAM. Doing so does not compromise security because the type system ensures memory-trace obliviousness. Here we assess the advantages of these optimizations by comparing against SCVM programs compiled without the optimizations enabled. The results for two applications are given in Figures 4.20 and 4.21.

Figure 4.20: SCVM’s savings by memory-trace obliviousness optimization (inverse permutation). The non-linearity (around 60) of the curve is due to the increase of the ORAM recursion level at that point.

Figure 4.21: Savings by memory-trace obliviousness optimization (Dijkstra)

- **Inverse permutation.** Consider a permutation of size $n$, represented by an array $a$ of $n$ distinct numbers from 1 to $n$, i.e., the permutation maps the $i$-th object to the $a[i]$-th object. One common computation would be to compute its inverse, e.g., to do an inverse table lookup using secret indices. The inverse permutation (with result stored in array $b$) can be computed with the loop:

$$\text{while } (i < n) \{ \ b[a[i]]=i; \ i=i+1; \}$$

The memory-trace obliviousness optimization automatically identifies that the array $a$ does not need to be put in ORAM even though its content should remain secret (because the access pattern to $a$ is entirely publicly known). This yields 50% savings, which is corroborated by our experimental results (Figure 4.20).

- **Dijkstra’s shortest path.** We discussed the advantages of memory-trace obliviousness in Section 4.3 with respect to Dijkstra’s algorithm. Our experiments show that we consistently save $15 \sim 20\%$ for all graph sizes. The savings rates for smaller graphs are in fact higher, even though this is barely noticeable in the chart because of the fast (super-quadratic) growth of the overall cost.
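The access patterns in the inverse-permutation loop discussed above can be made explicit with a small plaintext sketch. The function name and the trace lists are illustrative additions, and the sketch uses 0-based values (a permutation of $0, \dots, n-1$) rather than the 1-to-$n$ convention of the text.

```python
def inverse_permutation(a: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Compute b with b[a[i]] = i, recording the indices used to access a and b."""
    n = len(a)
    b = [0] * n
    a_trace, b_trace = [], []
    i = 0
    while i < n:
        a_trace.append(i)       # index into a is the public loop counter
        b_trace.append(a[i])    # index into b depends on the secret contents of a
        b[a[i]] = i
        i = i + 1
    return b, a_trace, b_trace
```

For any input of length $n$, `a_trace` is always `[0, 1, ..., n-1]`, so `a` can live in non-oblivious secret-shared memory, whereas `b_trace` reveals the permutation itself and `b` therefore needs an ORAM.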

4.7 Conclusions

We describe the SCVM system as the first automated approach for RAM-model secure computation. In the next chapter, we will extend SCVM to ObliVM with richer programming features to improve expressive power, ease of programming, and performance. Directions for future work include extending our framework to support malicious security; applying orthogonal techniques (e.g., [20]) to ensure correctness of the compiler; incorporating other cryptographic backends into our framework; and adding additional language features such as higher-dimensional arrays and structured data types.
Chapter 5: ObliVM: A Programming Framework for Secure Computation

In Chapter 4, we have seen that the trace oblivious approach and the SCVM system can improve the performance of RAM-model secure computation. In this chapter, we extend SCVM to build a programming framework that makes secure computation both easy to program and practically efficient.

Architecting a system framework for secure computation presents numerous challenges. First, the system must allow non-specialist programmers without security expertise to develop applications. Second, efficiency is a first-class concern in the design space, and scalability to big data is essential in many interesting real-life applications. Third, the framework must be reusable: expert programmers should be able to easily extend the system with rich, optimized libraries or customized cryptographic protocols, and make them available to non-specialist application developers.

We design and build ObliVM, a system framework for automated secure multiparty computation. ObliVM is designed to allow non-specialist programmers to write programs much as they do today, and our ObliVM compiler compiles the program to an efficient secure computation protocol. To this end, ObliVM offers a domain-specific language that is intended to address a fundamental representation gap,
namely, secure computation protocols (and other branches of modern cryptography) rely on circuits as an abstraction of computation, whereas real-life developers write programs instead. In architecting ObliVM, our main contribution is the design of programming support and compiler techniques that facilitate such program-to-circuit conversion while ensuring maximal efficiency. Presently, our framework assumes a semi-honest two-party protocol in the back end. To demonstrate an end-to-end system, we chose to implement an improved Garbled Circuit protocol as the back end, since it is among the most practical protocols to date. Our ObliVM framework, including source code and demo applications, will be open-sourced on our project website http://www.oblivm.com.

This chapter is based on a paper that I co-authored with Yan Huang, Kartik Nayak, Elaine Shi, and Xiao Shaun Wang [60]. I designed the novel programming language and implemented the programming frameworks. I developed the compiler to support the new language, which emits code that is runnable on a new secure computation backend designed and implemented by Yan Huang and Xiao Shaun Wang. I conducted experiments to show the compiler’s effectiveness with the help of Xiao Shaun Wang.

5.1 ObliVM Overview and Contributions

In designing and building ObliVM, we make the following contributions.

Programming abstractions for oblivious algorithms. The most challenging part of ensuring a program’s obliviousness is memory-trace obliviousness; therefore, our discussions below will focus on memory-trace obliviousness. A straightforward approach (henceforth referred to as the generic ORAM baseline) is to provide an Oblivious RAM (ORAM) abstraction, and require that all arrays (whose access patterns depend on secret inputs) be stored and accessed via ORAM. This approach, which was effectively taken by SCVM [59], is generic, but does not necessarily yield the most efficient oblivious implementation for each specific program.

At the other end of the spectrum, a line of research has focused on customized oblivious algorithms for special tasks (sometimes also referred to as circuit structure design). For example, efficient oblivious algorithms have been demonstrated for graph algorithms [14, 37], machine learning algorithms [70, 71], and data structures [51, 66, 92]. The customized approach can outperform generic ORAM, but is extremely costly in terms of the amount of cryptographic expertise and time consumed.

ObliVM aims to achieve the best of both worlds by offering oblivious programming abstractions that are both user- and compiler-friendly. These programming abstractions are high-level programming constructs that can be understood and employed by non-specialist programmers without security expertise. Behind the scenes, ObliVM translates programs written in these abstractions into efficient oblivious algorithms that outperform generic ORAM. When oblivious programming abstractions are not applicable, ObliVM falls back to employing ORAM to translate programs to efficient circuit representations. Presently, ObliVM offers the following oblivious programming abstractions: MapReduce abstractions, abstractions for oblivious data structures, and a new loop coalescing abstraction which enables novel
oblivious graph algorithms. We remark that this is by no means an exhaustive list of possible programming abstractions that facilitate obliviousness. It would be exciting future research to uncover new oblivious programming abstractions and incorporate them into our ObliVM framework.

**An expressive programming language.** ObliVM offers an expressive and versatile programming language called ObliVM-Lang. When designing ObliVM, we have the following goals.

- Non-specialist application developers find the language intuitive.

- Expert programmers should be able to extend our framework with new features. For example, an expert programmer should be able to introduce new, user-facing oblivious programming abstractions by embedding them as libraries in ObliVM (see Section 5.3.2 for an example).

- Expert programmers can implement even low-level circuit libraries directly atop ObliVM-Lang. Recall that unlike a programming language in the traditional sense, here the underlying cryptography fundamentally speaks only of AND and XOR gates. Even basic instructions such as addition, multiplication, and ORAM accesses must be developed from scratch by an expert programmer. In most previous frameworks, circuit libraries for these basic operations are developed in the back end. ObliVM, for the first time, allows the development of such circuit libraries in the source language, greatly reducing programming complexity. Section 5.4.1 demonstrates case studies for implementing basic arithmetic operations and Circuit ORAM atop our source
language ObliVM-Lang.

- Expert programmers can implement customized protocols in the back end (e.g., faster protocols for performing big integer operations or matrix operations), and export these customized protocols to the source language as native types and native functions.

To simultaneously realize these aforementioned goals, we need a much more powerful and expressive programming language than any existing language for secure computation [44, 54, 59, 75, 98]. Our ObliVM-Lang extends the SCVM language presented in Chapter 4 and offers new features such as phantom functions, generic constants, random types, as well as native types and functions. We will show why these language features are critical for implementing oblivious programming abstractions and low-level circuit libraries.

**Additional architectural choices.** ObliVM also allows expert programmers to develop customized cryptographic protocols (not necessarily based on Garbled Circuit) in the back end. These customized back end protocols can be exposed to the source language through native types and native function calls, making them immediately reusable by others. Section 5.5.1 describes an example where an expert programmer designs a customized protocol for BigInteger operations using additively-homomorphic encryption. The resulting BigInteger types and operations can then be exported into our source language ObliVM-Lang.
5.1.1 Applications and Evaluation

ObliVM’s easy programmability allowed us to develop a suite of libraries and applications, including streaming algorithms, data structures, machine learning algorithms, and graph algorithms. These libraries and applications will be shipped with the ObliVM framework. Our application-driven evaluation suggests the following results:

**Efficiency.** We use ObliVM’s user-facing programming abstractions to develop a suite of applications. We show that over a variety of benchmarking applications, the resulting circuits generated by ObliVM can be orders of magnitude smaller than the generic ORAM baseline (assuming that the state-of-the-art Circuit ORAM [90] is adopted for the baseline) under moderately large data sizes. We also compare our ObliVM-generated circuits with hand-crafted designs, and show that for a variety of applications, our auto-generated circuits are only 0.5% to 2% bigger in size than oblivious algorithms hand-crafted by human experts.

**Development effort.** We give case studies to show how ObliVM greatly reduces the development effort and expertise needed to create applications over secure computation.

**New oblivious algorithms.** We describe a few new oblivious algorithms that we uncover during this process of programming language and algorithms co-design. Specifically, we demonstrate new oblivious graph algorithms including oblivious Depth-First-Search for dense graphs, oblivious shortest path for sparse graphs, and an oblivious minimum spanning tree algorithm.
5.1.2 Threat Model, Deployment, and Scope

**Deployment scenarios and threat model.** ObliVM is designed for the same deployment scenarios under the same threat model as SCVM.

**Scope.** A subset of ObliVM’s source language ObliVM-Lang has a security type system which, roughly speaking, ensures that the program’s execution traces are independent of secret inputs [58,59].

ObliVM-Lang’s type system is further extended to support reasoning about the declassification of *random numbers*, providing principled guidance on how developers should use random numbers properly while enforcing security. This extended type system, however, does not guarantee the security of random number usage. We argue that this extended type system is still useful in capturing subtle bugs in implementations such as oblivious data structures. We leave developing a sound and complete type system to handle random numbers as future work.

By designing a new language, ObliVM does not directly retrofit legacy code. Such a design choice maximizes opportunities for compile-time optimizations. We note, however, that in subsequent work joint with our collaborators [39], we have implemented a MIPS CPU in ObliVM, which can securely evaluate standard MIPS instructions in a way that leaks only the termination channel (i.e., total runtime of the program) – this secure MIPS CPU essentially provides backward compatibility atop ObliVM whenever needed.
5.2 Programming Language and Compiler

As mentioned earlier, we wish to design a powerful source language ObliVM-Lang such that an expert programmer can i) develop oblivious programming abstractions as libraries and offer them to non-specialist programmers; and ii) implement low-level circuit gadgets atop ObliVM-Lang.

ObliVM-Lang builds on top of the recent SCVM IR as described in Chapter 4—the only known language to date that supports ORAM abstractions, and therefore offers scalability to big data. In this section, we will describe new features that ObliVM-Lang offers and explain intuitions behind our security type system.

As compelling applications of ObliVM-Lang, in Section 5.3, we give concrete case studies and show how to implement oblivious programming abstractions and low-level circuit libraries atop ObliVM-Lang.

5.2.1 Language features for expressiveness and efficiency

Security labels. Except for the new random type introduced in Section 5.2.2, all other variables and arrays are either of a public or secure type. secure variables are secret-shared between the two parties such that neither party sees the value. public variables are observable by both parties. Arrays can be publicly or secretly indexable. For example,

- **secure int10[public 1000] keys**: The array contents are secret, but indices into the array must be public. This array will be secret-shared but not placed in an ORAM.

- **secure int10[secure 1000] keys**: This array will be placed in a secret-shared ORAM, and we allow secret indices into the array.

**Standard features.** ObliVM-Lang allows programmers to use C-style keyword `struct` to define *record types*. It also supports *generic types* similar to templates in C++. For example, a binary tree with public topological structure but secret per-node data can be defined without using pointers (assuming its capacity is 1000 nodes):

```c
struct KeyValueTable<T> {
    secure int10[public 1000] keys;
    T[public 1000] values;
};
```

In the above, the type `int10` means that its value is a 10-bit signed integer. Each element in the array `values` has a generic type `T` similar to C++ templates. ObliVM-Lang assumes data of type `T` to be secret-shared. In the future, we will improve the compiler to support public generic types.

**Generic constants.** Besides generic types, ObliVM-Lang also supports *generic constants* to further improve reusability. Let us consider the following tree example:

```c
struct TreeNode@m<T> {
    public int@m key;
    T value;
    public int@m left, right;
};
struct Tree@m<T> {
    TreeNode<T>[public (1<<m)-1] nodes;
    public int@m root;
};
```
This code defines a binary search tree implementation of a key-value store, where keys are \( m \)-bit integers. The *generic constant* \( @m \) is a variable whose value will be instantiated to a constant. It hints that \( m \) bits are enough to represent all the position references to the array. The type \( \text{int}@m \) refers to an integer type with \( m \) bits. Further, the capacity of array \( \text{nodes} \) can be determined by \( m \) as well (i.e. \((1<<m)-1\)).

Note that Zhang et al. [98] also allow specifying the length of an integer, but require this length to be a hard-coded constant – this necessitates modification and recompilation of the program for different inputs. ObliVM-Lang’s generic constant approach eliminates this constraint, and thus improves reusability.

**Functions.** ObliVM-Lang allows programmers to define functions. For example, following the Tree defined as above, programmers can write a function to search the value associated with a given key in the tree as follows:

```java
1  T Tree@m<T>.search(public int @m key) {
2       public int @m now = this.root, tk;
3       T ret;
4       while (now != -1) {
5           tk = this.nodes[now].key;
6           if (tk == key)
7               ret = this.nodes[now].value;
8           if (tk <= key)
9               now = this.nodes[now].right;
10          else
11               now = this.nodes[now].left;
12       }
13       return ret;
14  };
```

This function is a method of a Tree object; it takes a key as input and returns a value of type $T$. The function body defines three local variables: `now` and `tk` of type `public int@m`, and `ret` of type $T$. The definition of a local variable (e.g. `now`) can be accompanied by an optional initialization expression (e.g. `this.root`). When a variable (e.g. `ret` or `tk`) is not initialized explicitly, it is initialized to a default value depending on its type.

The rest of the function is standard, C-like code, except that ObliVM-Lang requires exactly one return statement at the bottom of a function whose return type is not `void`. We highlight that ObliVM-Lang allows arbitrary looping on a public guard (e.g. line 4) without loop unrolling, which cannot be compiled in previous loop-elimination-based work [13, 43, 44, 55, 65, 98].
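The control flow of `search` can be mirrored in plaintext to see how the array-based tree is traversed. This is only an illustrative transliteration: the class layout follows the `TreeNode` and `Tree` structs above, `None` stands in for the type's default value, and nothing in it models secret sharing or the `@m` bit widths.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TreeNode:
    key: int
    value: Any
    left: int = -1     # index of the left child in the nodes array, -1 if absent
    right: int = -1    # index of the right child in the nodes array, -1 if absent


@dataclass
class Tree:
    nodes: list
    root: int = -1

    def search(self, key: int) -> Optional[Any]:
        now = self.root
        ret = None                      # stand-in for the default value of type T
        while now != -1:                # loop guard depends only on public data
            tk = self.nodes[now].key
            if tk == key:
                ret = self.nodes[now].value
            if tk <= key:
                now = self.nodes[now].right
            else:
                now = self.nodes[now].left
        return ret                      # single return at the bottom of the function
```

A small usage example: with `Tree(nodes=[TreeNode(5, "five", 1, 2), TreeNode(3, "three"), TreeNode(8, "eight")], root=0)`, searching for key 3 visits nodes 0 and 1 and returns `"three"`, while searching for key 4 returns `None`.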

**Function types.** Programmers can define a variable to have function type, similar to function pointers in C. To avoid the complexity of handling arbitrary higher order functions, the input and return types of a function type must not be function types. Further, generic types cannot be instantiated with function types.

**Native primitives.** ObliVM-Lang supports native types and native functions. For example, ObliVM-Lang’s default back end implementation is ObliVM-SC, which is implemented in Java. Suppose an alternative BigInteger implementation in ObliVM-SC (e.g., using additively homomorphic encryption) is available in a Java class called `BigInteger`. Programmers can define

```plaintext
typedef BigInt@m = native BigInteger;
```

Suppose that this class supports four operations: `add`, `multiply`, `fromInt` and `toInt`, where the first two operations are arithmetic operations and last two operations are used to convert between Garbled Circuit-based integers and HE-based
integers. We can expose these to the source language by declaring:

```plaintext
BigInt@m BigInt@m.add(BigInt@m x, BigInt@m y) = native BigInteger.add;

BigInt@m BigInt@m.multiply(BigInt@m x, BigInt@m y) = native BigInteger.multiply;

BigInt@m BigInt@m.fromInt(int@m y) = native BigInteger.fromInt;

int@m BigInt@m.toInt(BigInt@m y) = native BigInteger.toInt;
```

5.2.2 Language features for security

The key requirement of ObliVM-Lang is that a program’s execution traces will not leak information. These execution traces include a memory trace, an instruction trace, a function stack trace, and a declassification trace. The trace definitions are similar to those of SCVM in Chapter 4, and we develop a security type system for ObliVM-Lang, similar to the one for SCVM, to enforce trace obliviousness.

In addition, ObliVM-Lang provides an extended type system that imposes further constraints on how random numbers and functions should be used to achieve security. This type system extension does not enforce formal security, but it provides useful hints to capture subtle bugs, e.g., when implementing oblivious data structures.

**Random numbers and implicit declassifications.** Many oblivious programs such as ORAM and oblivious data structures crucially rely on randomness. In particular, their obliviousness guarantee has the following nature: the joint distribution of memory traces is identical regardless of secret inputs (these algorithms typically have a cryptographically negligible probability of correctness failure). ObliVM-Lang supports reasoning about such “distributional” trace-obliviousness by providing *random types* associated with an affine type system. For instance, *rnd32* is the type of a 32-bit random integer. A random number will always be secret-shared between the two parties.

To generate a random number, there is a built-in function *RND* with the following signature:

\[ \texttt{rnd@m RND(public int32 m)} \]

This function takes a public 32-bit integer \( m \) as input, and returns \( m \) random bits. Note that \( \text{rnd}@\text{m} \) is a *dependent type*, whose type depends on values, i.e. \( m \). To avoid the complexity of handling general dependent types, the ObliVM-Lang compiler restricts the usage of dependent types to only this built-in function, and handles it specially.

In our ObliVM framework, outputs of a computation can be explicitly declassified with special syntax. Random numbers, in addition, are allowed *implicit declassification* by assigning them to public variables. Here “implicit” means that the declassification is not a specified outcome of the computation.

For security, we must ensure that each random number is implicitly declassified *at most once*, for the following reason. When a random number is implicitly declassified, both parties observe it as part of the trace. Now consider the following example where \( s \) is a secret variable.

In this program, random variables $r_1$ and $r_2$ are initialized in Line 1; these variables are assigned a fresh, random value upon initialization. Up to Line 4, random variables $r_1$ and $r_2$ are each declassified no more than once. Line XX, however, could potentially cause $r_2$ to be declassified more than once. Line XX is clearly not secure, since in this case the observable public variables $y$ and $z$ could be correlated, depending on which secret branch was taken earlier.

Therefore, we use an affine type system to ensure that each random variable is implicitly declassified at most once. This way, each time a random variable is implicitly declassified, it introduces an independent, uniformly distributed variable to the observable trace. In our security proof, a simulator can just sample this random number to simulate the trace.
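The affine discipline can be illustrated with a small checker. The following is only a sketch in Python, not ObliVM-Lang's type checker: the statement encoding, the names `check` and `AffineError`, and the two sample programs are assumptions made for illustration. The program shape (a secret conditional that declassifies $r_1$ or $r_2$ into $y$, followed by a later declassification of $r_2$ into $z$) is one reading of the example described above.

```python
class AffineError(Exception):
    """Raised when a random variable may be declassified more than once."""


def check(stmts, consumed=None):
    """Return the set of random variables that may have been declassified.

    Statements use a hypothetical mini-syntax invented for this sketch:
      ("declass", pub, rnd)           -- pub = rnd, an implicit declassification
      ("if", then_stmts, else_stmts)  -- a conditional on a secret guard
    """
    consumed = set() if consumed is None else set(consumed)
    for stmt in stmts:
        if stmt[0] == "declass":
            _, _pub, rnd = stmt
            if rnd in consumed:
                raise AffineError(f"{rnd} may be declassified twice")
            consumed.add(rnd)
        elif stmt[0] == "if":
            _, then_stmts, else_stmts = stmt
            # Check each branch from the same starting state, then take the
            # union: a variable used in either branch counts as used.
            used_then = check(then_stmts, consumed)
            used_else = check(else_stmts, consumed)
            consumed = used_then | used_else
        else:
            raise ValueError(f"unknown statement {stmt[0]!r}")
    return consumed


# if (s) y = r1 else y = r2: each variable is declassified at most once.
ok_program = [("if", [("declass", "y", "r1")], [("declass", "y", "r2")])]
# Followed later by z = r2: r2 may now be declassified a second time.
bad_program = ok_program + [("declass", "z", "r2")]
```

Here `check(ok_program)` succeeds, while `check(bad_program)` raises `AffineError`. The check is conservative: it rejects the second program even on executions where the secret branch happened to use $r_1$.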

It turns out that the above example reflects the essence of what is needed to implement oblivious RAM and oblivious data structures in our source language. We refer the readers to Sections 5.3 and 5.4.2 for details.

**Function calls and phantom functions.** A straightforward idea to prevent stack behavior from leaking information is to enforce function calls in a public context. Then the requirement is that each function’s body must satisfy memory- and instruction-trace obliviousness. Further, by defining native functions, ObliVM-Lang implicitly assumes that their implementations satisfy memory- and instruction-trace obliviousness.

Beyond this basic idea, ObliVM-Lang goes a step further and enables function calls within a secret if-statement by introducing the notion of phantom functions. The idea is that each function can be executed in dual modes, a real mode and a phantom mode. In the real mode, all statements are executed normally with real computation and real side effects. In the phantom mode, the function execution merely simulates the memory traces of the real mode; no side effects take place; and the phantom function call returns a secret-shared default value of the specified return type. This is similar to padding ideas used in several previous works [6, 78].

We will illustrate the use of phantom function with the following prefixSum example. The function prefixSum(n) accesses a global integer array a, and computes the prefix sum of the first n + 1 elements in a. After accessing each element (Line 3), the element in array a will be set to 0 (Line 4).

```
1 phantom secure int32 prefixSum
2   (public int32 n) {
3     secure int32 ret=a[n];
4     a[n]=0;
5     if (n != 0) ret = ret+prefixSum(n-1);
6     return ret;
7   }
```

The keyword phantom indicates that the function prefixSum is a phantom function.

Consider the following code that calls the phantom function:

```
if (s) then x = prefixSum(n);
```

To ensure security, prefixSum will always be called regardless of whether $s$ is true or false. When $s$ is false, however, it must be guaranteed that (1) elements in array $a$ will not be set to 0; and (2) the function generates traces with the same probability as when $s$ is true. To this end, the compiler will generate target code with the following signature:

$$\text{prefixSum}(idx, \text{indicator})$$

where $\text{indicator}$ specifies whether the function is called in real mode ($\text{indicator}$ is false) or phantom mode ($\text{indicator}$ is true). To achieve the first goal, the global variable will be modified only if $\text{indicator}$ is false. The compiler will compile the code in line 4 into the following pseudo-code:

$$a[idx] = \text{mux}(0, a[idx], \text{indicator});$$

It is easy to see that all instructions will be executed, and thus the generated traces are identical regardless of the value of $\text{indicator}$. Note that such a function is not implementable in any prior loop-unrolling-based compiler, since $n$ is provided only at runtime.
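The dual-mode behavior can be modeled in a few lines of Python. This is a plaintext sketch under stated assumptions, not the code the ObliVM compiler emits: `mux(a, b, sel)` is assumed to return `a` when `sel` is false and `b` when it is true (consistent with the pseudo-code above), the explicit `trace` list stands in for the observable memory trace, and `call_under_secret_guard` is a hypothetical helper showing how `if (s) then x = prefixSum(n)` could be lowered.

```python
def mux(a, b, sel):
    """Return a when sel is False and b when sel is True.

    A real back end would evaluate this as a circuit; the conditional
    expression here only models which value is selected.
    """
    return b if sel else a


def prefix_sum(a, n, phantom, trace):
    """Model of the compiled prefixSum(idx, indicator).

    `phantom` plays the role of `indicator`. Every index touched is
    appended to `trace`, so the trace can be compared across modes.
    """
    trace.append(n)                  # a[n] is read and written in both modes
    ret = a[n]
    a[n] = mux(0, a[n], phantom)     # cleared only in real mode
    if n != 0:                       # public guard, safe to branch on
        ret = ret + prefix_sum(a, n - 1, phantom, trace)
    return mux(ret, 0, phantom)      # phantom mode returns a default value


def call_under_secret_guard(s, x, a, n, trace):
    """Model of `if (s) then x = prefixSum(n);` for a secret guard s."""
    result = prefix_sum(a, n, not s, trace)   # always called
    return mux(x, result, s)                  # x changes only if s is true
```

For `a = [1, 2, 3, 4]` and `n = 2`, both modes touch indices 2, 1, 0 in the same order; only the real mode clears the elements and returns their sum.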

It is worth noting that phantom functions relax the restriction imposed by previous memory trace oblivious type systems [58], which do not allow looping in the secure context (i.e. within a secret conditional). The main difficulty in previous systems was to quantify the number of loop iterations in the two branches of an if-statement, and to enforce that the two numbers are the same. Phantom functions remove the need for this analysis by executing both branches, with one branch executed in real mode and the other in phantom mode. As long as an adversary is unable to distinguish a real execution from a phantom one, the secret guard of the if-statement will not be leaked, even when loops are virtually present (i.e. in a phantom function).

5.3 User-Facing Oblivious Programming Abstractions

Programming abstractions such as MapReduce and GraphLab have been popularized in the parallel computing domain. In particular, programs written for a traditional sequential programming paradigm are difficult to parallelize automatically by an optimizing compiler. These new paradigms are not only easy for users to understand and program with, but also provide insights on the structure of the problem, and facilitate parallelization in an automated manner.

In this section, we take a similar approach to oblivious programming. The idea is to develop oblivious programming abstractions that can be easily understood and consumed by non-specialist programmers, and that our compiler can compile into efficient oblivious algorithms. In comparison, if these programs were written in a traditional imperative-style programming language like C, compile-time optimizations would have been much more limited.

5.3.1 MapReduce Programming Abstractions

An interesting observation is that “parallelism facilitates obliviousness” [24, 36]. If a program (or part of a program) can be efficiently expressed in parallel programming paradigms such as MapReduce and GraphLab [2, 64] (with a few additional constraints), there is an efficient oblivious algorithm to compute this task.
We stress that in this paper, we consider MapReduce merely as a programming abstraction that facilitates obliviousness – in reality we compile MapReduce programs to sequential implementations that run on a single thread. Parallelizing the algorithms is outside the scope of this work.

**Background: Oblivious algorithms for streaming MapReduce.** A *streaming* MapReduce program consists of two basic operations, map and reduce.

- The **map** operation: takes an array denoted \( \{ \alpha_i \}_{i \in [n]} \) where each \( \alpha_i \in \mathcal{D} \) for some domain \( \mathcal{D} \), and a function \( \text{mapper} : \mathcal{D} \rightarrow K \times V \). Now \( \text{map} \) would apply \( (k_i, v_i) := \text{mapper}(\alpha_i) \) to each \( \alpha_i \), and output an array of key-value pairs \( \{(k_i, v_i)\}_{i \in [n]} \).

- The **reduce** operation: takes in an array of key-value pairs denoted \( \{(k_i, v_i)\}_{i \in [n]} \) and a function \( \text{reducer} : K \times V^2 \rightarrow V \). For every unique key value \( k \) in this array, let \( (k, v_{i_1}), (k, v_{i_2}), \ldots (k, v_{i_m}) \) denote all occurrences with the key \( k \). Now the \( \text{reduce} \) operation applies the following operation in a streaming fashion:

\[
R_k := \text{reducer}(k, \ldots \text{reducer}(k, \text{reducer}(k, v_{i_1}, v_{i_2}), v_{i_3}), \ldots, v_{i_m})
\]

The result of the \( \text{reduce} \) operation is an array consisting of a pair \( (k, R_k) \) for every unique \( k \) value in the input array.

Goodrich and Mitzenmacher [36] observe that any program written in a streaming MapReduce abstraction can be converted to efficient oblivious algorithms, and they leverage this observation to aid the construction of an ORAM scheme.

Figure 5.1: **Streaming MapReduce in ObliVM-Lang.** See Section 5.3.1 for oblivious algorithms for the streaming MapReduce paradigm [36].

• The map operation is inherently oblivious, and can be done by making a linear scan over the input array.

• The reduce operation can be made oblivious through an oblivious sorting (denoted o-sort) primitive.

  – First, o-sort the input array in ascending order of the key, such that all pairs with the same key are grouped together.

  – Next, in a single linear scan, apply the reducer function: i) If this is the last key-value pair for some key $k$, write down the result of the aggregation $(k, R_k)$. ii) Else, write down a dummy entry $\bot$.

  – Finally, o-sort all the resulting entries to move $\bot$ to the end.
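The steps above can be sketched in Python as follows. This is an illustrative plaintext model, not a transcription of Figure 5.1: the names `o_sort`, `DUMMY`, and `streaming_map_reduce` are hypothetical, and Python's `sorted` is used only as a stand-in for an oblivious sorting network (it is not itself oblivious). In an oblivious implementation, the data-dependent conditionals in the scan would be evaluated with multiplexers rather than branches.

```python
DUMMY = None  # stands in for the dummy entry ⊥


def o_sort(items, key):
    """Placeholder for an oblivious sort.

    Assumption of this sketch: `sorted` is NOT oblivious; a real
    implementation would use a sorting network with a fixed access pattern.
    """
    return sorted(items, key=key)


def streaming_map_reduce(data, mapper, reducer):
    # Map phase: a single linear scan over the input.
    pairs = [mapper(x) for x in data]

    # First o-sort: group pairs with equal keys together.
    pairs = o_sort(pairs, key=lambda kv: kv[0])

    # Reduce phase: one linear scan that writes exactly one entry per
    # input pair, either the aggregate (k, R_k) or a dummy.
    out = []
    acc = None
    for i, (k, v) in enumerate(pairs):
        first = i == 0 or pairs[i - 1][0] != k
        acc = v if first else reducer(k, acc, v)
        last = i == len(pairs) - 1 or pairs[i + 1][0] != k
        out.append((k, acc) if last else DUMMY)

    # Second o-sort: move dummies to the end (the sort is stable, so the
    # real entries stay in key order).
    return o_sort(out, key=lambda entry: entry is DUMMY)


def histogram(values):
    """Count occurrences of each value via the sketch above."""
    return streaming_map_reduce(
        values,
        mapper=lambda x: (x, 1),
        reducer=lambda k, v1, v2: v1 + v2,
    )
```

For example, `histogram([2, 0, 2, 1, 2])` yields the three real entries `(0, 1)`, `(1, 1)`, `(2, 3)` followed by two dummies. The output length always equals the input length, so it reveals nothing about the number of distinct keys.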

**Providing the streaming MapReduce abstraction in ObliVM.** It is easy to implement the streaming MapReduce abstraction as a library in our source language ObliVM-Lang. The ObliVM-Lang implementation of the streaming MapReduce paradigm is provided in Figure 5.1.

MapReduce has two generic constants, $m$ and $n$, to represent the sizes of the input and output respectively. It also has three generic types: $I$ for the type of inputs, $K$ for the type of output keys, and $V$ for the type of output values. All three types are assumed to be secret.

It takes five inputs: data for the input data, map for the mapper, reduce for the reducer, initialVal for the initial value for the reducer, and cmp to compare two keys of type $K$.

Lines 6-10 are the mapper phase of the algorithm, then line 11 uses the function `sort` to sort the intermediate results based on their keys. After line 11, the intermediate results with the same key are grouped together, and lines 12-29 produce the output of the reduce phase with some dummy outputs. Finally, lines 30-35 use oblivious sort again to eliminate those dummy outputs, and eventually line 36 returns the final results.

Notice that in these functions, there are three arrays, `data`, `d2`, and `res`. The program declares all of them to have only public access patterns, because they are accessed by either a sequential scan or an oblivious sort. In this case, the compiler will not place these arrays into ORAM banks.

**Using MapReduce.** Figure 5.1 needs to be written by an expert developer only once. From then on, an end user can make use of this programming abstraction.

We further illustrate how to use the above MapReduce program to implement a histogram. In SCVM (Chapter 4), a histogram program is as below.

```java
for (public int i=0; i<n; ++i) c[i] = 0;
for (public int i=0; i<m; ++i) c[a[i]] ++;
```

This program counts the frequency of each value in `[0..n − 1]` in the array `a` of size `m`. Since the program makes dynamic memory accesses, the SCVM compiler would decide to put the array `c` inside an ORAM.

An end user can write the same program using a simple MapReduce abstraction as follows. Our ObliVM-Lang compiler would generate target code that relies on oblivious sorting primitives rather than generic ORAM, improving the performance by a logarithmic factor in comparison with the SCVM implementation. In Section 5.5, we show that the practical performance gain ranges from $10 \times$ to $400 \times$.

```
int2 cmp(int32 x, int32 y) {
    int2 r = 0;
    if (x < y) r = -1;
    else if (x > y) r = 1;
    return r;
}
Pair<int32, int32> mapper(int32 x) {
    return Pair<int32, int32>(x, 1);
}
int32 reducer(int32 k, int32 v1, int32 v2) {
    return v1 + v2;
}
```

The top-level program can launch the computation using

```
c=MapReduce@m@n<int32, int32, int32>(a, mapper, reducer, cmp, 0);
```

### 5.3.2 Programming Abstractions for Data Structures

We now explain how to provide programming abstractions for a class of pointer-based oblivious data structures described by Wang et al. [92]. Figure 5.2 gives an example, where an expert programmer provides library support (Figure 5.3) for implementing a class of pointer-based data structures such that a non-specialist programmer can implement data structures which will be compiled to efficient oblivious algorithms that outperform generic ORAM. We stress that while we give a stack example for simplicity, this paradigm is also applicable to other pointer-based data structures, such as AVL tree, heap, and queue.

Figure 5.2: Oblivious stack by non-specialist programmers.

**Implementing oblivious data structure abstractions in ObliVM.** We assume that the reader is familiar with the oblivious data structure algorithmic techniques described by Wang et al. [92]. To support efficient data structure implementations, an expert programmer implements two important objects (see Figure 5.3):

- A **Pointer** object stores two important pieces of information: an **index** variable that stores the logical identifier of the memory block pointed to (each memory block has a globally unique **index**); and a **pos** variable that stores the random leaf label in the ORAM tree of the memory block.

- A **SecStore** object essentially implements an ORAM, and provides the following member functions to an end-user: The **SecStore.remove** function is essentially syntactic sugar for the ORAM’s readAndRemove interface [80,90], and the SecStore.add function is syntactic sugar for the ORAM’s Add interface [80,90]. Finally, the SecStore.allocate function returns a new Pointer object to the caller. This new Pointer object is assigned a globally unique logical identifier (using a counter cnt that is incremented each time), and a fresh, randomly chosen leaf label RND(m).

    rnd@m RND(public int32 m) = native lib.rand;
    struct Pointer@m {
        int32 index;
        rnd@m pos;
    };
    struct SecStore@m<T> {
        CircuitORAM@m<T> oram;
        int32 cnt;
    };
    phantom void SecStore@m<T>.add(int32 index, int@m pos, T data) {
        oram.add(index, pos, data);
    }
    phantom T SecStore@m<T>.readAndRemove(int32 index, rnd@m pos) {
        return oram.readAndRemove(index, pos);
    }
    phantom Pointer@m SecStore@m<T>.allocate() {
        cnt = cnt + 1;
        return Pointer@m(cnt, RND(m));
    }

Figure 5.3: Code by expert programmers to help non-specialists implement oblivious stack.

**Stack implementation by a non-specialist programmer.** Given abstractions provided by the expert programmer, a non-specialist programmer can now implement a class of data structures such as stack, queue, heap, AVL Tree, etc. Figure 5.2 gives a stack example.

**Role of affine type system.** We use Figure 5.3 as an example to illustrate how our rnd types with their affine type system can ensure security. As mentioned earlier, rnd types have an affine type system. This means that each rnd can be declassified (i.e., made public) at most once. In Figure 5.3, the oram.readAndRemove call will declassify its argument `rnd@m pos` inside the implementation of the function body. From an algorithms perspective, this is because the leaf label `pos` will be revealed during the `readAndRemove` operation, incurring a memory trace where the value `rnd@m pos` will be observable by the adversary.

5.3.3 Loop Coalescing and New Oblivious Graph Algorithms

We introduce a new programming abstraction called loop coalescing, and show how this programming abstraction allows us to design novel oblivious graph algorithms such as Dijkstra’s shortest path and minimum spanning tree for sparse graphs. Loop coalescing is non-trivial to embed as a library in ObliVM-Lang. We therefore support this programming abstraction by introducing special syntax and modifications to our compiler. Specifically, we introduce a new syntax called `bounded-for` loop as shown in Figure 5.4. *For succinctness, in this section, we will present pseudo-code.*

In the program in Figure 5.4, the `bwhile(n)` and `bwhile(m)` syntax at Lines 1 and 3 indicate that the outer loop will be executed for a total of `n` times, whereas the inner loop will be executed for a total of `m` times – over all iterations of the outer loop.

Figure 5.4: Loop coalescing. The outer loop will be executed at most $n$ times in total, the inner loop will be executed at most $m$ times in total – over all iterations of the outer loop. A naive compiler would pad the outer and inner loop to $n$ and $m$ respectively, incurring $O(nm)$ cost. Our loop coalescing technique achieves $O(n + m)$ cost instead.

<table>
<thead>
<tr>
<th>Algorithms</th>
<th>Our Complexity</th>
<th>Best Known</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dijkstra’s Algorithm (Sparse Graph)</td>
<td>$O((E + V) \log^2 V)$</td>
<td>$O((E + V) \log^3 V)$ (Generic ORAM baseline [90])</td>
</tr>
<tr>
<td>Prim’s Algorithm (Sparse Graph)</td>
<td>$O((E + V) \log^2 V)$</td>
<td>$O(E \log^3 V)$ for $E = O(V \log^\gamma V)$, $\gamma \geq 0$ [37]</td>
</tr>
<tr>
<td></td>
<td></td>
<td>$O(E \log^3 V)$ for $E = O(V^{2\delta^2})$, $\delta \in (0, 1)$ [37]</td>
</tr>
<tr>
<td></td>
<td></td>
<td>$O(E \log^2 V)$ for $E = \Omega(V^{1+\epsilon})$, $\epsilon \in (0, 1)$ [37]</td>
</tr>
</tbody>
</table>

Table 5.1: Summary of algorithmic results. All costs reported are in terms of circuit size. The asymptotic notation omits the bit-length of each word for simplicity. Our oblivious Dijkstra’s algorithm and oblivious Prim’s algorithm can be composed using our novel loop coalescing programming abstraction and oblivious data structures.

**Algorithm 1** Dijkstra’s algorithm with bounded for

**Secret Input:** $s$: the source node

**Secret Input:** $e$: concatenation of adjacency lists stored in a single ORAM array. Each vertex’s neighbors are stored adjacent to each other.

**Secret Input:** $s[u]$: sum of out-degree over vertices from 1 to $u$.

**Output:** dis: the shortest distance from source to each node

1: dis := [∞, ∞, ..., ∞]  
2: PQ.push(0, s)  
3: dis[s] := 0  
4: bwhile(V)(!PQ.empty())  
5:    (dist, u) := PQ.deleteMin()  
6:    if (dis[u] == dist) then  
7:       dis[u] := −dis[u]  
8:       bfor(E)(i := s[u]; i < s[u + 1]; i := i + 1)  
9:          (u, v, w) := e[i]  
10:          newDist := dist + w  
11:          if (newDist < dis[v]) then  
12:             dis[v] := newDist  
13:             PQ.insert(newDist, v)  

To deal with loop coalescing, the compiler partitions the code within a bounded loop into code blocks, each of which branches at the end. The number of times each code block executes is computed as the bound of the innermost bounded loop that contains the code block. The compiler then transforms a bounded loop into a normal loop whose body simulates a state machine: each state contains a code block, and the branching statement at the end of each code block is translated into an assignment statement that moves the state machine into the next state. The total number of iterations of the emitted normal loop is the sum of the execution counts of all code blocks. Figure 5.4 illustrates this compilation process.
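The state-machine transformation can be illustrated on a small nested loop. The following Python sketch is not compiler output: the task (summing edge weights per vertex over a concatenated adjacency array), the function names, and the prefix array `s` with `s[0] = 0` and `s[n] = m` are assumptions chosen for illustration. In an oblivious execution, both states' code would be evaluated in every iteration and selected with multiplexers; the plain branches below only model the control flow.

```python
def nested_out_weight(s, e, n):
    """Reference version: nested loops with data-dependent inner bounds."""
    total = [0] * n
    for u in range(n):
        for i in range(s[u], s[u + 1]):
            total[u] += e[i]
    return total


def coalesced_out_weight(s, e, n, m):
    """Single loop with a fixed iteration count of 2n + m.

    Each vertex costs one iteration to enter, one per edge, and one to
    detect the end of its edge list, hence 2n + m iterations in total.
    """
    total = [0] * n
    u = -1            # current vertex
    i = 0             # current position in the edge array
    inner = False     # state variable: False = outer block, True = inner block
    for _ in range(2 * n + m):
        if not inner:
            u += 1
            if u < n:             # guard against running past the last vertex
                i = s[u]
                inner = True
        else:
            if i < s[u + 1]:
                total[u] += e[i]
                i += 1
            else:
                inner = False     # end of this vertex's edges
    return total
```

With `s = [0, 2, 2, 5]` and `e = [1, 2, 3, 4, 5]`, both versions return `[3, 0, 12]`, and the coalesced loop runs exactly `2 * 3 + 5 = 11` iterations regardless of how the five edges are distributed among the three vertices.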

We now show how this loop coalescing technique leads to novel oblivious graph algorithms.

**Algorithm 2** Oblivious Dijkstra’s algorithm

**Secret Input:** e, s: same as Algorithm 1  
**Output:** dis: the shortest distance from s to each node

1: dis := [∞, ∞, ..., ∞]; dis[s] := 0  
2: PQ.push(0, s); innerLoop := false  
3: for iter := 0 → 2V + E do  
4:    if not innerLoop then  
5:       (dist, u) := PQ.deleteMin()  
6:       if dis[u] == dist then  
7:          dis[u] := −dis[u]; i := s[u]  
8:          innerLoop := true  
9:       end if  
10:    else  
11:       if i < s[u + 1] then  
12:          (u, v, w) := e[i]  
13:          newDist := dist + w  
14:          if newDist < dis[v] then  
15:             dis[v] := newDist  
16:             PQ.insert(newDist, v)  
17:          end if  
18:          i := i + 1  
19:       else  
20:          innerLoop := false  
21:       end if  
22:    end if  
23: end for  

**Oblivious Dijkstra shortest path for sparse graphs.** It is an open problem how to compute single source shortest path (SSSP) obliviously for sparse graphs more efficiently than generic ORAM approaches. Blanton et al. [12] designed a solution for a dense graph, but their technique cannot be applied when the graph is sparse.

Recall that the priority-queue-based Dijkstra’s algorithm has to update the weight whenever a shorter path is found to any vertex. In an oblivious version of Dijkstra’s, this operation dominates the overhead, as it is unclear how to realize it more efficiently than using generic ORAMs. Our solution to oblivious SSSP is more efficient thanks to (1) avoiding this weight update operation, and (2) a *loop coalescing* technique.

*Avoiding weights updating.* This is accomplished by two changes to a standard priority-queue-based Dijkstra’s algorithm, i.e., lines 6-7 and lines 12-13 in Algorithm 1. The basic idea is, whenever a shorter distance `newDist` from `s` to a vertex `u` is found, instead of updating the existing weight of `u` in the heap, we insert a new pair `(newDist, u)` into the priority queue. This change can result in multiple entries for the same vertex in the queue, leading to two concerns: (1) the size of the priority queue cannot be bounded by `V`; and (2) the same vertex might be popped and processed multiple times from the queue. Regarding the first concern, we note that the size of the queue can be bounded by `E`, and `log E = O(log V)` since `E = o(V^2)` for sparse graphs. Hence, each priority queue `insert` and `deleteMin` operation can still be implemented obliviously in `O(log^2 V)` [92]. The second concern is resolved by the check in lines 6-7: every vertex will be processed at most once because `dis[v]` will be set negative once vertex `v` is processed.
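The two changes can be seen in a plain (non-oblivious) Python sketch of Algorithm 1 without the bounded loops. It uses the standard `heapq` module as the priority queue; the adjacency-list input format and the name `dijkstra_lazy` are assumptions of this sketch, and edge weights are assumed to be positive so that the sign trick for marking processed vertices is unambiguous.

```python
import heapq
import math


def dijkstra_lazy(num_vertices, adj, source):
    """Dijkstra with lazy insertion instead of decrease-key.

    adj[u] is a list of (v, w) pairs with positive weights w.
    """
    dis = [math.inf] * num_vertices
    dis[source] = 0
    pq = [(0, source)]
    while pq:
        dist, u = heapq.heappop(pq)
        # Stale entries fail this check: either a shorter distance was
        # recorded later, or u was already processed (dis[u] is negated).
        if dis[u] == dist:
            dis[u] = -dis[u]                   # mark u as processed
            for v, w in adj[u]:
                new_dist = dist + w
                if new_dist < dis[v]:
                    dis[v] = new_dist
                    # Insert a new pair rather than updating v's entry.
                    heapq.heappush(pq, (new_dist, v))
    # Undo the sign marking; unreachable vertices keep distance infinity.
    return [abs(d) for d in dis]
```

On the graph with edges 0→1 (weight 4), 0→2 (weight 1), 2→1 (weight 2), and 1→3 (weight 5), the queue briefly holds two entries for vertex 1; the stale entry `(4, 1)` is discarded by the check, and the result is `[0, 3, 1, 8]`.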

*Loop coalescing.* In Algorithm 1, the two nested loops (line 4 and line 8) use secret data as guards. In order not to leak the secret loop guards, a naive approach is to iterate each loop a maximal number of times (i.e., `V + E`, as `V` alone is considered secret).

Using our loop coalescing technique, we can derive an oblivious Dijkstra’s algorithm that asymptotically outperforms a generic ORAM baseline for sparse graphs. The resulting oblivious algorithm is described in Algorithm 2. Since at most `V` vertices and `E` edges will be visited, we coalesce the two loops into a single one. The code uses a state variable innerLoop to indicate whether a vertex or an edge is being processed. Each iteration deals with one of a vertex (lines 5-8), an edge (lines 12-18), or the end of a vertex’s edges (line 20). So there are $2V + E$ iterations in total. Note that the ObliVM-Lang compiler will pad the if-branches in Algorithm 2 to ensure obliviousness. Further, an oblivious priority queue is employed for PQ.

**Cost analysis.** In Algorithm 2, each iteration of the loop (lines 3-23) makes a constant number of ORAM accesses and two priority queue primitives (insert and deleteMin, both implemented in $O(\log^2 V)$ time). So, the total runtime is $O((V + E)\log^2 V)$.

**Additional algorithmic results.** As summarized in Table 5.1, our loop coalescing technique also immediately gives a new oblivious Minimum Spanning Tree (MST) algorithm, whose full description is omitted.

5.4 Implementing Rich Circuit Libraries

5.4.1 Case Study: Basic Arithmetic Operations

The rich language features provided by ObliVM-Lang make it possible to implement complex arithmetic operations easily and efficiently. We give a case study to demonstrate how to use ObliVM-Lang to implement Karatsuba multiplication.

**Implementing Karatsuba multiplication.** Figure 5.5 contains the implementation of Karatsuba multiplication [50] in ObliVM-Lang. Karatsuba multiplication implements the following recursive algorithm to compute the product of two $n$-bit numbers, $x$ and $y$, taking $O(n^{\log_2 3})$ time.

    int@(2*n) karatsubaMult@n(int@n x, int@n y) {
        int@(2*n) ret;
        if (n < 18) {
            ret = x*y;
        } else {
            int@(n - n/2) a = x$n/2~n$;
            int@(n/2) b = x$0~n/2$;
            int@(n - n/2) c = y$n/2~n$;
            int@(n/2) d = y$0~n/2$;
            int@(2*(n - n/2)) t1 =
                karatsubaMult@(n - n/2)(a, c);
            int@(2*(n/2)) t2 =
                karatsubaMult@(n/2)(b, d);
            int@(n - n/2 + 1) aPb = a + b;
            int@(n - n/2 + 1) cPd = c + d;
            int@(2*(n - n/2 + 1)) t3 =
                karatsubaMult@(n - n/2 + 1)(aPb, cPd);
            int@(2*n) padt1 = t1;
            int@(2*n) padt2 = t2;
            int@(2*n) padt3 = t3;
            ret = (padt1<<(n/2*2)) + padt2 +
                ((padt3 - padt1 - padt2)<<(n/2));
        }
        return ret;
    }

Figure 5.5: Karatsuba multiplication in ObliVM-Lang.

As a quick overview, the algorithm works as follows. First, express the n-bit integers x and y as the concatenation of \( \frac{n}{2} \)-bit integers: \( x = a*2^{n/2}+b \), \( y = c*2^{n/2}+d \). Now, \( x*y \) can be calculated as follows:

\[
t1 = a*c; \quad t2 = b*d; \quad t3 = (a+b)*(c+d);
\]

\[
x*y = (t1<<n) + t2 + ((t3-t1-t2)<<(n/2));
\]

where the multiplications \( a*c \) and \( b*d \) are implemented through a recursive call to the Karatsuba algorithm itself (until the bit-length is small enough).
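The recursion can be checked against ordinary integer multiplication with a short Python sketch. This is a plaintext model of the same recurrence, not a circuit: Python integers are unbounded, so the bit-width annotations of Figure 5.5 appear only as the parameter `n`, and the base-case threshold of 18 bits is copied from the figure. The function name `karatsuba` is chosen for this sketch.

```python
def karatsuba(x, y, n):
    """Multiply two non-negative integers that each fit in n bits."""
    if n < 18:
        return x * y                      # small operands: multiply directly
    half = n // 2
    mask = (1 << half) - 1
    a, b = x >> half, x & mask            # x = a * 2^half + b
    c, d = y >> half, y & mask            # y = c * 2^half + d
    t1 = karatsuba(a, c, n - half)        # a * c
    t2 = karatsuba(b, d, half)            # b * d
    # a + b and c + d may need one extra bit.
    t3 = karatsuba(a + b, c + d, n - half + 1)
    # t3 - t1 - t2 equals a*d + b*c, the middle term.
    return (t1 << (2 * half)) + t2 + ((t3 - t1 - t2) << half)
```

Each level makes three recursive calls on operands of roughly half the width, which is the source of the $O(n^{\log_2 3})$ bound quoted above.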

To implement Karatsuba efficiently, we need to perform operations on a subset of bits. We hence introduce the following syntactic sugar in ObliVM-Lang: in lines 6 to 9 of Figure 5.5, the syntax `num$i~j$` means extracting the part of integer `num` from the i-th bit to the j-th bit.

```c
#define BUCSIZE 3
#define STASHSIZE 33

struct Block@n<T>{
    int@n id, pos;
    T data;
};

struct CircuitOram@n<T>{
    dummy Block@n<T>[public 1<<n+1]
        [public BUCSIZE] buckets;
    dummy Block@n<T>[public STASHSIZE] stash;
};
```

Figure 5.6: Part of our Circuit ORAM implementation (Type Definition) in ObliVM-Lang.

5.4.2 Case Study: Circuit ORAM

In Figures 5.6 and 5.7, we show part of the Circuit ORAM implementation using ObliVM-Lang. Lines 3 to 6 define an ORAM block containing two metadata fields, an index of type int and a position label of type rnd, along with a data field of type T.

Circuit ORAM (lines 7-10) is organized to contain an array of buckets (i.e. arrays of ORAM blocks), and a stash (i.e. an array of blocks). The dummy keyword in front of Block@n<T> indicates that a value of this type can be null. In many cases (e.g. the Circuit ORAM implementation), using the dummy keyword leads to more efficient circuit generation.

Lines 11-30 demonstrate how readAndRemove can be implemented. Taking as input a secret integer index id and a random position label pos, the label pos is first declassified into a public value. The affine type system allows pos to be declassified only once, i.e. pos is never used again in the rest of the program. Further, in a function calling readAndRemove with inputs arg1 and arg2, arg2 cannot be used again in the rest of the program either. This is crucial to enforce that every position label is revealed only once after its generation and, to the best of our knowledge, no prior work enables such enforcement in a compiler.

    phantom T CircuitOram@n<T>.readAndRemove(int@n id, rnd@n pos) {
       public int32 pubPos = pos;
       public int32 i = (1 << n) + pubPos;
       T res;
       for (public int32 k = n; k>=0; k=k-1) {
          for (public int32 j=0;j<BUCSIZE;j=j+1)
             if (buckets[i][j] != null &&
                 buckets[i][j].id == id) {
                res = buckets[i][j].data;
                buckets[i][j] = null;
             }
          i=(i-1)/2;
       }
       for (public int32 i=0;i<STASHSIZE;i=i+1)
          if (stash[i]!=null&&stash[i].id==id) {
             res = stash[i].data;
             stash[i] = null;
          }
       return res;
    }

Figure 5.7: Part of our Circuit ORAM implementation (ReadAndRemove) in ObliVM-Lang.

This Circuit ORAM implementation can be type-checked by ObliVM-Lang’s extended type checker, which gives users stronger confidence that the implementation does not leak information through its execution traces.

5.5 Evaluation

5.5.1 Back End Implementation

Our compiler emits code to a Java-based secure computation back end called ObliVM-SC. ObliVM-SC is designed to be extensible through a central notion called computational environments. Conceptually, our compiler emits circuit designs, whereas a computational environment decides how a circuit design is executed, namely, how each AND and XOR gate will be evaluated. In other words, computational environments provide a separation between circuit designs and their executions, allowing circuit gadgets to be potentially reusable for multiple cryptographic protocols, such as Garbled Circuit [95] or GMW [34]. Currently, ObliVM provides a Garbled Circuit protocol with semi-honest security. However, adapting a circuit design to a different protocol such as GMW would simply require changing to an alternative computational environment, and does not involve modification of the compiler.
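The separation between circuit designs and their execution can be sketched in Python as follows. This is an illustration of the idea only and does not reflect ObliVM-SC's actual Java interfaces: the class `CountingEnv`, its methods `xor` and `and_`, and the adder gadget are hypothetical. The environment shown evaluates gates on plaintext bits and counts AND gates; a protocol environment would implement the same two methods cryptographically, leaving the gadget code unchanged.

```python
class CountingEnv:
    """Plaintext environment: evaluates gates on bools and counts ANDs."""

    def __init__(self):
        self.and_gates = 0

    def xor(self, a, b):
        return a != b

    def and_(self, a, b):
        self.and_gates += 1
        return a and b


def full_adder(env, a, b, carry):
    """One-bit full adder written against the environment interface.

    Uses a single AND gate: carry_out = carry ^ ((a ^ carry) & (b ^ carry)).
    """
    t = env.xor(a, carry)
    total = env.xor(t, b)
    carry_out = env.xor(carry, env.and_(t, env.xor(b, carry)))
    return total, carry_out


def add(env, xs, ys):
    """Ripple-carry addition of two little-endian bit lists of equal length."""
    carry = False
    out = []
    for a, b in zip(xs, ys):
        bit, carry = full_adder(env, a, b, carry)
        out.append(bit)
    return out


def to_bits(value, width):
    return [bool((value >> i) & 1) for i in range(width)]


def from_bits(bits):
    return sum(1 << i for i, bit in enumerate(bits) if bit)
```

Adding two 4-bit values with `add(env, to_bits(5, 4), to_bits(6, 4))` produces the bits of 11 and leaves `env.and_gates == 4`, one AND gate per bit position.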

5.5.2 Metrics and Experiment Setup

**Number of AND gates.** In Garbled Circuit-based secure computation, functions are represented as boolean circuits consisting of XOR and AND gates. Due to well-known Free XOR techniques [8, 17, 53], the cost of evaluating XOR gates is insignificant in comparison with AND gates. Therefore, a primary performance metric is the number of AND gates. This metric is platform independent, i.e., independent of the artifacts of the underlying software implementation, or the hardware configurations where the benchmark numbers are measured. This metric facilitates a fair comparison with existing works based on boolean circuits, and is one of the most popular metrics used in earlier works [44,46,54,55,59,71,72,91,92].

**Wall-clock runtime.** Unless noted otherwise, all wall-clock numbers are measured by executing the protocols between two Amazon EC2 machines of types c4.8xlarge and c3.8xlarge. This metric is platform and implementation dependent, and therefore we will explain how to best interpret wall-clock runtimes, and how these runtimes will be affected by the underlying software and hardware configurations.

**Compilation time.** For all programs we ran, the compilation time is under 1 second. Therefore, we do not separately report the compilation time for each program.

### 5.5.3 Comparison with Previous Automated Approaches

The first general-purpose secure computation system, Fairplay, was built in 2004 [65]. Since then, several improved systems were built [13,43,44,46,54,55,98]. Except for our prior work SCVM, existing systems provide no support for ORAM – and therefore, each dynamic memory access would be compiled to a linear scan of memory.

<table>
<thead>
<tr>
<th>Application</th>
<th>Oblivious programming abstractions and compiler optimizations demonstrated</th>
<th>Parameters for Figure 5.8</th>
<th>Parameters for Table 5.3 and Table 5.4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dijkstra’s Algorithm, MST</td>
<td>Loop coalescing abstraction (see Section 5.3.3).</td>
<td>$V = 2^{14}, E = 3V$</td>
<td>$V = 2^{10}, E = 3V$</td>
</tr>
<tr>
<td>Heap, Map/Set, Binary Search</td>
<td>Oblivious data structure abstraction (see Section 5.3.2).</td>
<td>$N = 2^{27}, K = 32, V = 480$</td>
<td>$N = 2^{23}, K = 32, V = 992$</td>
</tr>
<tr>
<td>AMS Sketch</td>
<td>Compile-time optimizations: split data into separate ORAMs [59].</td>
<td>$\epsilon = 6 \times 10^{-5}, \delta = 2^{-20}$</td>
<td>$\epsilon = 2.4 \times 10^{-4}, \delta = 2^{-20}$</td>
</tr>
<tr>
<td>Count Min Sketch</td>
<td></td>
<td></td>
<td>$\epsilon = 3 \times 10^{-6}, \delta = 2^{-20}$</td>
</tr>
<tr>
<td>K-Means</td>
<td>MapReduce abstraction (see Section 5.3.1).</td>
<td>$N = 2^{18}$</td>
<td>$N = 2^{16}$</td>
</tr>
</tbody>
</table>

Table 5.2: **List of applications used in Figure 5.8.** For graph algorithms, $V, E$ stand for the number of vertices and edges; for data structures, $N, K, V$ stand for the capacity, the bit-length of a key, and the bit-length of a value; for streaming algorithms, $\epsilon, \delta$ stand for the relative error and the failure probability; for K-Means, $N$ stands for the number of points.

We now evaluate the speedup ObliVM achieves relative to previous approaches. To illustrate the sources of the speedup, we consider the following sequence of progressive baselines. We start from Baseline 1, which is representative of a state-of-the-art automated secure computation system. We then add one feature at a time to the baseline, resulting in the next baseline, until we arrive at Baseline 5, which is essentially our ObliVM system.

- **Baseline 1: A state-of-the-art automated system with no ORAM support.** Baseline 1 is intended to characterize a state-of-the-art automated secure computation system with no ORAM support. We assume a compiler that can detect public memory accesses (whose addresses are statically inferable), and directly make such memory accesses. For each dynamic memory access (whose address depends on secret inputs), a linear scan of memory is employed. Baseline 1 is effectively a lower-bound estimate of the cost incurred by CMBC-GC [44], a state-of-the-art system in 2012.

- **Baseline 2: With GO-ORAM [35].** In Baseline 2, we implement the GO-ORAM scheme on top of Baseline 1. Dynamic memory accesses made by a program will be compiled to GO-ORAM accesses. We make no additional compile-time optimizations.

- **Baseline 3: With Circuit ORAM [90].** Baseline 3 is essentially the same as Baseline 2 except that we now replace the ORAM scheme with a state-of-the-art Circuit ORAM scheme [90].

- **Baseline 4: Language and compiler.** Baseline 4 assumes that the ObliVM language and compiler are additionally employed (on top of Baseline 3), resulting in additional savings due to our compile-time optimizations as well as our oblivious programming abstractions.

- **Baseline 5: Back end optimizations.** In Baseline 5, we employ additional back end optimizations atop Baseline 4. Baseline 5 reflects the performance of the actual ObliVM system.

Figure 5.8: Sources of speedup in comparison with state-of-the-art in 2012 [44]: an in-depth look.
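
The linear scan that Baseline 1 employs for each dynamic memory access can be sketched as follows. This is an illustrative Python model, not code from any of the systems compared: the function names are hypothetical, and the equality test stands in for an equality gadget that a real circuit would evaluate on secret-shared or garbled values.

```python
def mux(sel, a, b):
    """Branch-free select on integers: a if sel == 0, b if sel == 1."""
    return a ^ (-sel & (a ^ b))


def linear_scan_read(memory, secret_index):
    """Reads memory[secret_index] while touching every cell in a fixed order."""
    result = 0
    for j, word in enumerate(memory):
        hit = int(j == secret_index)  # in a circuit: an equality gadget
        result = mux(hit, result, word)
    return result


def linear_scan_write(memory, secret_index, value):
    """Returns a copy of memory with one cell replaced, rewriting every cell."""
    updated = []
    for j, word in enumerate(memory):
        hit = int(j == secret_index)
        updated.append(mux(hit, word, value))
    return updated


memory = [7, 3, 9, 4]
read_back = linear_scan_read(memory, 2)
memory = linear_scan_write(memory, 1, 42)
```

Every access visits all cells regardless of the index, so the sequence of memory locations touched is independent of the secret, at a cost linear in the memory size per access. This cost is what the ORAM-based baselines aim to reduce.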

We consider a set of applications in our evaluation as described in Table 5.2. We select several applications to showcase our oblivious programming abstractions, including MapReduce, loop coalescing, and oblivious data structure abstractions. For all applications, we choose moderately large data sizes ranging from 768KB to 10GB. For data structures (e.g., Heap, Map/Set) and binary search, for Baseline 1, we assume that each operation (e.g., search, add, delete) is done with a single linear scan. For Baselines 2 and 3, we assume that a typical sub-linear implementation is adopted. For all other applications, we assume that Baselines 1 to 3 adopt the most straightforward implementation of the algorithm.

**Results.** Figure 5.8 shows the speedup we achieve relative to a state-of-the-art automated system that does not employ ORAM [44]. This speedup comes from the following sources:

*No ORAM to GO-ORAM:* For most of the cases, the data size considered was not big enough for GO-ORAM to be competitive with a linear-scan ORAM. The only exception was AMS sketch, where we chose a large sketch size. In this case, using GO-ORAM would result in a $300 \times$ speedup in comparison with no ORAM (i.e., a linear scan for each dynamic memory access). This part of the speedup is reflected in purple in Figure 5.8. Here the speedup stems from a reduction in circuit size (as measured by the number of AND gates).

*Circuit ORAM:* The red parts in Figure 5.8 reflect the multiplicative speedup attained when we instead use Circuit ORAM (as opposed to no ORAM or GO-ORAM, whichever is faster). This way, we achieve an additional $51 \times$ to $530 \times$ performance gain, reflected by a reduction in the total circuit size.

*Language and compiler:* As reflected by the blue bars in Figure 5.8, our oblivious programming abstractions and compile-time optimizations bring an additional $2 \times$ to $2500 \times$ performance savings on top of a generic Circuit ORAM-based approach. This speedup is also measurable in terms of reduction in the circuit size.

*Back end optimizations:* Our ObliVM-SC is a better architected and more optimized version of its predecessor FastGC [46], which is employed by CMBC-GC [44]. FastGC [46] reported a garbling speed of 96K AND gates/sec, whereas ObliVM garbles at 670K AND gates/sec on a comparable machine. In total, we achieve a $7 \times$ overall speedup compared with FastGC [46].

We stress, however, that ObliVM’s main contribution is not the back end implementation. In fact, it would be faster to hook up ObliVM’s language and compiler with a JustGarble-like system that employs a C-based implementation and hardware AES-NI. However, presently JustGarble does not provide a fully working end-to-end protocol. Therefore, it is an important direction of future work to extend JustGarble to a fully working protocol, and integrate it into ObliVM.

**Comparison with SCVM.** In comparison with SCVM, ObliVM offers the following new features: 1) new oblivious programming abstractions; 2) a Circuit ORAM implementation that is 20× to 30× faster than SCVM’s binary-tree ORAM implementation for 4MB to 4GB data sizes; and 3) the ability to implement low-level gadgets, including the ORAM algorithm itself, in the source language.

Since the design of efficient ORAM algorithms is mainly the contribution of the Circuit ORAM paper [90], here we focus on evaluating the gains from programming abstractions. Therefore, instead of comparing with SCVM per se, we compare with SCVM + Circuit ORAM (i.e., SCVM with its ORAM implementation updated to the latest Circuit ORAM).

### 5.5.4 ObliVM vs. Hand-Crafted Solutions

We show that ObliVM achieves competitive performance relative to hand-crafted solutions for a wide class of common tasks. We also show that ObliVM significantly reduces development effort in comparison with previous secure computation frameworks.

**Competitive performance.** For a set of applications, including Heap, Map/Set, AMS Sketch, Count-Min Sketch, and K-Means, we compared implementations auto-generated by ObliVM with implementations hand-crafted by human experts. Here the human experts are authors of this paper. We assume that the human experts know to employ the most efficient, state-of-the-art oblivious algorithms when designing circuits for these algorithms. For example, Histogram and K-Means algorithms are implemented with oblivious sorting protocols instead of generic ORAM. Heap and Map/Set employ state-of-the-art oblivious data structure techniques [92]. The graph algorithms including Dijkstra and MST employ novel oblivious algorithms proposed in this paper. In comparison, our ObliVM programs for the same applications do not require special security expertise to create. The programmer simply has to express these tasks in the programming abstractions we offer whenever possible. Over the suite of application benchmarks we consider, our ObliVM programs are competitive with hand-crafted implementations, and the performance difference is only $0.5\%-2\%$ throughout.

We remark that the hand-crafted circuits are not necessarily the optimal circuits for each computation task. However, they do represent asymptotically the best known algorithms (or new algorithms that are direct implications of this paper). It is conceivable that circuit optimization techniques such as those proposed in TinyGarble [82] can further reduce circuit sizes by a small constant factor (e.g., 50%). We leave this part as an interesting direction of future research.

**Developer effort.** We use two concrete case studies to demonstrate the significant reduction of developer effort enabled by ObliVM.

*Case study: ridge regression.* Ridge regression [41] takes as input a large number of data points and finds the best-fit linear curve for these points. The algorithm is an important building block in various machine-learning tasks [72]. Previously, Nikolaenko et al. [72] developed a system to securely evaluate ridge regression, using the FastGC framework [46], which took them roughly three weeks [5]. In contrast, we spent two student-hours to accomplish the same task using ObliVM. In addition to the speedup gained from the ObliVM-SC back end, our optimized libraries result in $3 \times$ smaller circuits with aligned parameters. We defer the detailed comparison to the online technical report [61].

*Case study: oblivious data structures.* The oblivious AVL tree (i.e., the Map/Set data structure) is an example of an algorithm that was previously too complex to program as circuits, but now becomes very easy with ObliVM. In an earlier work [92], we designed an oblivious AVL tree algorithm, but were unable to implement it due to high programming complexity. Now, with ObliVM, we implement an AVL tree with 311 lines of code in ObliVM-Lang, consuming under 10 student-hours (including the implementation as well as debugging).

We stress that it is not possible to implement an oblivious AVL tree in previous languages for secure computation, including the state-of-the-art Wysteria [75].

### 5.5.5 End-to-End Application Performance

Currently in ObliVM-SC, we implemented a standard garbling scheme with Garbled Row Reduction [69] and FreeXOR [53]. We also implemented an OT extension protocol proposed by Ishai et al. [48] and a basic OT protocol by Naor and Pinkas [68].
<table>
<thead>
<tr>
<th>Program</th>
<th>Input size</th>
<th>CMBC-GC (estimate): #AND gates</th>
<th>CMBC-GC (estimate): total time</th>
<th>ObliVM Framework: #AND gates</th>
</tr>
</thead>
<tbody>
<tr>
<td>Integer addition</td>
<td>1024 bits</td>
<td>2977</td>
<td>31ms</td>
<td>1024</td>
</tr>
<tr>
<td>Integer mult.</td>
<td>1024 bits</td>
<td>6.4M</td>
<td>66.4s</td>
<td>572K</td>
</tr>
<tr>
<td>Integer Comparison</td>
<td>16384 bits</td>
<td>32K</td>
<td>335.7ms</td>
<td>16384</td>
</tr>
<tr>
<td>Floating point addition</td>
<td>64 bits</td>
<td>10K</td>
<td>104ms</td>
<td>3035</td>
</tr>
<tr>
<td>Floating point mult.</td>
<td>64 bits</td>
<td>10K</td>
<td>104ms</td>
<td>4312</td>
</tr>
<tr>
<td>Hamming distance</td>
<td>1600 bits</td>
<td>30K</td>
<td>310ms</td>
<td>3200</td>
</tr>
</tbody>
</table>

**Basic instructions**

<table>
<thead>
<tr>
<th>Program</th>
<th>Input size</th>
<th>CMBC-GC (estimate): #AND gates</th>
<th>CMBC-GC (estimate): total time</th>
<th>ObliVM Framework: #AND gates</th>
</tr>
</thead>
<tbody>
<tr>
<td>K-Means</td>
<td>0.5MB</td>
<td>550B</td>
<td>66d</td>
<td>2269M</td>
</tr>
<tr>
<td>Dijkstra’s Algorithm</td>
<td>48KB</td>
<td>755B</td>
<td>91d</td>
<td>10B</td>
</tr>
<tr>
<td>MST</td>
<td>48KB</td>
<td>755B</td>
<td>91d</td>
<td>9.6B</td>
</tr>
<tr>
<td>Histogram</td>
<td>0.25MB</td>
<td>137B</td>
<td>16.5d</td>
<td>866M</td>
</tr>
</tbody>
</table>

**Linear or super-linear algorithms**

<table>
<thead>
<tr>
<th>Program</th>
<th>Input size</th>
<th>CMBC-GC (estimate): #AND gates</th>
<th>CMBC-GC (estimate): total time</th>
<th>ObliVM Framework: #AND gates</th>
</tr>
</thead>
<tbody>
<tr>
<td>Heap</td>
<td>1GB</td>
<td>32B</td>
<td>3.9d</td>
<td>12.5M</td>
</tr>
<tr>
<td>Map/Set</td>
<td>1GB</td>
<td>32B</td>
<td>3.9d</td>
<td>23.9M</td>
</tr>
<tr>
<td>Binary Search</td>
<td>1GB</td>
<td>32B</td>
<td>3.9d</td>
<td>1562K</td>
</tr>
<tr>
<td>Count Min Sketch</td>
<td>0.31GB</td>
<td>9.9B</td>
<td>30.8h</td>
<td>8088K</td>
</tr>
<tr>
<td>AMS Sketch</td>
<td>1.25GB</td>
<td>40B</td>
<td>5.18d</td>
<td>9949K</td>
</tr>
</tbody>
</table>

**Sublinear-time algorithms**

Table 5.3: Application performance. Actual measured numbers are in bold. The remainder are estimated numbers and should be interpreted with care. ObliVM numbers for basic instructions and sublinear-time algorithms are the mean of 20 runs. Since our measurements for all these applications have a small spread (all runs are within 6% of the mean), we use a single run for linear-time and super-linear algorithms (the same holds for Table 5.4).

**Setup.** For evaluation, here we consider a scenario where a client secret-shares its data between two non-colluding cloud providers a priori. For cases where inputs are a large dataset (e.g., Heap, Map/Set, etc.), depending on the application, the client may sometimes need to place the inputs in an ORAM, and secret-share the resulting ORAM between the two cloud providers. We do not measure this setup cost in our evaluation, since it can depend highly on the available bandwidth between the client and the two cloud providers. Therefore, our evaluation begins assuming this one-time setup has completed.

**End-to-end application performance.** In Table 5.3, we consider three types of applications: basic instructions (e.g., addition, multiplication, and floating point operations); linear or super-linear algorithms (e.g., Dijkstra, K-Means, Minimum Spanning Tree, and Histogram); and sublinear-time algorithms (e.g., Heap, Map/Set, Binary Search, Count Min Sketch, AMS Sketch). We report the circuit size, online and total costs for a variety of applications at typical data sizes.

In Table 5.3, we also compare ObliVM with a state-of-the-art automated secure computation system CMBC-GC [44]. We note that the authors of CMBC-GC did not run all of these application benchmarks, so we project the performance of CMBC-GC using the following estimate: we first change our compiler to adopt a linear scan of memory upon dynamic memory accesses, which allows us to obtain an estimate of the circuit size CMBC-GC would have obtained for the same applications. For the set of application benchmarks (e.g., K-Means, MST, etc.) that CMBC-GC did report in their paper, we confirmed that our circuit size estimates are always a lower bound of what CMBC-GC reported. We then estimate the runtime of CMBC-GC based on their reported 96K AND gates per sec, assuming that a network bandwidth of at least 2.8MBps is provisioned.
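
The runtime projection described above amounts to dividing an estimated circuit size by a garbling rate. The following Python sketch reproduces that arithmetic for two CMBC-GC entries of Table 5.3; the function and constant names are hypothetical, and the model deliberately ignores everything except the reported gate throughput.

```python
# Garbling speed reported for CMBC-GC / FastGC, in AND gates per second.
CMBC_GC_GATES_PER_SEC = 96_000

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_runtime_seconds(and_gates, gates_per_sec):
    """Projected runtime when the AND-gate throughput is the only cost considered."""
    return and_gates / gates_per_sec


# 1024-bit integer addition: 2977 AND gates (CMBC-GC estimate in Table 5.3).
addition_seconds = estimate_runtime_seconds(2977, CMBC_GC_GATES_PER_SEC)

# K-Means: roughly 550 billion AND gates (CMBC-GC estimate in Table 5.3).
kmeans_days = estimate_runtime_seconds(550e9, CMBC_GC_GATES_PER_SEC) / SECONDS_PER_DAY

print(f"addition: {addition_seconds * 1000:.0f} ms, K-Means: {kmeans_days:.0f} days")
```

These two projections come out at about 31 ms and about 66 days, in agreement with the corresponding total-time entries of Table 5.3. Entries whose gate counts are heavily rounded in the table will match less closely.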

As mentioned earlier, the focus of this paper is our language and compiler, not the back end cryptographic implementation. It should be relatively easy to integrate our language and compiler with a JustGarble-like back end that employs hardware AES-NI. In Table 5.3, we also give an estimate of the performance we anticipate if we ran our ObliVM-generated circuits over a JustGarble-like back end. This is calculated using our circuit sizes and the 11M AND gates/sec performance number reported by JustGarble [11].

- **Online cost.** To measure online cost, we assume that all work that is independent of input data is performed offline, including garbling and input-independent OT preprocessing. Our present ObliVM implementation achieves an online speed of 1.8M gates/sec consuming roughly 54MBps network bandwidth.

- **Offline cost.** When no work is deferred to an offline phase, ObliVM achieves a garbling speed of 670K gates/sec consuming 19MBps network bandwidth.

**Slowdown relative to a non-secure baseline.** For completeness, we now describe ObliVM’s slowdown in comparison with a non-secure baseline where computation is performed in cleartext. As shown in Table 5.4, our slowdown relative to a non-secure baseline is application dependent, and ranges from $45 \times$ to $9.3 \times 10^6 \times$. We also present the *estimated* slowdown if a JustGarble-like back end is used for ObliVM-generated circuits. These numbers are estimated based on our circuit sizes as well as the 11M AND gates/sec performance number reported by JustGarble [11].

<table>
<thead>
<tr>
<th rowspan="2">Task</th>
<th rowspan="2">Cleartext time</th>
<th colspan="2">ObliVM</th>
<th colspan="2">ObliVM + JustGarble (estimate)</th>
</tr>
<tr>
<th>Runtime</th>
<th>Slowdown</th>
<th>Runtime</th>
<th>Slowdown</th>
</tr>
</thead>
<tbody>
<tr>
<td>K-Means (Online)</td>
<td>0.4ms</td>
<td>24min</td>
<td>$3.6 \times 10^6$</td>
<td>1.9min</td>
<td>$2.9 \times 10^5$</td>
</tr>
<tr>
<td>K-Means (Total)</td>
<td>0.4ms</td>
<td>62min</td>
<td>$9.3 \times 10^6$</td>
<td>4.58min</td>
<td>$6.9 \times 10^5$</td>
</tr>
<tr>
<td>Distributed GWAS (Online)</td>
<td>40ms</td>
<td>1.8s</td>
<td>45</td>
<td>0.14s</td>
<td>3.5</td>
</tr>
<tr>
<td>Distributed GWAS (Total)</td>
<td>40ms</td>
<td>5.2s</td>
<td>130</td>
<td>0.28s</td>
<td>7</td>
</tr>
<tr>
<td>Binary Search (Online)</td>
<td>10µs</td>
<td>1.3s</td>
<td>$1.3 \times 10^5$</td>
<td>78.8ms</td>
<td>$7.9 \times 10^3$</td>
</tr>
<tr>
<td>Binary Search (Total)</td>
<td>10µs</td>
<td>7.4s</td>
<td>$7.4 \times 10^5$</td>
<td>189ms</td>
<td>$1.9 \times 10^4$</td>
</tr>
<tr>
<td>AMS Sketch (Online)</td>
<td>80µs</td>
<td>9.5s</td>
<td>$1.2 \times 10^5$</td>
<td>0.5s</td>
<td>$6.3 \times 10^3$</td>
</tr>
<tr>
<td>AMS Sketch (Total)</td>
<td>80µs</td>
<td>36.8s</td>
<td>$4.6 \times 10^5$</td>
<td>1.2s</td>
<td>$1.5 \times 10^4$</td>
</tr>
<tr>
<td>Hamming (Online)</td>
<td>0.3µs</td>
<td>1.71ms</td>
<td>$6 \times 10^3$</td>
<td>0.16ms</td>
<td>$5.3 \times 10^2$</td>
</tr>
<tr>
<td>Hamming (Total)</td>
<td>0.3µs</td>
<td>5.07ms</td>
<td>$1.7 \times 10^4$</td>
<td>0.39ms</td>
<td>$1.3 \times 10^3$</td>
</tr>
</tbody>
</table>

Table 5.4: **Slowdown of secure computation compared with non-secure, cleartext computation.** Parameter choices are the same as Table 5.3. Online cost only includes operations that are input-dependent. All time measurements assume data are pre-loaded to the memory. ObliVM requires a bandwidth of 19MBps. Numbers for JustGarble are estimated using ObliVM-generated circuit sizes assuming 315MBps bandwidth.
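
The slowdown figures are ratios of a secure runtime to the corresponding cleartext time. As a small worked check, the Python sketch below recomputes the two extremes quoted above from the runtimes listed in Table 5.4; the function name `slowdown` is hypothetical.

```python
def slowdown(secure_seconds, cleartext_seconds):
    """Ratio of secure-computation runtime to cleartext runtime."""
    return secure_seconds / cleartext_seconds


# Distributed GWAS (online): 1.8 s secure vs. 40 ms cleartext.
gwas_online = slowdown(1.8, 40e-3)

# K-Means (total): 62 min secure vs. 0.4 ms cleartext.
kmeans_total = slowdown(62 * 60, 0.4e-3)

print(f"GWAS online: {gwas_online:.0f}x, K-Means total: {kmeans_total:.1e}x")
```

The two ratios evaluate to about 45 and about $9.3 \times 10^6$, which are the endpoints of the range stated in the text.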

In particular, we elaborate on the following interesting cases. First, the distributed genome-wide association study (GWAS) application is Task 1 in the iDash secure genomic analysis competition [1], with total data size 380KB. This task achieves a small slowdown, because part of the computation is done locally. Specifically, Alice and Bob each perform some local preprocessing to obtain the allele frequencies of their own data, before engaging in a secure computation protocol to compute $\chi^2$-statistics. For details, we refer the reader to our online short note on how we implemented the competition tasks. On the other hand, benchmarks with floating point operations such as K-Means incur a relatively larger slowdown because modern processors have special floating point instructions, which favors the insecure baseline.

5.6 Conclusion

We design ObliVM, a programming framework for automated secure computation. Additional examples can be found at our project website http://www.oblivm.com, including popular streaming algorithms, graph algorithms, data structures, machine learning algorithms, secure genome analysis [1], etc.

Chapter 6: Conclusion Remarks and Future Directions

6.1 Summary

In this thesis, we investigate a set of cloud-related security applications in which programs’ execution traces may leak information. We propose principled approaches to achieve performant trace oblivious program execution. In particular, we exploit the intrinsic obliviousness within each program, so that expensive cryptographic ORAM constructions and their overheads can be avoided. Security type systems are developed to enforce that our optimizations do not violate privacy and security requirements in the application domains.

Based on these principled methods, we build GhostRider, a hardware-software co-designed system, as a hardware-based solution, and ObliVM, a RAM-model secure computation framework, as a cryptography-based solution to mitigate attacks from a cloud’s insiders and intruders. Both systems demonstrate improvements over previous ones by orders of magnitude.

6.2 Future Direction

While this thesis has greatly expanded the study of trace oblivious execution, several promising future directions remain.

6.2.1 Verifying Hardware ORAM Implementation

While we have demonstrated that ObliVM’s type system can help identify bugs in Circuit ORAM implementations, it remains an interesting question whether the obliviousness of ORAM algorithms can be verified automatically. A particularly interesting and important direction is to verify that a hardware ORAM controller, such as the one in GhostRider, is implemented securely. Several challenges may arise during this investigation. First, no static analysis-based approach has been developed to verify the obliviousness of ORAM algorithms. The main challenge is to ensure that random numbers are handled correctly. Second, there is a gap between the languages that we have been studying and popular hardware programming languages, such as Verilog. This gap may introduce further technical difficulties in designing such a verifier.

6.2.2 Parallel Trace Oblivious Execution

So far, we have focused on sequential programs. As parallel programs become mainstream in applications such as big data and deep learning, it is interesting and important to study how to achieve trace obliviousness for parallel programs. This is not easy. In particular, the design of concurrent ORAM algorithms is still in its theoretical phase, and most existing concurrent ORAM algorithms are not practical. Therefore, using syntactic hints from the executed program to exploit its obliviousness is more promising than relying on parallel ORAM algorithms. In fact, many big-data programming frameworks, such as MapReduce [25], already require programmers to express their computations as mappers and reducers, both of which can be executed in parallel without leaking any information through the execution traces. GraphSC [70] has made a first attempt to adopt this idea, presenting a set of programming interfaces so that programs using these interfaces can be turned into their parallel oblivious versions automatically without incurring too much overhead.

6.2.3 Differentially Privately Oblivious Execution

In this thesis, we focus on programs that leak absolutely no information through their execution traces. Most practical programs, however, do not have a counterpart satisfying this property: it is very easy for a program to include a loop whose guard depends on some secret data. Therefore, it is interesting to seek a weaker version of trace obliviousness.

Privacy research over the past ten years has advocated enforcing weaker privacy notions, such as differential privacy, in real applications. Therefore, it is interesting to study whether trace obliviousness has a differentially private counterpart. If related techniques can be developed, trace oblivious program execution will be more practical in applications where absolute trace obliviousness is unnecessary but differential privacy is sufficient.

Appendix A: Proof of Theorem 1

A.1 Trace equivalence and lemmas

We shall further study some properties of trace equivalence. First of all, we define the length of a trace $t$, denoted as $|t|$, to be:

$$
|t| = \begin{cases}
1 & \text{if } t = \text{read}(x, n) \mid \text{write}(x, n) \mid \text{readarr}(x, n, n') \mid \text{writearr}(x, n, n') \\
0 & \text{if } t = \epsilon \\
|t_1| + |t_2| & \text{if } t = t_1@t_2
\end{cases}
\tag{A.1}
$$

Lemma 1. If $t_1 \equiv t_2$, then $|t_1| = |t_2|$.

Proof. Let us prove by induction on how $t_1 \equiv t_2$ is derived. If $t_1 = t_2$, then the conclusion is obvious. If $t_1 = \epsilon@t_2$, then $|t_1| = |\epsilon| + |t_2| = |t_2|$. Similarly, we can prove the conclusion when $t_1 = t_2@\epsilon$, $t_2 = \epsilon@t_1$, $t_2 = t_1@\epsilon$, or $t_1 = \epsilon@t$ and $t_2 = t@\epsilon$.

If $t_1 = t_{11}@t_{12}$, $t_2 = t_{21}@t_{22}$, $t_{11} \equiv t_{21}$, and $t_{12} \equiv t_{22}$, then by induction, we have $|t_{11}| = |t_{21}|$, and $|t_{12}| = |t_{22}|$. Therefore $|t_1| = |t_{11}| + |t_{12}| = |t_{21}| + |t_{22}| = |t_2|$.

Finally, if $t_1 = (t'_1@t'_2)@t'_3$ and $t_2 = t'_1@(t'_2@t'_3)$, then $|t_1| = |t'_1@t'_2| + |t'_3| = |t'_1| + |t'_2| + |t'_3| = |t'_1| + |t'_2@t'_3| = |t_2|$. □

Now, we define the $i$-th element in a trace, denoted $t[i]$, as follows:

$$
t[i] = \begin{cases}
\epsilon & \text{if } i \leq 0 \lor i > |t| \\
t & \text{if } i = 1 \land t = \text{read}(x, n) \mid \text{write}(x, n) \mid \text{readarr}(x, n, n') \mid \text{writearr}(x, n, n') \\
t_1[i] & \text{if } t = t_1 @ t_2 \land 1 \leq i \leq |t_1| \\
t_2[i - |t_1|] & \text{if } t = t_1 @ t_2 \land |t_1| < i \leq |t|
\end{cases}
$$

It is easy to see that $\forall i. t_1[i] = t_2[i]$ implies $|t_1| = |t_2|$ by the following lemma.

**Lemma 2.** $t[i] \neq \epsilon$ for all $i$ such that $1 \leq i \leq |t|$, and $t[i] = \epsilon$ otherwise.

**Proof.** The second part of the conclusion is trivial since it directly follows the definition. We prove the first part by induction on $|t|$. If $|t| = 0$, then the conclusion is trivial.

If $|t| = 1$, and $1 \leq i \leq |t|$, then $i$ must be 1. Therefore, $t[i]$ is one of $\text{read}(x, n)$, $\text{write}(x, n)$, $\text{readarr}(x, n, n')$, and $\text{writearr}(x, n, n')$, and therefore $t[i] \neq \epsilon$.

If $|t| > 1$, then $t$ must be a concatenation of two subsequences, i.e. $t_1 @ t_2$. If $1 \leq i \leq |t_1|$, then $t[i] = t_1[i]$, and by induction, we know that $t[i] \neq \epsilon$. Otherwise, if $|t_1| < i \leq |t|$, then $0 < i - |t_1| \leq |t| - |t_1| = |t_2|$. For natural number $n$, $n > 0$ implies $n \geq 1$. Therefore $1 \leq i - |t_1| \leq |t_2|$, and by induction, we have $t[i] = t_2[i - |t_1|] \neq \epsilon$. □
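
The definitions of $|t|$ and $t[i]$ translate directly into recursive functions. The Python sketch below is only an illustrative model: traces are encoded as nested tuples, an encoding that is an assumption of this example, and the names `EPS`, `cat`, `length`, and `nth` are hypothetical.

```python
# Encoding (an assumption of this sketch):
#   EPS                      the empty trace
#   ("cat", t1, t2)          the concatenation t1 @ t2
#   any other tuple          a single event, e.g. ("read", "x", 3)
EPS = ("eps",)


def cat(t1, t2):
    """Builds the concatenation t1 @ t2."""
    return ("cat", t1, t2)


def length(t):
    """|t|: the number of events in t, ignoring empty sub-traces."""
    if t == EPS:
        return 0
    if t[0] == "cat":
        return length(t[1]) + length(t[2])
    return 1


def nth(t, i):
    """t[i]: the i-th event of t (1-based), or EPS when i is out of range."""
    if i <= 0 or i > length(t):
        return EPS
    if t[0] == "cat":
        left = length(t[1])
        return nth(t[1], i) if i <= left else nth(t[2], i - left)
    return t  # a single event, and i must be 1 here


r = ("read", "x", 3)
w = ("write", "y", 5)
ra = ("readarr", "a", 0, 7)
t = cat(cat(r, EPS), cat(w, ra))
events = [nth(t, i) for i in range(1, length(t) + 1)]
```

In this model, Lemma 2 corresponds to the observation that `nth` returns `EPS` exactly when the index falls outside the range from 1 to `length(t)`.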

Before we go to the next lemma, we shall define the canonical representation of a trace. First, we define the number of blocks in a trace $t$, denoted by $\#(t)$, as

$$
\#(t) = \begin{cases} 
\#(t_1) + \#(t_2) & \text{if } t = t_1@t_2 \\
1 & \text{otherwise}
\end{cases}
$$

Then we define an order $\preceq_t$ between two traces $t_1$ and $t_2$ as follows: $t_1 \preceq_t t_2$ if and only if either of the following two conditions holds: (i) $\#(t_1) < \#(t_2)$, or (ii) $\#(t_1) = \#(t_2) \geq 2$, $t_1 = t'_1@t''_1$, $t_2 = t'_2@t''_2$, and one of the following three sub-conditions holds: (ii.a) $\#(t'_1) > \#(t'_2)$; (ii.b) $\#(t'_1) = \#(t'_2)$ and $t'_1 \preceq_t t'_2$; or (ii.c) $t'_1 = t'_2$ and $t''_1 \preceq_t t''_2$. It is easy to see that $\preceq_t$ is complete.

**Definition 10** (canonical representation). The canonical representation of a trace $t$ is the minimal element in the set \( \{ t' : t \equiv t' \} \) under order $\preceq_t$.

**Lemma 3.** The canonical representation of $t$ is (i) $\epsilon$ if $|t| = 0$; or (ii) $\text{can}(t) = \ldots((t_1@t_2)@t_3)\ldots@t_n$, where $n = |t| > 0$, and $t_i = t[i]$.

**Proof.** On the one hand, it is easy to see that $\text{can}(t)$ belongs to the set \( \{ t' : t \equiv t' \} \).

In fact, we can prove this by induction on $\#(t)$. If $\#(t) = 1$, then either $|t| = 1$, or $|t| = 0$. For the former case, $t$ is one of read($x, n$), write($x, n$), fetch($p$), readarr($x, n, n'$), and writearr($x, n, n'$), and thus $t = t[1] = \text{can}(t)$. For the latter case, $t = \epsilon$.

Now suppose $\#(t) > 1$, and thus $t = t'@t''$. Suppose $|t'| = l_1$ and $|t''| = l_2$. If $l_2 = 0$, by induction, $t'' = \epsilon$, and thus $t \equiv t'$. Furthermore, we have $|t| = |t'|$, and $\forall i. t[i] = t'[i]$ by definition. Therefore $t \equiv t' \equiv \text{can}(t') = \text{can}(t)$. Similarly, we can prove the conclusion is true when $l_1 = 0$. Now suppose $l_1 > 0$ and $l_2 > 0$, then $\text{can}(t') = \ldots(((t_1@t_2)@t_3)\ldots@t_{l_1})$, and $\text{can}(t'') = \ldots(((t_{l_1+1}@t_{l_1+2})@t_{l_1+3})\ldots@t_{l_1+l_2})$. Then $t \equiv can(t')@can(t'') \equiv can(t)$.

On the other hand, we shall show that $can(t)$ is the minimal one in $\{t' : t \equiv t'\}$. To show this point, we only need to show that for all $t' \equiv can(t)$, we have $can(t) \preceq_t t'$.

We prove by induction on $n$. If $n = 1$, the conclusion is obvious. Suppose $n > 1$ and the conclusion holds true for all $n' < n$.

It is easy to see that $\#(t') > 1$, therefore we suppose $t' = t_l@t_r$. Then we prove that there exists $k$ such that $t_l \equiv (t_1@t_2)\ldots@t_k$ and $t_r \equiv (t_{k+1}@t_{k+2})\ldots@t_n$.

We prove by induction on $n$ and on how many steps of the equivalent-transitive rule, i.e., $t_1 \equiv t_2 \land t_2 \equiv t_3 \Rightarrow t_1 \equiv t_3$, should be applied to derive $can(t) \equiv t'$. If we should apply 0 steps, then we know one of the following situations holds: (i) $t' = t''@t_n$ where $t'' \equiv (t_1@t_2)\ldots@t_{n-1}$; (ii) $t' = (t_1@t_2)\ldots@t_{n-2}@t_{n-1}@t_n$; (iii) $t' = t@\epsilon$; or (iv) $t' = \epsilon@t$. In any case, our conclusion holds true. Now suppose we need to apply $n > 0$ steps to derive $t'$, where in the $n-1$ step, we derive that $can(t) \equiv t''$ and we can derive $t'' \equiv t'$ without applying the equivalent-transitive rule. Therefore by induction, we know that $t'' = t_l''@t_r''$, and there is $k$ such that $t_l'' \equiv (t_1@t_2)\ldots@t_k$ and $t_r'' \equiv (t_{k+1}@t_{k+2})\ldots@t_n$. Since we can derive $t'' \equiv t'$ without applying the equivalent-transitive rule, we know that one of the following situations holds:

1. $t_l'' \equiv t_l$ and $t_r'' \equiv t_r$;
2. $t' = \epsilon@t''$;
3. $t' = t''@\epsilon$;
4. $t'' = t'@\epsilon$;
5. \( t'' = \epsilon t' \);

6. \( \epsilon t' = t'' \epsilon \);

7. \( t' \epsilon = \epsilon t'' \);

8. \( t''_l = (t_{l1} @ t_{l2}) \), \( t_{l1} \equiv t_{l1} \), and \( t_r \equiv t_{l2} @ t''_r \) (in this case, we have \( (t_{l1} @ t_{l2}) @ t''_r \equiv t_{l1} @ (t_{l2} @ t''_r) \)); and

9. \( t''_r = (t_{r1} @ t_{r2}) \), \( t_{l1} \equiv t''_l @ t_{r1} \), and \( t_r \equiv t_{r2} \) (in this case, we have \( t''_l @ (t_{r1} @ t_{r2}) \equiv (t''_l @ t_{r1}) @ t_{r2} \)).

For the first 7 cases, the conclusion is trivial. For Case 8, by induction, we know there are some \( k' \) such that \( t_{l1} \equiv \ldots (t_1 @ t_2) @ t_{k'} \) and \( t_{l2} \equiv \ldots (t_{k'+1} @ t_{k'+2}) @ t_k \).

Therefore \( t_l \equiv \ldots (t_1 @ t_2) @ t_{k'} \), and \( t_r \equiv (\ldots (t_{k'+1} @ t_{k'+2}) @ t_k) @ (\ldots (t_{k+1} @ t_{k+2}) @ t_n) \), while the latter is equivalent to \( \ldots (t_{k'+1} @ t_{k'+2}) @ t_n \). Similarly, we can prove that the conclusion also holds in Case 9.

Next, we prove that \( \text{can}(t) \preceq_t t_l @ t_r \). To show this point, by induction, we know \( \ldots (t_1 @ t_2) @ t_k \preceq_t t_l \) and \( \ldots (t_{k+1} @ t_{k+2}) @ t_n \preceq_t t_r \). If either \( \#(t_l) > k \) or \( \#(t_r) > n - k \), we have \( \#(t') > n \) and thus \( \text{can}(t) \preceq_t t' \). Suppose \( \#(t_l) = k \) and \( \#(t_r) = n - k \). If \( k < n - 1 \), then by the definition of \( \preceq_t \), we have \( \text{can}(t) \preceq_t t' \). Next suppose \( k = n - 1 \), then by induction, we know \( \ldots (t_1 @ t_2) @ t_{n-1} \preceq_t t_l \), and thus \( \text{can}(t) \preceq_t t_l @ t_r = t' \).

Next is the most important lemma about trace-equivalence.

**Lemma 4.** \( t_1 \equiv t_2 \), if and only if \( \forall i. t_1[i] = t_2[i] \).
Proof. “⇐” Suppose $\forall i. t_1[i] = t_2[i]$. Then by Lemma 3, we know $can(t_1) = can(t_2)$, and thus $t_1 \equiv can(t_1) \equiv can(t_2) \equiv t_2$.

“⇒” Suppose $t_1 \equiv t_2$, then by Lemma 3, we have $can(t_1) \equiv can(t_2)$. Since both $can(t_1)$ and $can(t_2)$ have the same canonical form, we know they are identical. Therefore, we can conclude that $\forall i. t_1[i] = t_2[i]$. \qed
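
The criterion of Lemma 4 can be illustrated with a minimal sketch. The encoding below is an assumption made only for illustration and is not part of the formalism: a trace is the string `"eps"`, an atomic event written as a tuple such as `("read", "x", 3)`, or a concatenation `("@", left, right)`. Flattening a trace plays the role of the canonical form, and two traces are equivalent exactly when their flattened event lists agree.

```python
# Hypothetical encoding (an assumption for illustration only):
#   "eps"                 the empty trace
#   ("read", "x", 3)      an atomic event (any tuple not starting with "@")
#   ("@", left, right)    the concatenation left @ right
EPS = "eps"


def concat(left, right):
    """Build the trace left @ right."""
    return ("@", left, right)


def events(t):
    """Flatten a trace into its left-to-right list of non-empty events."""
    if t == EPS:
        return []
    if isinstance(t, tuple) and t[0] == "@":
        return events(t[1]) + events(t[2])
    return [t]


def nth(t, i):
    """Return t[i] (1-based); positions outside the trace yield EPS."""
    evs = events(t)
    return evs[i - 1] if 1 <= i <= len(evs) else EPS


def equivalent(t1, t2):
    """Decide t1 = t2 up to trace equivalence via the elementwise criterion."""
    return events(t1) == events(t2)
```

For example, under this encoding `(a @ b) @ c` and `a @ (eps @ (b @ c))` are reported as equivalent, because re-association and empty traces do not change the flattened list.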

A.2 Lemmas on trace pattern equivalence

Trace pattern equivalence has properties similar to those of trace equivalence. In fact, we define the length of a trace pattern $T$, denoted as $|T|$, to be

$$|T| = \begin{cases} 
1 & \text{if } T = \text{Read}(x) \mid \text{Fetch}(p) \\
0 & \text{if } T = \epsilon \\
|T_1| + |T_2| & \text{if } T = T_1@T_2
\end{cases}$$

Similar to trace, we define the $i$-th element in a trace pattern $T$, denoted $T[i]$, as follows:

$$T[i] = \begin{cases} 
\epsilon & \text{if } i \leq 0 \lor i > |T| \\
T & \text{if } i = 1 \land T = \text{Read}(x) \\
T_1[i] & \text{if } T = T_1@T_2 \land 1 \leq i \leq |T_1| \\
T_2[i - |T_1|] & \text{if } T = T_1@T_2 \land |T_1| < i \leq |T|
\end{cases}$$
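
The two definitions above translate directly into recursive functions. The following is a sketch under an assumed encoding (not taken from the text): a pattern is `"eps"`, an atom such as `("Read", "x")` or `("Fetch", "p")`, or a concatenation `("@", T1, T2)`. Returning the atom itself for `Fetch(p)` at position 1 is also an assumption, since the definition of $T[i]$ above only spells out the `Read(x)` case.

```python
# Hypothetical encoding of trace patterns (an assumption for illustration):
#   "eps", ("Read", "x"), ("Fetch", "p"), or ("@", T1, T2)
EPS = "eps"


def length(T):
    """|T|: atoms count 1, eps counts 0, concatenation adds lengths."""
    if T == EPS:
        return 0
    if T[0] == "@":
        return length(T[1]) + length(T[2])
    return 1


def index(T, i):
    """T[i] with 1-based positions; out-of-range positions give eps."""
    if i <= 0 or i > length(T):
        return EPS
    if T[0] == "@":
        n1 = length(T[1])
        # Positions 1..|T1| fall in the left part, the rest in the right part.
        return index(T[1], i) if i <= n1 else index(T[2], i - n1)
    return T  # a single atom at position 1
```

With these helpers, the elementwise criterion of Lemma 5 can be checked by comparing `index(T1, i)` and `index(T2, i)` for every `i` up to the larger of the two lengths.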

Using exactly the same technique, we can prove the following lemma:

Lemma 5. $T_1 \sim_L T_2$, if and only if $\forall i. T_1[i] = T_2[i]$.

To avoid verbosity, we do not provide the full proof here. It is quite similar to the proof of Lemma 4.

A.3 Proof of memory trace obliviousness

To prove Theorem 1, memory trace obliviousness by typing, we shall first prove the following lemma:

**Lemma 6.** If $\Gamma \vdash e : \text{Nat } L ; T$, then for any two $\Gamma$-valid low-equivalent memories $M_1, M_2$, if $\langle M_1, e \rangle \downarrow_{t_1} n_1, \langle M_2, e \rangle \downarrow_{t_2} n_2$, then $t_1 = t_2$ and $n_1 = n_2$

**Proof.** We use structural induction on expression $e$ to prove this lemma. If $e$ is in form of $x$, then $\Gamma(x) = \text{Nat } L$, and thus $M_1(x) = M_2(x) = n$ according to the definition of low-equivalence and $\Gamma$-validity. Therefore $t_1 = \text{read}(x, n) = t_2$, and $n_1 = n_2 = n$.

If $e$ is of the form $e_1 \ op \ e_2$, then $\Gamma \vdash e_1 : \text{Nat } L$ and $\Gamma \vdash e_2 : \text{Nat } L$. Suppose $\langle M_i, e_j \rangle \downarrow_{t_{ij}} n_{ij}$, for $i = 1, 2$, $j = 1, 2$. Then by induction $t_{1j} = t_{2j}$ and $n_{1j} = n_{2j}$ for $j = 1, 2$. Therefore $t_1 = t_{11} @ t_{12} = t_{21} @ t_{22} = t_2$, and $n_1 = n_{11} \ op \ n_{12} = n_{21} \ op \ n_{22} = n_2$.

Next, we consider an expression of the form $x[e]$. We know that $\Gamma(x) = \text{Array } L$, which implies $\Gamma \vdash e : \text{Nat } L$. Suppose $\langle M_i, e \rangle \downarrow_{t'_i} n'_i$, then by induction $t'_1 = t'_2$ and $n'_1 = n'_2$. Furthermore, since $M_1 \sim_L M_2$, we have $\forall i \in \text{Nat}. M_1(x)(i) = M_2(x)(i)$. Therefore $t_1 = t'_1 @ \text{readarr}(x, n'_1, M_1(x)(n'_1)) = t'_2 @ \text{readarr}(x, n'_2, M_2(x)(n'_2)) = t_2$, and $n_1 = M_1(x)(n'_1) = M_2(x)(n'_2) = n_2$.

Finally, the conclusion is trivial for constant expression. □

For convenience, we define $lab : \text{Type} \to \text{SecLabels}$ as:
\[
lab(\tau) = \begin{cases} 
    l & \text{if } \tau = \text{Nat } l \\
    l & \text{if } \tau = \text{Array } l
\end{cases}
\]

Similar to Lemma 6, we can prove the following lemma:

**Lemma 7.** If \( \Gamma \vdash e : \text{Nat } l ; T \) and \( l \in \text{ORAMBanks} \), then for any two \( \Gamma \)-valid low-equivalent memories \( M_1, M_2 \), if \( \langle M_1, e \rangle \downarrow_{t_1} n_1, \langle M_2, e \rangle \downarrow_{t_2} n_2 \), then \( t_1 = t_2 \)

**Proof.** If \( l = L \), then the conclusion follows from Lemma 6, so we only consider \( l \in \text{ORAMBanks} \). We use structural induction on \( e \) to prove this lemma. If \( e \) is of the form \( x \), then according to the definition of \( \Gamma \)-validity and \( \text{evt}() \), we have \( t_1 = \text{lab}(\Gamma(x)) = t_2 \).

If \( e \) is in form of \( e_1 \ op e_2 \), then \( \Gamma \vdash e_1 : \text{Nat } l_1 \) and \( \Gamma \vdash e_2 : \text{Nat } l_2 \). Suppose \( \langle M_i, e_j \rangle \downarrow_{t_{ij}} n_{ij} \), for \( i = 1, 2, j = 1, 2 \). Then \( t_{1j} = t_{2j} \), for \( j = 1, 2 \) by induction. Therefore \( t_1 = t_{11}@t_{12} = t_{21}@t_{22} = t_2 \).

Finally, we consider an expression of the form \( x[e] \). We know that \( \Gamma \vdash e : \text{Nat } l' \). Suppose \( \langle M_i, e \rangle \downarrow_{t_i'} n_i' \). If \( l' = L \), then \( t_1' = t_2' \) by Lemma 6. Otherwise, \( l' \in \text{ORAMBanks} \), and by the induction assumption, we have \( t_1' = t_2' \). Since \( l \in \text{ORAMBanks} \), we know \( l = \text{lab}(\Gamma(x)) \), and thus \( t_1 = t_1'@l = t_2'@l = t_2 \). □
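
Lemmas 6 and 7 can be illustrated by a small evaluator that returns both the value and the emitted trace. This is a sketch under stated assumptions rather than the formal semantics: memories are dictionaries mapping a variable to a `(value, label)` pair, the label `"L"` marks public data and any other string names an ORAM bank, traces are flat lists, and the tuple-based expression syntax is hypothetical. Type checking is not modeled; the example inputs are simply chosen to respect the labeling discipline.

```python
import operator


def evt(label, event):
    """Observable event: the full event for label 'L', else only the bank name."""
    return [event] if label == "L" else [label]


def eval_expr(M, e):
    """Evaluate expression e under memory M and return (value, trace).

    Assumed expression forms: ("const", n), ("var", x),
    ("op", f, e1, e2) and ("index", x, e) for the array read x[e].
    """
    kind = e[0]
    if kind == "const":
        return e[1], []                      # constants emit no event
    if kind == "var":
        n, label = M[e[1]]
        return n, evt(label, ("read", e[1], n))
    if kind == "op":
        n1, t1 = eval_expr(M, e[2])
        n2, t2 = eval_expr(M, e[3])
        return e[1](n1, n2), t1 + t2         # traces are concatenated in order
    if kind == "index":
        i, t = eval_expr(M, e[2])
        arr, label = M[e[1]]
        v = arr[i]
        return v, t + evt(label, ("readarr", e[1], i, v))
    raise ValueError(f"unknown expression form: {kind}")


# Two low-equivalent memories: they agree on the public variable only.
M1 = {"pub": (5, "L"), "sec": (1, "o1"), "a": ([7, 8, 9], "o1")}
M2 = {"pub": (5, "L"), "sec": (2, "o1"), "a": ([1, 2, 3], "o1")}
expr = ("op", operator.add, ("var", "pub"), ("index", "a", ("var", "sec")))
v1, trace1 = eval_expr(M1, expr)
v2, trace2 = eval_expr(M2, expr)
```

In this run the two values differ because they depend on secret data, while the two traces coincide: the public read is identical in both memories and every secret access reveals only its bank name.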

Now we shall study the property of trace pattern equivalence. First of all, we have the following lemma:

**Lemma 8.** Suppose \( s \) and \( S \) are a statement and a labeled statement respectively. If \( \Gamma, l_0 \vdash S; T, l_0 \in \text{ORAMBanks} \) and \( \langle M, S \rangle \downarrow_t M' \), then \( M \sim_L M' \).
Proof. We prove by induction on the statement $S$. Notice that $S$ cannot be a \textbf{while} statement, since $l_0 \in \text{ORAMBanks}$. The conclusion is trivial for the statement \textbf{skip}.

If $s$ is $x := e$, then $l_0 \sqsubseteq \text{lab}(\Gamma(x))$, and thus $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$. Therefore $M' = M[x \mapsto (n,l)]$ for some natural number $n$ and some security label $l \in \text{ORAMBanks}$, which implies $M' \sim_L M$. Similarly, if $s$ is $x[e_1] := e_2$, then $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$. Furthermore, $\langle M, x[e_1] := e_2 \rangle \Downarrow_t M[x \mapsto (m,l)]$ for some mapping $m$ and some security label $l \in \text{ORAMBanks}$. Therefore $M' = M[x \mapsto (m,l)]$, which implies that for every $y$ such that $M(y) = (n,L)$, we have $M'(y) = (n,L)$. Therefore $M' \sim_L M$.

Next, let us consider the statement \textbf{if}($e,S_1,S_2$). Then one of the following two conditions holds: (1) $\langle M,S_1 \rangle \Downarrow_{t'} M'$, or (2) $\langle M,S_2 \rangle \Downarrow_{t'} M'$, for some trace $t'$. Since $\Gamma,l_0 \vdash \text{if}(e,S_1,S_2);T$, we have $\Gamma,l' \vdash S_1;T_1$, and $\Gamma,l' \vdash S_2;T_2$, where $l_0 \sqsubseteq l'$, and hence $l' \in \text{ORAMBanks}$. Therefore, by induction, in either case we have $M \sim_L M'$.

Finally, for a sequence of two statements $S_1;S_2$, suppose $\langle M,S_1 \rangle \Downarrow_{t_1} M_1$, and $\langle M_1,S_2 \rangle \Downarrow_{t_2} M'$. Then by induction $M \sim_L M_1 \sim_L M'$. □

According to the definition of trace pattern equivalence, it is easy to see that if $T \sim_L T'$, then $T$ is a sequence each of whose elements has the form $\text{Fetch}(p)$, $\text{Read}(x)$, $\epsilon$, or $o$.

We shall define a \textit{trace $t$ belongs to a trace pattern $T$, under a memory $M$}, denoted by $t \in T[M]$ as follows:
Lemma 9. \( t \in T[M] \) if and only if \(|t| = |T| \) and \( \forall i.t[i] \in (T[i])[M] \).

Proof. “⇐” Suppose \(|t| = |T| \) and \( \forall i.t[i] \in (T[i])[M] \). We prove by induction on \(\#(t)\). If \(\#(t) = 1\), then the conclusion is trivial. Assume the conclusion holds for all \(\#(t') < n\), and suppose \(\#(t) = n > 1\). Then we know \( t = t_1@t_2 \). If \( t_1 = \epsilon \), then we know \(|t_2| = |t| = |T| \) and \( \forall i.t_2[i] = t[i] \in (T[i])[M] \), so by induction, we know \( t_2 \in T[M] \). Furthermore, we have \( t_1 = \epsilon \in \epsilon[M] \), therefore \( t_1@t_2 \in \epsilon@T[M] \). Since \( \epsilon@T \sim_L T \), we have \( t = t_1@t_2 \in T[M] \). A similar argument shows that if \( t_2 = \epsilon \), then we also have \( t \in T[M] \).

Now let us consider when \(|t_1| = 0\). By induction, we have \( t_1 \in \epsilon[M] \) and \( t_2 \in T[M] \), and then again, we have \( t \in T[M] \). Similarly, if \(|t_2| = 0\), we also have \( t \in T[M] \).

Now assume \(|t_1| > 0 \) and \(|t_2| > 0\), and let \( T_a = (\ldots(T[1]@T[2])\ldots@T[|t_1|]) \) and \( T_b = (\ldots(T[|t_1|+1]@T[|t_1|+2])\ldots@T[|T|]) \). Then by induction, we know that \( t_1 \in T_a[M] \) and \( t_2 \in T_b[M] \), and thus \( t_1@t_2 \in T_a@T_b[M] \). According to Lemma 5, we have \( T_a@T_b \sim_L T \), and thus \( t = t_1@t_2 \in T[M] \).

“⇒” We prove by induction on the number of steps needed to derive \( t \in T[M] \). Suppose we need only 1 step; then one of the following conditions is true: (i) \( t = \epsilon = T \); (ii) \( t = o = T \); or (iii) \( t = \text{read}(x,n), T = \text{Read}(x) \) and \( M(x) = n \). In each case, the conclusion is trivial.

Then suppose we need \( n \) steps, and the last step is derived from \( t = t_1 @ t_2 \), \( T = T_1 @ T_2 \), \( t_1 \in T_1[M] \), and \( t_2 \in T_2[M] \). Then by induction we have \( |t_1| = |T_1| \), \( |t_2| = |T_2| \), \( \forall i.t_1[i] \in (T_1[i])[M] \), and \( \forall i.t_2[i] \in (T_2[i])[M] \); in particular, \( |t| = |T| \). For \( i < 1 \) or \( i > |T| \), we have \( t[i] = \epsilon = T[i] \), and thus \( t[i] \in (T[i])[M] \). If \( 1 \leq i \leq |T_1| \), then \( t[i] = t_1[i] \) and \( T[i] = T_1[i] \), and by induction, we have \( t[i] \in (T[i])[M] \); if \( |T_1| < i \leq |T| \), then \( t[i] = t_2[i - |T_1|] \) and \( T[i] = T_2[i - |T_1|] \), and by induction, we have \( t[i] \in (T[i])[M] \).

Finally, suppose we need \( n \) steps, and the last step is derived from \( t \in T'[M] \) and \( T' \sim_L T \). Then according to Lemma 5, we know that \( \forall i.T'[i] = T[i] \), which also implies that \( |T'| = |T| \). By induction, we have \( |t| = |T'| \) and \( \forall i.t[i] \in (T'[i])[M] \), and therefore, we have \( \forall i.t[i] \in (T[i])[M] \) and \( |t| = |T| \).
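
The elementwise characterization in Lemma 9 suggests a direct membership check. The sketch below relies on assumptions that are not in the text: traces and patterns are nested `("@", left, right)` tuples over `"eps"`, ORAM bank events are plain strings such as `"o1"`, a public read is `("read", x, n)`, its pattern is `("Read", x)`, and the memory is simplified to a dictionary from variables to values. Only the three base cases mentioned in the proof are covered.

```python
EPS = "eps"


def flatten(u):
    """List the non-empty elements of a trace or a trace pattern, in order."""
    if u == EPS:
        return []
    if isinstance(u, tuple) and u[0] == "@":
        return flatten(u[1]) + flatten(u[2])
    return [u]


def event_matches(ev, pat, M):
    """Check that a single event belongs to a single pattern element under M."""
    if isinstance(pat, str):
        # ORAM bank access: the event must be the same bank name.
        return ev == pat
    if pat[0] == "Read":
        # read(x, n) belongs to Read(x) exactly when M(x) = n.
        return ev == ("read", pat[1], M.get(pat[1]))
    return False


def belongs(t, T, M):
    """Decide whether t belongs to T under M: equal lengths, elementwise match."""
    ts, Ts = flatten(t), flatten(T)
    return len(ts) == len(Ts) and all(
        event_matches(ev, pat, M) for ev, pat in zip(ts, Ts)
    )
```

Under this encoding, any two traces accepted for the same pattern and memory flatten to the same event list, which is the content of Corollary 2.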

We have the following corollaries.

**Corollary 1.** If \( M_1 \sim_L M_2 \), and \( t \in T[M_1] \), then \( t \in T[M_2] \).

**Proof.** By Lemma 9, we only need to show that \( \forall i.t[i] \in (T[i])[M_2] \).

Let us prove by structural induction on how \( t \in T[M_1] \) is derived. If \( t = \epsilon = T \), or \( t = o = T \), or \( t = t_1 @ t_2 \) and \( T = T_1 @ T_2 \), then the conclusion is trivial. The only case we need to prove is \( t = \text{read}(x,n) \) and \( T = \text{Read}(x) \). In this case, since \( t \in T[M_1] \), we have \( M_1(x) = (n,L) \). Since \( M_1 \sim_L M_2 \), we know that \( M_2(x) = (n,L) \). Therefore, we have \( t = \text{read}(x,n) \in \text{Read}(x)[M_2] = T[M_2] \).

According to the definition of \( T[i] \), we know it is in one of the following three forms: \( \epsilon \), \( o \), or \( \text{Read}(x) \). If \( T[i] = \epsilon \), then we know \( i < 1 \) or \( i > |T| = |t| \), and therefore \( t[i] = \epsilon \). If \( T[i] = o \), then we know \( t[i] = o \). In both situations, we have \( t[i] \in (T[i])[M_2] \). Finally, if \( T[i] = \text{Read}(x) \), then we know \( t[i] = \text{read}(x, n) \) where \( n = M_1[x] \). Since \( M_1 \sim_L M_2 \), we have \( M_2[x] = n \), and thus \( t[i] \in (T[i])[M_2] \).

**Corollary 2.** If \( t_1 \in T[M] \) and \( t_2 \in T[M] \), then \( t_1 \equiv t_2 \).

**Proof.** Assume \( t_1 \in T[M] \), and \( t_2 \in T[M] \), according to Lemma 9, we have \( |t_1| = |T| = |t_2| \), \( \forall i. t_1[i] \in (T[i])[M] \), and \( \forall i. t_2[i] \in (T[i])[M] \). According to the definition of \( T[i] \), we know it is in one of the following three forms: \( \epsilon \), \( o \), or \( \text{Read} \). If \( T[i] = \epsilon \), then we know \( i < 1 \) or \( i > |T| = |t_1| = |t_2| \). Therefore \( t_1[i] = t_2[i] = \epsilon \). If \( T[i] = o \), then we know \( t_1[i] = t_2[i] = o \). Finally, if \( T[i] = \text{Read}(x) \), then we know \( t_1[i] = \text{read}(x, n_1) \), \( n_1 = M[x] \), \( t_2[i] = \text{read}(x, n_2) \), and \( n_2 = M[x] \). Therefore \( n_1 = n_2 \), and thus \( t_1[i] = t_2[i] \). Therefore \( \forall i. t_1[i] = t_2[i] \), and according to Lemma 4, we have \( t_1 \equiv t_2 \).

Then we have the following lemmas:

**Lemma 10.** Suppose \( \Gamma \vdash e : \tau; T, T \sim_L T' \) for some \( T' \), and memory \( M \) is \( \Gamma \)-valid. If \( \langle M, e \rangle \Downarrow_t n \), then \( t \in T[M] \).

**Proof.** We prove by structural induction on \( e \). If \( e \) is \( n \), then \( T = \epsilon = t \).

If \( e \) is \( x \), then \( T = \text{evt}(\text{lab}(\Gamma(x)), \text{Read}(x)) \). If \( \text{lab}(\Gamma(x)) = l \in \text{ORAMBanks} \), then \( t = l \in l[M] \). If \( \text{lab}(\Gamma(x)) = L \), then \( T = \text{Read}(x) \), and \( t = \text{read}(x, n) \), where \( M(x) = (n, L) \). According to the definition, we know \( t \in T[M] \).

If \( e \) is \( e_1 \ op \ e_2 \), then suppose \( \langle M, e_i \rangle \Downarrow_{t_i} n_i \) and \( \Gamma \vdash e_i : \text{Nat } l_i; T_i \) for \( i = 1, 2 \). Then according to the induction assumption, we have \( t_i \in T_i[M] \) for \( i = 1, 2 \). Since \( t = t_1 @ t_2 \) and \( T = T_1 @ T_2 \), we know \( t \in T[M] \).

Next we consider $x[e']$. Suppose $\Gamma \vdash e' : \text{Nat } l'; T'$, and $\langle M, e' \rangle \Downarrow_{t'} n'$, then $T = T' @ \text{evt}(\text{lab}(\Gamma(x)), \text{Readarr}(x))$, and $t = t' @ \text{evt}(\text{lab}(\Gamma(x)), \text{readarr}(x, n', n''))$ for some $n''$. Moreover, we have $t' \in T'[M]$ by induction. Since $T$ is equivalent to some trace pattern by assumption, $T$ does not contain $\text{Readarr}(x)$, and thus $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$. Therefore $t = t' @ \text{lab}(\Gamma(x)) \in T' @ \text{lab}(\Gamma(x))[M] = T[M]$. □

**Lemma 11.** Assume $\Gamma, l_0 \vdash S; T$, $T \sim_L T'$ for some $T'$, $l_0 \in \text{ORAMBanks}$, and $M$ is a $\Gamma$-valid memory. If $\langle M, S \rangle \Downarrow_{t} M'$, then $t \in T[M]$.

**Proof.** We prove by structural induction on the statement $S$. Since $l_0 \neq L$, we know $S$ cannot be a \textsf{while} statement. If $S$ is \textsf{skip}, then $T = \epsilon = t$.

Let us consider when $S$ is $x := e$. Then $\langle M, e \rangle \Downarrow_{t'} n'$, $\Gamma \vdash e : \tau; T'$, and $T = T' @ \text{evt}(\text{lab}(\Gamma(x)), \text{Write}(x))$. Since $T$ is equivalent to some trace pattern by assumption, $T$ does not contain \textsf{Write}(x), and thus $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$. Since $t' \in T'[M]$ by Lemma 10, we have $t = t' @ \text{lab}(\Gamma(x)) \in T' @ \text{lab}(\Gamma(x))[M] = T[M]$.

Next, suppose $S$ is $x[e_1] := e_2$. Suppose $\langle M, e_i \rangle \Downarrow_{t_i} n_i$, and $\Gamma \vdash e_i : \tau; T_i$ for $i = 1, 2$. Then by Lemma 10, $t_i \in T_i[M]$ for $i = 1, 2$. Similar to the discussion for $x := e$, we know $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$, and thus $t = t_1 @ t_2 @ \text{lab}(\Gamma(x)) \in T_1 @ T_2 @ \text{lab}(\Gamma(x))[M] = T[M]$.

Next, let us consider $\textsf{if}(e, S_1, S_2)$. Then $\Gamma, l_0 \vdash S_i; T_i$ for $i = 1, 2$, and $T_1 \sim_L T_2$. Moreover, $\Gamma \vdash e : \tau; T_e$, $\langle M, e \rangle \Downarrow_{t_e} n_e$, $\text{idx} = \text{ite}(n_e, 1, 2)$, and $\langle M, S_{\text{idx}} \rangle \Downarrow_{t_{\text{idx}}} M'$. Then $T = T_e @ T_1$, and $t_e \in T_e[M]$ by Lemma 10. If $\text{idx} = 1$, then $\langle M, S_1 \rangle \Downarrow_{t_1} M'$, and thus by induction $t_1 \in T_1[M]$. Therefore $t = t_e@t_1 \in T_e@T_1[M] = T[M]$. Similarly, if $\text{idx} = 2$, then $\langle M, S_2 \rangle \Downarrow_{t_2} M'$, and thus $t_2 \in T_2[M]$. Since $T_1 \sim_L T_2$, we have $t_2 \in T_1[M]$. In conclusion, $t = t_e@t_2 \in T_e@T_1[M] = T[M]$.

Finally, suppose \( S \) is \( S_1; S_2 \). Then we know \( \Gamma, l_0 \vdash S_i; T_i \) for \( i = 1, 2 \), \( \langle M, S_1 \rangle \Downarrow_{t_1} M' \), and \( \langle M', S_2 \rangle \Downarrow_{t_2} M'' \). Since \( l_0 \in \text{ORAMBanks} \), by Lemma 8 we know \( M \sim_L M' \sim_L M'' \). By the induction assumption, we know \( t_1 \in T_1[M] \), and \( t_2 \in T_2[M'] \). Since \( M \sim_L M' \), according to Corollary 1, we know \( t_2 \in T_2[M] \). Therefore \( t = t_1@t_2 \in T_1@T_2[M] = T[M] \). □

**Lemma 12.** Suppose \( \Gamma, l_0 \vdash S_i; T_i \) for \( i = 1, 2 \), where \( l_0 \in \text{ORAMBanks} \), and \( T_1 \sim_L T_2 \). Given two \( \Gamma \)-valid low-equivalent memories \( M_1, M_2 \), if \( \langle M_1, S_1 \rangle \Downarrow_{t_1} M'_1 \), and \( \langle M_2, S_2 \rangle \Downarrow_{t_2} M'_2 \), then \( M'_1 \sim_L M'_2 \), and \( t_1 \equiv t_2 \).

**Proof.** According to Lemma 11, we know that \( t_i \in T_i[M_i] \) for \( i = 1, 2 \). According to Lemma 8, we know that \( M'_1 \sim_L M_1 \) and \( M'_2 \sim_L M_2 \). Since \( M_1 \sim_L M_2 \), we know that \( M'_1 \sim_L M_1 \sim_L M_2 \sim_L M'_2 \). Because \( t_1 \in T_1[M_1] \) and \( M_1 \sim_L M_2 \), Corollary 1 gives \( t_1 \in T_1[M_2] \). Furthermore, since \( T_1 \sim_L T_2 \), we have \( t_1 \in T_2[M_2] \). Finally, since \( t_2 \in T_2[M_2] \), according to Corollary 2 we have \( t_1 \equiv t_2 \). □

Now we are ready to prove Theorem 1.

**Proof of Theorem 1.** We extend the conclusion to cover both normal statements and labeled statements, and prove it by induction on the statement \( s \). For notational convenience, we suppose \( \langle M_1, s \rangle \Downarrow_{t_1} M'_1 \) and \( \langle M_2, s \rangle \Downarrow_{t_2} M'_2 \), where \( \Gamma, l_0 \vdash s; T \), \( M_1 \sim_L M_2 \), and both memories are \( \Gamma \)-valid. Our goal is to prove \( t_1 \equiv t_2 \) and \( M'_1 \sim_L M'_2 \).

If $s$ is skip, the conclusion is obvious.

Suppose $s$ is $x := e$, then $\Gamma \vdash e : \text{Nat } l; T$. Suppose $\langle M_i, e \rangle \Downarrow_{t'_i} n_i$, for $i = 1, 2$. According to Lemma 6 and Lemma 7, we know $t'_1 = t'_2$. If $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$, then $M'_1 = M_1[x \mapsto (n_1, l_1)] \sim_L M_1 \sim_L M_2 \sim_L M_2[x \mapsto (n_2, l_2)] = M'_2$, and $t_1 = t'_1@\text{lab}(\Gamma(x)) = t'_2@\text{lab}(\Gamma(x)) = t_2$, which implies $t_1 \equiv t_2$.

If $\Gamma(x) = \text{Nat } L$, then we know $\Gamma \vdash e : \text{Nat } L; T$, and according to Lemma 6, we have $n_1 = n_2$. Then we also have $M'_1 = M_1[x \mapsto n_1] \sim_L M_2[x \mapsto n_2] = M'_2$, and $t_1 = t'_1@\text{write}(x, n_1) \equiv t'_2@\text{write}(x, n_2) = t_2$.

Next, suppose $s$ is $x[e_1] := e_2$. Suppose $\langle M_i, e_j \rangle \Downarrow_{t_{ij}} n_{ij}$ for $i = 1, 2$, $j = 1, 2$. If $\text{lab}(\Gamma(x)) = L$, then we know $\Gamma \vdash e_j : \text{Nat } L$, for $j = 1, 2$. Then by Lemma 6, we have $t_{1j} = t_{2j}$, which implies $t_{1j} \equiv t_{2j}$, and $n_{1j} = n_{2j}$; moreover, according to the definition of $\Gamma$-validity and low-equivalence, $\forall i. M_1(x)(i) = M_2(x)(i)$. Therefore $t_1 = t_{11}@t_{12}@\text{writearr}(x, n_{11}, n_{12}) \equiv t_{21}@t_{22}@\text{writearr}(x, n_{21}, n_{22}) = t_2$, and $M'_1 = M_1[x \mapsto M_1(x)[n_{11} \mapsto n_{12}]] \sim_L M_2[x \mapsto M_2(x)[n_{21} \mapsto n_{22}]] = M'_2$.

Otherwise, if $\text{lab}(\Gamma(x)) \in \text{ORAMBanks}$, suppose $\Gamma \vdash e_i : \text{Nat } l_i; T_i$, for $i = 1, 2$. Then we know $l_0 \sqcup l_1 \sqcup l_2 \sqsubseteq \text{lab}(\Gamma(x))$. Therefore, by Lemma 7, based on the same reasoning as above for the $\text{Nat } l$ case, we have $t_1 = t_{11}@t_{12}@\text{lab}(\Gamma(x)) \equiv t_{21}@t_{22}@\text{lab}(\Gamma(x)) = t_2$. Furthermore, $M'_1 = M_1[x \mapsto m_1] \sim_L M_1 \sim_L M_2 \sim_L M_2[x \mapsto m_2] = M'_2$ for some two mappings, $m_1$ and $m_2$.

Then suppose the statement is if$(e, S_1, S_2)$. There are two situations. If $\Gamma \vdash e : \text{Nat } l_e; T_e$, where $l_e \sqcup l_0 \in \text{ORAMBanks}$, then according to Lemma 12, we know $M'_1 \sim_L M'_2$, and $t_1 \equiv t_2$. Otherwise, we have $l_e = L$ and $l_0 = L$. Suppose $\langle M_i, e \rangle \Downarrow_{t'_i} n_i$, for $i = 1, 2$, then according to Lemma 6, we know $t'_1 = t'_2$, which implies $t'_1 \equiv t'_2$, and $n_1 = n_2$. If $\text{ite}(n_1, 1, 2) = 1$, then we know $\langle M_1, S_1 \rangle \Downarrow_{t''_1} M'_1$ and $\langle M_2, S_1 \rangle \Downarrow_{t''_2} M'_2$. By induction, $t''_1 \equiv t''_2$ and $M'_1 \sim_L M'_2$. Therefore $t_1 = t'_1 @ t''_1 \equiv t'_2 @ t''_2 = t_2$. We can show the conclusion for $\text{ite}(n_1, 1, 2) = 2$ similarly.

Next, let us consider the statement $\text{while}(e, S)$. We know $\Gamma \vdash e : \text{Nat} \ L; T$, therefore by Lemma 6 there exist a constant $n$ and a trace $t$ such that $\langle M_i, e \rangle \Downarrow_{t} n$ for both $i = 1, 2$.

We prove by induction on the number of applications of the S-WhileT and S-WhileF rules (WHILE rules for short) needed to derive $\langle M_i, \text{while}(e, S) \rangle \Downarrow_{t_i} M'_i$. If only one application is needed, then it must be the S-WhileF rule, and thus $n = 0$. Then we have $t_1 = t = t_2$, and $M'_1 = M_1 \sim_L M_2 = M'_2$. Suppose the conclusion is true when $m - 1$ applications of WHILE rules are needed, and consider the case where $m > 1$ applications are needed. Then we know $n \neq 0$. Suppose $\langle M_i, S \rangle \Downarrow_{t_{i1}} M_{i1}$, and $\langle M_{i1}, \text{while}(e, S) \rangle \Downarrow_{t_{i2}} M'_i$, for $i = 1, 2$. Then deriving $\langle M_{i1}, \text{while}(e, S) \rangle \Downarrow_{t_{i2}} M'_i$ requires $m - 1$ applications of WHILE rules. By induction, we have $t_{11} \equiv t_{21}$ and $M_{11} \sim_L M_{21}$, and hence $t_{12} \equiv t_{22}$ and $M'_1 \sim_L M'_2$. Therefore $t_1 = t @ t_{11} @ t_{12} \equiv t @ t_{21} @ t_{22} = t_2$.

Finally, let us consider $S_1; S_2$. Suppose $\Gamma, l_0 \vdash S_1; T_1$, $\Gamma, l_0 \vdash S_2; T_2$, $\langle M_i, S_1 \rangle \Downarrow_{t_{i1}} M_{i1}$, and $\langle M_{i1}, S_2 \rangle \Downarrow_{t_{i2}} M'_i$, for $i = 1, 2$. Then by the induction assumption, we have $t_{11} \equiv t_{21}$ and $M_{11} \sim_L M_{21}$, and thus $t_{12} \equiv t_{22}$ and $M'_1 \sim_L M'_2$. Therefore $t_1 = t_{11} @ t_{12} \equiv t_{21} @ t_{22} = t_2$. $\square$
Appendix B: Proof of Theorem 2

Our proof proceeds in two steps. First, we prove a terminating version (Theorem 6) of Theorem 2. By having the assumption that the program will finally terminate, i.e. its execution does not end in an infinite loop, we will show how our type system enforces the MTO property. Then by using Theorem 6, we will show the MTO property holds for non-terminating programs, (i.e. Theorem 7), which implies Theorem 2 as an obvious corollary.

We start with the terminating case. We first prove some useful lemmas for terminating programs.

**Lemma 13.** For all $I, \ell, \Upsilon, \text{Sym}, \Upsilon', \text{Sym}', T$, such that

$$\ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T$$

if there is some $i$ such that $I(i) = \textbf{jmp} \ n$, then $0 \leq i + n \leq |I|$.

**Proof.** We prove by induction on

$$\ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T$$

Clearly, the judgement cannot be derived by rules T-LOAD, T-STORE, T-LOADW, T-STOREW, T-IDB, T-BOP, T-ASSIGN, or T-NOP, since $I$ contains a $\textbf{jmp}$ instruction.

If this judgement is derived using rule T-SEQ, then we know $I = I_1; I_2$, and we have

$$\ell \vdash I_1 : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon'', Sym'' \rangle; T_1$$

$$\ell \vdash I_2 : \langle \Upsilon'', Sym'' \rangle \rightarrow \langle \Upsilon', Sym' \rangle; T_2$$

Then either $i < |I_1|$, or $|I_1| \leq i < |I_1| + |I_2| = |I|$. For the first case, we have $I_1(i) = \texttt{jmp } n$, and thus by induction, we have $0 \leq i + n \leq |I_1| \leq |I|$. For the second case, we have $I_2(i - |I_1|) = \texttt{jmp } n$, and thus by the induction assumption, we have $0 \leq i - |I_1| + n \leq |I_2|$. Therefore, we have $0 \leq |I_1| \leq i + n \leq |I_1| + |I_2| = |I|.$

If this judgement is derived by rule T-IF, then we know $I = \iota_1; I_t; \iota_2; I_f$, and

$$\iota_1 = \texttt{br } r_1 \texttt{ rop } r_2 \hookrightarrow n_1$$

$$\iota_2 = \texttt{jmp } n_2$$

$$n_1 - 2 = |I_t|$$

$$n_2 = |I_f| + 1$$

Then, there are three possible scenarios:

1. $1 \leq i < 1 + |I_t|$. In this case, we know $I_t(i-1) = \texttt{jmp } n$, and by induction $0 \leq i-1+n \leq |I_t|$. Therefore $0 < 1 \leq i + n \leq |I_t| + 1 < |I|$;

2. $i = 1 + |I_t|$. In this case, $I(i) = \iota_2$ and $n = n_2$. Therefore, we have $0 < i + n_2 = 2 + |I_t| + |I_f| = |I|$;

3. $2 + |I_t| \leq i < 2 + |I_t| + |I_f| = |I|$. In this case, we can prove the conclusion similarly to Case 1.

If this judgement is derived by rule T-WHILE, then we can prove the conclusion similarly to the T-IF case.

Finally, if the judgement is derived by rule T-SUB, then the result follows by induction. □
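
The bound established by Lemma 13 can be checked mechanically on concrete instruction sequences. The sketch below uses a hypothetical tuple encoding of instructions (`("nop",)`, `("jmp", n)`, `("br", n)`, with branch conditions omitted) and a helper `if_block` that lays out a conditional as in the T-IF case, with $n_1 = |I_t| + 2$ and $n_2 = |I_f| + 1$. It also checks `br` offsets, which is the analogous property for branches.

```python
def if_block(I_t, I_f):
    """Lay out a conditional as br n1; I_t; jmp n2; I_f (T-IF shape)."""
    return [("br", len(I_t) + 2)] + I_t + [("jmp", len(I_f) + 1)] + I_f


def jumps_in_bounds(I):
    """Check 0 <= i + n <= |I| for every jmp n or br n at position i."""
    return all(
        0 <= i + ins[1] <= len(I)
        for i, ins in enumerate(I)
        if ins[0] in ("jmp", "br")
    )


# A conditional nested inside the true branch of another conditional.
inner = if_block([("nop",)], [("nop",), ("nop",)])
outer = if_block(inner, [("nop",)])
```

Here `jumps_in_bounds(outer)` holds, including for the jumps of `inner` once they are embedded at an offset inside `outer`, whereas a hand-written sequence such as `[("jmp", 5), ("nop",)]` fails the check.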

**Lemma 14.** For all $I, \ell, \Upsilon, \text{Sym}, \Upsilon', \text{Sym}', T$, such that

$$\ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T$$

if there is some $i$ such that $I(i) = \textbf{br} \ r_1 \ \textbf{rop} \ r_2 \hookrightarrow n$, then $0 \leq i + n \leq |I|$.

*Proof (sketch).* The proof is similar to the proof for Lemma 13. □

**Lemma 15.** Given $I = I_1; I_2; I_3$, and $R_i, S_i, M_i, pc_i$ for $i = 1, \ldots, k + 1$, and $t_i$ for $i = 1, \ldots, k$, such that

\[ |I_1| \leq pc_i < |I_1| + |I_2| \quad \forall i \in \{1, \ldots, k\} \]

\[ |I_1| \leq pc_{k+1} \leq |I_1| + |I_2| \]

and

\[ |t_i| = 1 \quad \forall i \in \{1, \ldots, k\}. \]

Then

\[ I \vdash (R_i, S_i, M_i, pc_i) \rightarrow_{t_i} (R_{i+1}, S_{i+1}, M_{i+1}, pc_{i+1}) \]

holds true for $i = 1, \ldots, k$, if and only if

\[
I_2 \vdash (R_i, S_i, M_i, pc_i - |I_1|) \rightarrow_{t_i} (R_{i+1}, S_{i+1}, M_{i+1}, pc_{i+1} - |I_1|)
\]

holds true for $i = 1, \ldots, k$.

Proof. We only prove the only-if-direction, and the if-direction is similar. We prove by induction on $k$. We first prove for $k = 1$. Consider each possibility for $I(pc_1)$.

Case ldb $k \leftarrow l[r]$. By rule LOAD, we know

\[
n = |R(r) \mod \text{size}(l)|
\]

\[
b = M(l, n)
\]

\[
S_2 = S_1[k \mapsto (b, (l, n))]
\]

\[
t_1 = \text{select}(l, \text{read}(n, b), \text{eread}(n), l)
\]

\[
R_2 = R_1 \quad M_2 = M_1 \quad pc_2 = pc_1 + 1
\]

Clearly, we have $pc_2 - |I_1| = pc_1 - |I_1| + 1$, and $I_2(pc_1 - |I_1|) = \text{ldb } k \leftarrow l[r]$, and thus

\[
I_2 \vdash (R_1, S_1, M_1, pc_1 - |I_1|) \rightarrow_{t_1} (R_2, S_2, M_2, pc_2 - |I_1|)
\]

Cases stb $k$, $k \leftarrow$ idb $r$, ldw $r_1 \leftarrow k[r_2]$, stw $r_1 \rightarrow k[r_2]$, $r_1 \leftarrow r_2 \text{ op } r_3$, $r_1 \leftarrow n$, or nop are similar.

Case jmp \( n' \). We have

\[
R_2 = R_1 \quad S_2 = S_1 \quad M_2 = M_1 \quad pc_2 = pc_1 + n'
\]

Since \( pc_2 - |I_1| = pc_1 - |I_1| + n' \), the conclusion is obvious. We can prove the case \( I(pc_1) = \textbf{br} \ r_1 \ rop \ r_2 \hookrightarrow n' \) similarly.

Now, we assume the conclusion holds true for \( k \leq k' \). For \( k = k' + 1 \), we know

\[
I_2 \vdash (R_i, S_i, M_i, pc_i - |I_1|) \rightarrow_{t_i} (R_{i+1}, S_{i+1}, M_{i+1}, pc_{i+1} - |I_1|)
\]

holds true for \( i = 1, \ldots, k' \) by the induction assumption. Further, since the conclusion holds for \( k = 1 \), we have

\[
I_2 \vdash (R_{k'+1}, S_{k'+1}, M_{k'+1}, pc_{k'+1} - |I_1|) \rightarrow_{t_{k'+1}} (R_{k'+2}, S_{k'+2}, M_{k'+2}, pc_{k'+2} - |I_1|)
\]

Therefore, the conclusion holds true.

\[\square\]

Using Lemma 13, 14, 15, we can prove the following important lemma.

**Lemma 16.** Given \( I = I_a; I'; I_b \), where either of \( I_a \) and \( I_b \) can be empty, and \( \ell, \ell' \), \( \Upsilon_1, \Upsilon'_1, \Upsilon_2, \Upsilon'_2, \text{Sym}_1, \text{Sym}'_1, \text{Sym}_2, \text{Sym}'_2, T, T' \), such that

\[\ell \vdash I : \langle \Upsilon_1, \text{Sym}_1 \rangle \rightarrow \langle \Upsilon_2, \text{Sym}_2 \rangle; T\]
and

\[ \ell' \vdash I' : \langle \Upsilon'_1, Sym'_1 \rangle \rightarrow \langle \Upsilon'_2, Sym'_2 \rangle; T' \]

If for \( pc = |I_a| \) and \( pc' \), where \( pc' < |I_a| \) or \( pc' \geq |I_a| + |I'| \), such that

\[ I \vdash (R, S, M, pc) \rightarrow_{t} (R', S', M', pc'), \]

then there exists \( R'', S'', M'', pc'' \), \( t', t'' \), such that \( t = t'@t'' \) (where \( t'' \) can be \( \epsilon \)), and

\[ I \vdash (R, S, M, pc) \rightarrow_{t'} (R'', S'', M'', pc'') \]

\[ I \vdash (R'', S'', M'', pc'') \rightarrow_{t''} (R', S', M', pc') \]

\[ pc'' = |I_a| + |I'| \quad t \equiv t'@t'' \quad |t'| > 0 \quad |t''| \geq 0 \]

Intuitively, this lemma says the following. Suppose a type-checked program \( I \) contains a type-checked segment \( I' \). If execution starts at the beginning of the segment \( I' \) (i.e. \( pc = |I_a| \)) and ends outside the segment (i.e. \( pc' < |I_a| \) or \( pc' \geq |I_a| + |I'| \)), then it must first reach the end of the segment (i.e. \( pc'' = |I_a| + |I'| \)).

Notice the correctness of this lemma does not depend on whether \( I_a \) or \( I_b \) can type-check. Further, none or either or even both of \( I_a \) and \( I_b \) can be empty. In the last case, the conclusion is trivial.
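
The intuition behind Lemma 16 can be observed on a toy machine. The sketch below is an illustration under simplifying assumptions, not the semantics of $\mathcal{L}_T$: instructions are the hypothetical tuples `("nop",)` and `("jmp", n)`, registers, scratchpad and memory are ignored, and the function `run_until_exit` simply steps the program counter until it leaves the segment occupying positions `lo` to `hi - 1`.

```python
def run_until_exit(I, lo, hi, pc, max_steps=1000):
    """Step from pc until pc leaves [lo, hi); return the first pc outside.

    The step bound max_steps is only a safeguard for this sketch.
    """
    for _ in range(max_steps):
        if not (lo <= pc < hi):
            return pc
        ins = I[pc]
        # jmp n adds n to the program counter; every other instruction adds 1.
        pc += ins[1] if ins[0] == "jmp" else 1
    raise RuntimeError("segment did not exit within max_steps")


I_a = [("nop",), ("nop",)]
segment = [("jmp", 2), ("nop",), ("nop",)]   # all jumps stay within [0, |I'|]
I_b = [("nop",)]
program = I_a + segment + I_b

exit_pc = run_until_exit(program, len(I_a), len(I_a) + len(segment), len(I_a))
```

Because every jump inside `segment` targets a position between 0 and its length, the first program counter observed outside the segment is `len(I_a) + len(segment)`, matching the value of $pc''$ in the lemma.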

Now we prove this lemma.

Proof. We suppose for $k = |t| - 1$,

$$I \vdash (R_i, S_i, M_i, pc_i) \rightarrow_{t_i} (R_{i+1}, S_{i+1}, M_{i+1}, pc_{i+1})$$

for $i = 0, ..., k$, where

$$|t_i| = 1 \quad \forall i \in \{0, ..., k\}$$

$$R_0 = R \quad S_0 = S \quad M_0 = M \quad pc_0 = pc$$

$$R_{k+1} = R' \quad S_{k+1} = S' \quad M_{k+1} = M' \quad pc_{k+1} = pc'$$

Suppose $i^*$ is the smallest one such that $pc_{i^*} < |I_a|$ or $pc_{i^*} \geq |I_a| + |I'|$. Then, by Lemma 15, we know

$$I' \vdash (R_i, S_i, M_i, \hat{pc}_i) \rightarrow_{t_i} (R_{i+1}, S_{i+1}, M_{i+1}, \hat{pc}_{i+1})$$

for $i = 0, ..., i^* - 2$, where $\hat{pc}_i = pc_i - |I_a|$. Clearly, we know $0 \leq \hat{pc}_{i^*-1} < |I'|$ by the definition of $i^*$. Now, we consider the instruction $I'(\hat{pc}_{i^*-1})$. If it is not jmp $n'$ or br $r_1$ rop $r_2 \hookrightarrow n'$, then we know

$$pc_{i^*} = pc_{i^*-1} + 1 > |I_a|.$$ 

Therefore

$$pc_{i^*} \geq |I_a| + |I'|$$

Further, since $pc_{i^*-1} - |I_a| = \hat{pc}_{i^*-1} < |I'|$, we have

$$pc_{i^*} \leq |I_a| + |I'|$$

Therefore, we have $pc_{i^*} = |I_a| + |I'|$, and thus $R'' = R_{i^*}$, $S'' = S_{i^*}$, $M'' = M_{i^*}$, $pc'' = pc_{i^*}$, $t' = t_0@...@t_{i^*-1}$, and $t'' = t_{i^*}@...@t_k$ satisfy the property.

If $I'(\hat{pc}_{i^*-1}) = \textbf{jmp } n'$, then by Lemma 13, we have

$$0 \leq \hat{pc}_{i^*-1} + n' \leq |I'|$$

Therefore, we have

$$pc_{i^*} = pc_{i^*-1} + n' = \hat{pc}_{i^*-1} + |I_a| + n' \leq |I_a| + |I'|$$

and

$$pc_{i^*} = pc_{i^*-1} + n' = \hat{pc}_{i^*-1} + |I_a| + n' \geq |I_a|$$

Therefore, by the same argument as above, we know $pc_{i^*} = |I_a| + |I'|$, and the conclusion is true.

Finally, if $I'(\hat{pc}_{i^*-1}) = \textbf{br } r_1 \textbf{ rop } r_2 \hookrightarrow n'$, we can prove the conclusion similarly using Lemma 14. □

Now we can state and prove Theorem 6. Some judgments mentioned in the theorem are defined in Figure B.1.

**Theorem 6.** Given a program $I$ in $\mathcal{L}_T$, such that $\ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T$,
\[
\begin{align*}
\text{mem}((b, (l, n))) & = l \\
\text{idx}((b, (l, n))) & = n \\
\text{block}((b, (l, n))) & = b
\end{align*}
\]

\[
\forall k \in \text{BlockIDs}, \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\]

\[
\begin{align*}
\forall r \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]

\[
\begin{align*}
\forall k \in \text{BlockIDs}. \\
\forall \ell \in \text{Registers}. \text{Sym}(r) = \text{L} \Rightarrow R_1(r) = R_2(r)
\end{align*}
\]
and

\[ I \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1} (R_1', S_1', M_1', pc') \]

\[ I \vdash (R_2, S_2, M_2, 0) \rightarrow_{t_2} (R_2', S_2', M_2', pc'') \]

where \( pc' = pc'' = |I| \), then we have the following conclusions:

1. \( M_1' \sim_L M_2' \);

2. \( \Upsilon' \vdash R_1' \sim R_2' \);

3. \( \Upsilon' \vdash S_1' \sim S_2' \);

4. \( \forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}(r)) \land \vdash_{\text{safe}} \text{Sym}'(r) \Rightarrow (S_i', \text{Sym}'(r)) \Downarrow R_i'(r) \);

5. \( \forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}(k)) \land \vdash_{\text{safe}} \text{Sym}'(k) \Rightarrow \exists n. (S_i', \text{Sym}'(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon'(k))| = \text{idx}(S_i'(k)) \);

6. \( t_1 \equiv t_2 \)

We first provide an intuition of this theorem.

**Proof of Theorem 6.** We prove Theorem 6 by induction on the length of the derivation of

\[ \ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T \]

**Case T-SEQ.** Then \( I = I_1; I_2 \), and by inversion, we have

\[
\ell \vdash I_1 : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon_1, \text{Sym}_1 \rangle; T_1
\]
\[
\ell \vdash I_2 : \langle \Upsilon_1, \text{Sym}_1 \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T_2
\]

By Lemma 15 and Lemma 16, we know

\[
I_1 \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1^1} (R_1'', S_1'', M_1'', |I_1|)
\]
\[
I_2 \vdash (R_1'', S_1'', M_1'', 0) \rightarrow_{t_1^2} (R_1', S_1', M_1', |I_2|)
\]

Similarly, we have

\[
I_1 \vdash (R_2, S_2, M_2, 0) \rightarrow_{t_2^1} (R_2'', S_2'', M_2'', |I_1|)
\]
\[
I_2 \vdash (R_2'', S_2'', M_2'', 0) \rightarrow_{t_2^2} (R_2', S_2', M_2', |I_2|)
\]

where \( t_i = t_i^1 @ t_i^2 \) for \( i = 1, 2 \). By the induction assumption applied to \( I_1 \), we have

1. \(M_1'' \sim_L M_2''\);

2. \(\Upsilon_1 \vdash R_1'' \sim R_2''\);

3. \(\Upsilon_1 \vdash S_1'' \sim S_2''\);

4. \(\forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}_1(r)) \land \vdash_{\text{safe}} \text{Sym}_1(r) \Rightarrow (S_i'', \text{Sym}_1(r)) \Downarrow R_i''(r)\);

5. \(\forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}_1(k)) \land \vdash_{\text{safe}} \text{Sym}_1(k) \Rightarrow \exists n. (S_i'', \text{Sym}_1(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon_1(k))| = \text{idx}(S_i''(k))\);

6. \( t_1^1 \equiv t_2^1 \).

By the induction assumption applied to \( I_2 \), we have conclusions 1-5, and \( t_1^2 \equiv t_2^2 \). Therefore, we know \( t_1 = t_1^1 @ t_1^2 \equiv t_2^1 @ t_2^2 = t_2 \), which is conclusion 6.

**Case T-LOAD.** Then \( I = \text{ldb} \ k \leftarrow l[r] \), and \( pc' = pc'' = 1 \). Further, by inversion, we know

\[
\begin{align*}
n_i &= |R_i(r) \mod \text{size}(l)| \\
b_i &= M_i(l, n_i) \quad S'_i = S_i[k \mapsto (b_i, (l, n_i))] \\
t_i &= \text{select}(l, \text{read}(n_i, b_i), \text{eread}(n_i), l) \quad i = 1, 2
\end{align*}
\]

We prove that conclusions 1-6 hold. Conclusion 1 holds trivially, since the memories are not changed, i.e. \( M_1' = M_1 \sim_L M_2 = M_2' \); conclusions 2 and 4 hold for the same reason, since the registers are not changed.

We prove conclusion 3 as follows. First, we know \( \Upsilon'(k) = l = \text{mem}(S'_1(k)) = \text{mem}(S'_2(k)) \). If \( \Upsilon'(k) = D \) or \( \Upsilon'(k) = E \), then we know \( l \not\in \text{ORAMBanks} \), and thus \( \Upsilon'(r) = L \), which implies that \( \text{idx}(S'_1(k)) = n_1 = n_2 = \text{idx}(S'_2(k)) \). Further, if \( \Upsilon'(k) = l = D \), then we know \( b_1 = M_1(D, n_1) = M_2(D, n_2) = b_2 \) (due to \( M_1 \sim_L M_2 \)), which implies \( S'_1(k) = S'_2(k) \). Therefore, we know \( \Upsilon' \vdash S'_1 \sim S'_2 \).

For conclusion 5, for all \( k' \neq k \), we have \( Sym'(k') = Sym(k') \) and \( S'_i(k') = S_i(k') \) \((i = 1, 2)\), so conclusion 5 holds for \( k' \). For \( k \), if \( \vdash_{\text{safe}} Sym'(k) \) does not hold, then the conclusion is trivial. Now we assume \( \vdash_{\text{safe}} Sym'(k) \), which holds if and only if \( \vdash_{\text{safe}} Sym(r) \). By assumption 6, we know \( (S_i, Sym(r)) \Downarrow R_i(r) \). Since \( |R_i(r) \mod \text{size}(\Upsilon'(k))| = |R_i(r) \mod \text{size}(l)| = n_i \), we know conclusion 5 holds true.

For conclusion 6, if \( l = D \), then following the discussion for conclusion 3, we know \( n_1 = n_2 \) and \( b_1 = b_2 \), and thus \( t_1 = \text{read}(n_1, b_1) = \text{read}(n_2, b_2) = t_2 \). If \( l = E \), then similarly we know \( n_1 = n_2 \), and thus \( t_1 = \text{eread}(n_1) = \text{eread}(n_2) = t_2 \). If \( l \in \text{ORAMBanks} \), then we know \( t_1 = l = t_2 \). Thus conclusion 6 holds true.

**Case T-STORE.** If \( I = \text{stb} \ k \), and \( I \) is typed using rule T-STORE, then we have \( pc' = pc'' = 1 \). Further, we know

\[
(b_i, a_i) = S_i(k) \quad a_i = (l_i, n_i) \quad M'_i = M_i[a_i \mapsto b_i]
\]

\[
t_i = \text{select}(l_i, \text{write}(n_i, b_i), \text{ewrite}(n_i), l_i) \quad i = 1, 2
\]

Conclusions 2-5 are trivial, since registers and scratchpads are not changed. We first prove conclusion 6. By assumption 3, we know \( l_1 = l_2 \). If \( l_1 = D \), then we know \( t_1 = \text{write}(n_1, b_1) \), and \( t_2 = \text{write}(n_2, b_2) \). By assumption 4, we know \( n_1 = n_2 \) and \( b_1 = b_2 \), therefore \( t_1 = t_2 \). If \( l_1 = E \), then by assumption 5, we know \( n_1 = n_2 \). Therefore \( t_1 = \text{ewrite}(n_1) = \text{ewrite}(n_2) = t_2 \). Finally, if \( l_1 \in \text{ORAMBanks} \), then \( t_1 = l_1 = l_2 = t_2 \).

Now we prove conclusion 1. The only difference between \( M'_i \) and \( M_i \) is the value at \( a_i = (l_i, n_i) \). To show that \( M'_1 \sim_L M'_2 \), we only need to show that if \( l_1 = l_2 = D \), then \( n_1 = n_2 \) and \( b_1 = b_2 \). This follows from assumption 4. Therefore, conclusion 1 holds.

**Case T-LOADW.** If \( I = \text{ldw} \ r_x \leftarrow k[r_y] \), and \( I \) is typed using rule T-LOADW, then we have $pc' = pc'' = 1$. Further, we know

$$(b_i, a_i) = S_i(k) \quad n_i = |R_i(r_y) \mod size(b_i)|$$

$$R'_i = R_i[r_x \mapsto b_i(n_i)] \quad i = 1, 2$$

Since the scratchpad and memories are not changed, conclusions 1, 3, and 5 trivially hold true. Further, since $t_1 = f = t_2$, conclusion 6 is also true. We only need to prove conclusions 2 and 4. For conclusion 2, if $\Upsilon'(r_x) = H$, the conclusion is trivial. If $\Upsilon'(r_x) = L$, then we know $l = D$, and by rule T-LOADW, we know $\Upsilon(r_y) = L$, which implies, by assumption 2, $n_1 = n_2$. Further, since $l = D$, by assumption 3, we know $b_1 = b_2$. Therefore $R'_1(r_x) = b_1(n_1) = b_2(n_2) = R'_2(r_x)$.

For conclusion 4, all we need to show is that for $i = 1, 2$, either $(\ell = H \lor \vdash_{\text{const}} M_i[k, Sym(r_y)]) \land \vdash_{\text{safe}} M_i[k, Sym(r_y)]$ is not true (for instance, when $\ell = L$), or

$$(S_i, M_i[k, Sym(r_y) \mod size(b_i)]) \Downarrow b_i(n_i)$$

holds true. First of all, $\vdash_{\text{const}} M_i[k, Sym(r_y)]$ is not true. W.L.O.G. we suppose $\vdash_{\text{safe}} M_i[k, Sym(r_y)]$ and $\ell = H$; then we know $l = D$ and $\vdash_{\text{safe}} Sym(r_y)$. Therefore we have $(S_i, Sym(r_y) \mod size(b_i)) \Downarrow R_i(r_y) \mod size(b_i) = n_i$. Further, by conclusion 3, we know $\Upsilon'(k) = mem(S_i(k)) = D$. Combining with $b_i = block(S_i(k))$, we have

$$(S_i, M_i[k, Sym(r_y) \mod size(b_i)]) \Downarrow b_i(n_i)$$

**Case T-IDB.** If $I = r \leftarrow \text{idb} \ k$, and $I$ is typed using rule T-IDB, then we have $pc' = pc'' = 1$. Further, we know

$$(b_i, (l_i, n_i)) = S_i(k) \quad R'_i = R_i[r \mapsto n_i] \quad i = 1, 2$$

Conclusions 1, 3, 5, and 6 hold trivially. Conclusions 2 and 4 are implied by assumptions 3 and 5 respectively.

**Case T-STOREW.** If $I = \text{stw} \ r_x \rightarrow k[r_y]$, and $I$ is typed using rule T-STOREW, then we have $pc' = pc'' = 1$. Further, we know

$$(b_i, a_i) = S_i(k) \quad n_i = |R_i(r_y) \mod \text{size}(b_i)|$$

$$S'_i = S_i[k \mapsto (b_i[n_i \mapsto R_i(r_x)], a_i)] \quad i = 1, 2$$

Since registers and memories are not changed, conclusions 1, 2, and 4 hold true. Further, since $t_1 = f = t_2$, we know conclusion 6 is true. We now prove conclusions 3 and 5. Clearly, we only need to prove for $\Upsilon'(k)$ and $Sym(k)$. Suppose $a_i = (l_i, idx_i)$, then we know $\Upsilon'(k) = \Upsilon(k) = l_1 = l_2$. Further, if $\Upsilon'(k) = \Upsilon(k) = D$, then we know $\Upsilon(r_x) = \Upsilon(r_y) = L$, which, by assumption 2, implies $R_1(r_x) = R_2(r_x)$ and $R_1(r_y) = R_2(r_y)$. Therefore $n_1 = n_2$. Since $\Upsilon(k) = D$, by assumption 4, we have $S_1(k) = S_2(k)$, and thus $b_1 = b_2$. Therefore we have $S'_1(k) = (b_1[n_1 \mapsto R_1(r_x)], a_1) = (b_2[n_2 \mapsto R_2(r_x)], a_2) = S'_2(k)$. Finally, if $\Upsilon'(k) = \Upsilon(k) = E$, then by assumption 5, we know $idx_1 = idx(S_1(k)) = idx(S_2(k)) = idx_2$. Then we have $idx(S'_1(k)) = idx_1 = idx_2 = idx(S'_2(k))$. Therefore, conclusion 3 is true.

For conclusion 5, first we suppose $\ell = H$ and $\vdash_{\text{safe}} Sym'(k)$. Since $Sym'(k) = Sym(k)$, we know $\vdash_{\text{safe}} Sym(k)$, and thus by assumption 5, we know for some \(n_1, n_2\), \((S_i, Sym'(k)) \Downarrow n_i\) and \(|n_i \mod size(\Upsilon(k))| = idx_i\). Further, since \(\ell = H\), we know \(slab(\Upsilon(k)) = H\), and thus \(mem(S_i(k)) = l_i \neq D\) for \(i = 1, 2\). Now we prove that if \((S_1, sv) \Downarrow v\), then \((S_1', sv) \Downarrow v\) by induction on \(sv\). If \(sv = M_l[k', sv_1]\), then we know \(mem(S_1(k')) = l = D\), and \((S_1, sv_1) \Downarrow v_1\). Since \(mem(S_1(k)) \neq D\), we know \(k \neq k'\). Therefore we know \(S_1'(k') = S_1(k')\), and by the induction assumption \((S_1', sv_1) \Downarrow v_1\), so we have \((S_1', sv) \Downarrow v\). Next, if \(sv = n\), then the conclusion holds trivially. If \(sv = sv_1 \ aop \ sv_2\), then we know \((S_1, sv_1) \Downarrow v_1\), \((S_1, sv_2) \Downarrow v_2\), and \(v = v_1 \ aop \ v_2\). Therefore we know \((S_1', sv) \Downarrow v\) by induction. Similarly, we can prove that if \((S_2, sv) \Downarrow v\), then \((S_2', sv) \Downarrow v\). Therefore, we know \((S_i', Sym'(k)) \Downarrow n_i\), and \(|n_i \mod size(\Upsilon'(k))| = idx_i = idx(S_i'(k))\), which means conclusion 5 holds true.

Similarly, if \(\vdash_{const} Sym'(k)\) and \(\vdash_{safe} Sym'(k)\), we can also prove conclusion 5 easily.
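
The inner induction on \(sv\) says that overwriting a word in a non-DRAM block cannot change the value of any symbolic expression, because symbolic memory reads only look at DRAM blocks. The Python sketch below replays this on a small example. The tuple encoding of symbolic values, the dictionary representation of a scratchpad, and the names `evaluate` and `store_word` are assumptions made for illustration; they are not taken from the formal semantics.

```python
# Toy evaluator for symbolic values sv ::= n | sv1 aop sv2 | M[k, sv1].
# Assumed encoding: ("const", n), ("aop", op, sv1, sv2), ("mem", k, sv1).
# A scratchpad maps a block ID k to (block, (bank, index)).
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(S, sv):
    """Return v such that (S, sv) evaluates to v, or None if undefined."""
    tag = sv[0]
    if tag == "const":
        return sv[1]
    if tag == "aop":
        v1, v2 = evaluate(S, sv[2]), evaluate(S, sv[3])
        if v1 is None or v2 is None:
            return None
        return OPS[sv[1]](v1, v2)
    if tag == "mem":
        block, (bank, _) = S[sv[1]]
        if bank != "D":          # only DRAM blocks may be read symbolically
            return None
        v1 = evaluate(S, sv[2])
        if v1 is None:
            return None
        return block[abs(v1 % len(block))]
    raise ValueError(f"unknown symbolic value: {sv!r}")

def store_word(S, k, n, value):
    """Model of stw: overwrite word n of block k, leaving other blocks alone."""
    block, addr = S[k]
    new_block = list(block)
    new_block[n] = value
    updated = dict(S)
    updated[k] = (new_block, addr)
    return updated

S = {"k0": ([10, 20, 30], ("D", 0)), "k1": ([7, 8, 9], ("E", 5))}
sv = ("aop", "+", ("mem", "k0", ("const", 4)), ("const", 1))

S_after = store_word(S, "k1", 2, 99)   # k1 is an ERAM block, not DRAM
assert evaluate(S, sv) == evaluate(S_after, sv) == 21
```

Here `sv` reads word `4 mod 3 = 1` of the DRAM block `k0` and adds one; the store to `k1` leaves that result unchanged.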

**Case T-BOP.** If \(I = r_1 \leftarrow r_2 aop r_3\), and \(I\) is typed using rule T-BOP, then we have \(pc' = pc'' = 1\). Further, we know

\[
n_i = R_i(r_2) aop R_i(r_3) \quad R'_i = R_i[r_1 \mapsto n_i] \quad i = 1, 2
\]

Conclusions 1, 3, 5, and 6 are trivial. For conclusion 2, we only need to prove that if \(\Upsilon'(r_1) = l' = \Upsilon(r_2) \sqcup \Upsilon(r_3) = L\), then \(R'_1(r_1) = n_1 = n_2 = R'_2(r_1)\). To see this, since \(\Upsilon(r_2) \sqcup \Upsilon(r_3) = L\), we know \(\Upsilon(r_2) = \Upsilon(r_3) = L\). Therefore by assumption 2, we know \(R_1(r_2) = R_2(r_2)\) and \(R_1(r_3) = R_2(r_3)\). Therefore, we have

\[
n_1 = R_1(r_2) \ aop \ R_1(r_3) = R_2(r_2) \ aop \ R_2(r_3) = n_2
\]

For conclusion 4, we only need to consider $Sym'(r_1)$. If

$$\vdash_{\text{safe}} Sym'(r_1) = Sym(r_2) \ aop \ Sym(r_3)$$

then we know $\vdash_{\text{safe}} Sym(r_2)$ and $\vdash_{\text{safe}} Sym(r_3)$. Further, we know $\ell = H$, or $\vdash_{\text{const}} Sym(r_2)$ and $\vdash_{\text{const}} Sym(r_3)$. Therefore, by assumption 6, we know

$$(S_i, Sym(r_2)) \downarrow R_i(r_2), (S_i, Sym(r_3)) \downarrow R_i(r_3) \text{ for } i = 1, 2$$

Therefore, we have

$$(S_i, Sym'(r_1)) \downarrow R_i(r_2) \ aop \ R_i(r_3) = n_i = R'_i(r_1) \text{ for } i = 1, 2$$

Therefore conclusion 4 holds true.
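
The argument for conclusion 2 rests on the fact that a join is \(L\) only when both operands are \(L\). The following Python sketch illustrates this on concrete register files. The two-point lattice encoded as the strings `"L"` and `"H"`, the dictionary representation of registers, and the helper names `join`, `low_equal`, and `bop` are hypothetical and chosen only for this example.

```python
def join(a, b):
    """Least upper bound in the two-point lattice L <= H."""
    return "H" if "H" in (a, b) else "L"

def low_equal(labels, R1, R2):
    """Registers labelled L must hold the same value in both runs."""
    return all(R1[r] == R2[r] for r, lab in labels.items() if lab == "L")

def bop(labels, R, dst, src1, src2, op):
    """Model of `dst <- src1 aop src2`: update the label and the value."""
    new_labels = dict(labels)
    new_labels[dst] = join(labels[src1], labels[src2])
    new_R = dict(R)
    new_R[dst] = op(R[src1], R[src2])
    return new_labels, new_R

labels = {"r1": "H", "r2": "L", "r3": "L", "r4": "H"}
R1 = {"r1": 0, "r2": 3, "r3": 4, "r4": 100}
R2 = {"r1": 9, "r2": 3, "r3": 4, "r4": 200}
assert low_equal(labels, R1, R2)

add = lambda x, y: x + y

# Both operands are L, so the result is L and must agree across runs.
labels_a, R1_a = bop(labels, R1, "r1", "r2", "r3", add)
_, R2_a = bop(labels, R2, "r1", "r2", "r3", add)
assert labels_a["r1"] == "L" and low_equal(labels_a, R1_a, R2_a)

# One operand is H, so the result is H and is allowed to differ.
labels_b, R1_b = bop(labels, R1, "r1", "r2", "r4", add)
_, R2_b = bop(labels, R2, "r1", "r2", "r4", add)
assert labels_b["r1"] == "H" and low_equal(labels_b, R1_b, R2_b)
```

In both situations low-equivalence of the register files is preserved, which is the content of conclusion 2 for this instruction.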

**Case T-ASSIGN.** If $I = r_1 \leftarrow n$, and $I$ is typed using rule T-ASSIGN, then we have $pc = 0$ and $pc' = pc'' = 1$. Further, we know

$$R'_i = R_i[r_1 \mapsto n] \text{ for } i = 1, 2$$

Similar to the T-BOP case, conclusions 1, 3, 5 and 6 are trivial. For conclusion 2, we know $R'_1(r_1) = R'_2(r_1) = n$. For conclusion 4, we have $(S_i, n) \Downarrow n$ for $i = 1, 2$.

**Case T-NOP.** If $I = \text{nop}$, and $I$ is typed using T-NOP, then this is trivial, because $M_i = M'_i$, $R_i = R'_i$, $S_i = S'_i$ for $i = 1, 2$, and $\Upsilon = \Upsilon'$ and $Sym = Sym'$. 

**Case T-SUB.** If $I$ is typed using T-SUB, then we know

$$\ell \vdash \iota : \langle \Upsilon, Sym \rangle \to \langle \Upsilon'', Sym'' \rangle; T$$

where $\Upsilon'' \preceq \Upsilon'$ and $Sym'' \preceq Sym'$. W.L.O.G., we assume $\ell \vdash \iota : \langle \Upsilon, Sym \rangle \to \langle \Upsilon'', Sym'' \rangle; T$ is not derived by rule T-SUB (since $\preceq$ is transitive). By induction, we have the following results:

1. $M_1' \sim_L M_2'$;

2. $\Upsilon'' \vdash R_1' \sim R_2'$;

3. $\Upsilon'' \vdash S_1' \sim S_2'$;

4. $\forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym''(r)) \land \vdash_{\text{safe}} Sym''(r) \Rightarrow (S_i', Sym''(r)) \Downarrow R_i'(r)$;

5. $\forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym''(k)) \land \vdash_{\text{safe}} Sym''(k) \Rightarrow \exists n. (S_i', Sym''(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon''(k))| = idx(S_i'(k))$;

6. $t_1 \equiv t_2$.

Therefore, conclusions 1 and 6 follow from results 1 and 6. For conclusion 2, if $\Upsilon'(r) = L$, then since $\Upsilon''(r) \sqsubseteq \Upsilon'(r)$, we know $\Upsilon''(r) = L$, and thus $R_1'(r) = R_2'(r)$ by result 2.

Conclusion 3 naturally holds true, since $\Upsilon''(k) = \Upsilon'(k)$. For conclusion 4, since $Sym'' \preceq Sym'$, we know $Sym'(r) = ?$ or $Sym'(r) = Sym''(r)$. In the former case, $\vdash_{\text{safe}} Sym'(r)$ does not hold, and thus conclusion 4 is vacuously true. In the latter case, conclusion 4 directly follows from result 4. Similarly, we can prove conclusion 5 as well.

**Case T-LOOP.** If \( I \) is typed using rule T-LOOP, then we know \( I = I_c; \iota_1; I_b; \iota_2 \), and

\[
\ell \vdash I_c : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle
\]

\[
\ell \vdash I_b : \langle \Upsilon', Sym' \rangle \rightarrow \langle \Upsilon, Sym \rangle
\]

We prove the conclusion by induction on the number of times that instruction \( \iota_1 \) is executed. First, assume \( \iota_1 \) is executed once. Therefore, by Lemma 16, we know that

\[
I_c \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1^1} (R''_1, S''_1, M''_1, |I_c|)
\]

\[
I \vdash (R''_1, S''_1, M''_1, |I_c|) \rightarrow_{t_1^2} (R'_1, S'_1, M'_1, pc')
\]

\[
I_c \vdash (R_2, S_2, M_2, 0) \rightarrow_{t_2^1} (R''_2, S''_2, M''_2, |I_c|)
\]

\[
I \vdash (R''_2, S''_2, M''_2, |I_c|) \rightarrow_{t_2^2} (R'_2, S'_2, M'_2, pc'')
\]

\[
t_1 = t_1^1 @ t_1^2 \quad t_2 = t_2^1 @ t_2^2
\]

By induction assumption, we have

1. \( M''_1 \sim_L M''_2 \);

2. \( \Upsilon \vdash R''_1 \sim R''_2 \);

3. \( \Upsilon \vdash S''_1 \sim S''_2 \);

4. \( \forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}''(r)) \land \vdash_{\text{safe}} \text{Sym}''(r) \Rightarrow (S_i'', \text{Sym}''(r)) \Downarrow R_i''(r) \);

5. \( \forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}''(k)) \land \vdash_{\text{safe}} \text{Sym}''(k) \Rightarrow \exists n. (S_i'', \text{Sym}''(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon''(k))| = idx(S_i''(k)) \);

6. \( t_1^1 \equiv t_2^1 \).

Further, we know \( I(|I_c|) = \iota_1 = \textbf{br} \ r_1 \text{ rop } r_2 \rightarrow n_1 \), where \( \Upsilon(r_1) \sqcup \Upsilon(r_2) \sqsubseteq L \), which implies \( \Upsilon(r_1) = \Upsilon(r_2) = L \). Therefore, we know \( R_1''(r_1) = R_2''(r_1) \) and \( R_1''(r_2) = R_2''(r_2) \), which implies that if

\[
I \vdash (R_i'', S_i'', M_i'', |I_c|) \rightarrow_f (R_i^*, S_i^*, M_i^*, pc_i^*)
\]

then \( pc^*_1 = pc^*_2 \), which is either \(|I_c| + 1\) or \(|I_c| + n_1 = |I|\). If the first case is true, i.e. the branch instruction \( \iota_1 \) is not taken, then for \( i = 1, 2 \) we can show

\[
I_c \vdash (R_i, S_i, M_i, 0) \rightarrow_{t_i^1} (R_i'', S_i'', M_i'', |I_c|)
\]

\[
I \vdash (R_i'', S_i'', M_i'', |I_c|) \rightarrow_f (R_i'', S_i'', M_i'', |I_c| + 1)
\]

\[
I \vdash (R_i'', S_i'', M_i'', |I_c| + 1) \rightarrow_{t_i^2} (R_i''', S_i''', M_i''', |I| - 1)
\]

\[
I \vdash (R_i''', S_i''', M_i''', |I| - 1) \rightarrow_f (R_i''', S_i''', M_i''', 0)
\]

\[
I \vdash (R_i''', S_i''', M_i''', 0) \rightarrow_{t_i^3} (R_i', S_i', M_i', |I|)
\]

Then by the same analysis, we know during \( I \vdash (R_i''', S_i''', M_i''', 0) \rightarrow_{t_i^3} (R_i', S_i', M_i', |I|) \), the instruction \( \iota_1 \) will be executed at least once, and thus it will be executed at least twice in total, which contradicts our assumption that \( \iota_1 \) is executed only once. Therefore, we know \( pc_1^* = pc_2^* = |I| \). Therefore, we know \( t_1 = t_1^1 @ f \equiv t_2^1 @ f = t_2 \), and \( R_i' = R_i'', S_i' = S_i'', M_i' = M_i'' \) \((i = 1, 2)\). In this case, conclusions 1-6 all hold true.

Next, we assume the conclusions hold true whenever the number of times that \( \iota_1 \) is executed is less than \( u \) (with \( u > 1 \)), and we consider the case when \( \iota_1 \) is executed \( u \) times. By the same analysis as above, we know, for \( i = 1, 2 \),

\[
I_c \vdash (R_i, S_i, M_i, 0) \rightarrow_{t_i^1} (R_i'', S_i'', M_i'', |I_c|)
\]

\[
I \vdash (R_i'', S_i'', M_i'', |I_c|) \rightarrow_f (R_i'', S_i'', M_i'', |I_c| + 1)
\]

\[
I \vdash (R_i'', S_i'', M_i'', |I_c| + 1) \rightarrow_{t_i^2} (R_i''', S_i''', M_i''', |I| - 1)
\]

\[
I \vdash (R_i''', S_i''', M_i''', |I| - 1) \rightarrow_f (R_i''', S_i''', M_i''', 0)
\]

\[
I \vdash (R_i''', S_i''', M_i''', 0) \rightarrow_{t_i^3} (R_i', S_i', M_i', |I|)
\]

where the branch instruction \( \iota_1 \) is not taken in the second step, and

\[
t_1 = t_1^1 @ f @ t_1^2 @ f @ t_1^3 \quad t_2 = t_2^1 @ f @ t_2^2 @ f @ t_2^3
\]

By induction assumption, we know

1. $M_1'' \sim_L M_2''$, $M_1''' \sim_L M_2'''$, and $M_1' \sim_L M_2'$;

2. $\Upsilon \vdash R_1'' \sim R_2''$, $\Upsilon \vdash R_1''' \sim R_2'''$, and $\Upsilon \vdash R_1' \sim R_2'$;

3. $\Upsilon \vdash S_1'' \sim S_2''$, $\Upsilon \vdash S_1''' \sim S_2'''$, and $\Upsilon \vdash S_1' \sim S_2'$;

4. $\forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym''(r)) \land \vdash_{\text{safe}} Sym''(r) \Rightarrow (S_i'', Sym''(r)) \Downarrow R''_i(r)$;

5. $\forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym'''(r)) \land \vdash_{\text{safe}} Sym'''(r) \Rightarrow (S_i''', Sym'''(r)) \Downarrow R'''_i(r)$;

6. $\forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym'(r)) \land \vdash_{\text{safe}} Sym'(r) \Rightarrow (S_i', Sym'(r)) \Downarrow R'_i(r)$;

7. $\forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym''(k)) \land \vdash_{\text{safe}} Sym''(k) \Rightarrow \exists n. (S_i'', Sym''(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon''(k))| = idx(S_i''(k))$;

8. $\forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym'''(k)) \land \vdash_{\text{safe}} Sym'''(k) \Rightarrow \exists n. (S_i''', Sym'''(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon'''(k))| = idx(S_i'''(k))$;

9. $\forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym'(k)) \land \vdash_{\text{safe}} Sym'(k) \Rightarrow \exists n. (S_i', Sym'(k)) \Downarrow n \land |n \mod \text{size}(\Upsilon'(k))| = idx(S_i'(k))$;

10. $t_1^1 \equiv t_2^1$, $t_1^2 \equiv t_2^2$, and $t_1^3 \equiv t_2^3$.

Then conclusions 1-5 are true by the above results, and for conclusion 6, we have

\[ t_1 = t_1^1 @ f @ t_1^2 @ f @ t_1^3 \equiv t_2^1 @ f @ t_2^2 @ f @ t_2^3 = t_2. \]

**Case T-IF.** Then \( I = \iota_1; I_t; \iota_2; I_f \), where \( \iota_1 = \textbf{br } r_1 \text{ rop } r_2 \rightarrow n_1 \). Consider \( \ell' \).

If \( \ell' = L \), then we know \( \Upsilon(r_1) = \Upsilon(r_2) = L \). In this case, by assumption 2, we know \( R_1(r_1) = R_2(r_1) \) and \( R_1(r_2) = R_2(r_2) \). Therefore the program counter goes to the same value in both cases. Formally speaking, we have
\[
I \vdash \langle R_i, S_i, M_i, 0 \rangle \rightarrow_f \langle R_i, S_i, M_i, pc_i \rangle \quad i = 1, 2
\]
where \( pc_1 = pc_2 \), which is either 1, or \( n_1 = |I_t| + 2 \). If \( pc_1 = 1 \), then by Lemma 16, we know
\[
I \vdash \langle R_i, S_i, M_i, 1 \rangle \rightarrow_{t_i'} \langle R_i'', S_i'', M_i'', |I_t| + 1 \rangle \quad i = 1, 2
\]
Further, since \( I(|I_t| + 1) = \iota_2 = \textbf{jmp } n_2 \), we know
\[
I \vdash \langle R_i'', S_i'', M_i'', |I_t| + 1 \rangle \rightarrow_f \langle R_i'', S_i'', M_i'', |I| \rangle
\]
for \( i = 1, 2 \), which implies \( R_i'' = R_i', S_i'' = S_i', M_i'' = M_i' \), and \( t_i = f@t_i'@f \). It is easy to prove, by induction assumption, that conclusions 1-5 hold true, and for 6, we have
\[
t_1 = f@t_1'@f \equiv f@t_2'@f = t_2
\]
Similarly, if \( pc_1 = pc_2 = n_1 \), we can also prove the conclusions. Intuitively, this means if \( \ell' = L \), then the branching statement in this if-block will go to the same branch, and we can prove the theorem.

Figure B.2: Symbolic Execution in \( \mathcal{L}_{\text{GhostRider}} \)

Next, we consider when \( \ell' = H \). If the branching statement goes to the same \( pc \), then based on the above discussion, we know the conclusions hold true. Without loss of generality, we can consider when the branching instruction goes to different \( pc \).

We first study the relationship between a trace and a trace pattern, in order to prove conclusion 6, i.e. \( t_1 \equiv t_2 \). To this end, we define a new notion \( \langle S, M, T \rangle \rightarrow_t \langle S', M' \rangle \) as in Figure B.2.

Here, we use a special value \( \star \) to indicate a special block whose content we do not care about. By doing so, we can use only the DRAM part of a memory \( M \) to make this definition. It is easy to see that evaluating a symbolic value requires only the scratchpad and the memory corresponding to DRAM. We define the DRAM projection of the scratchpad and the memory as follows:

\[
S_D(k) = \begin{cases} 
(b, (D, n)) & S(k) = (b, (D, n)) \\
(\star, (E, n)) & S(k) = (b, (E, n)) \\
\text{undefined} & \text{otherwise}
\end{cases}
\]

Actually, the DRAM projection of the scratchpad does not contain only DRAM blocks, but also the index information about ERAM blocks. All of this information is assumed to be public.

\[
M_D(l, n) = \begin{cases} 
M(l, n) & l = D \\
\text{undefined} & \text{otherwise}
\end{cases}
\]

It is easy to see, if \(\langle S, M, T \rangle \rightarrow_t \langle S', M' \rangle\), then we also have

\(\langle S_D, M_D, T \rangle \rightarrow_t \langle S'_D, M'_D \rangle\)

It is worth mentioning that two memories are low-equivalent, \(M_1 \sim_L M_2\), if and only if their DRAM projections are equal, i.e. \(M_D^1 = M_D^2\), where \(M_D^i\) denotes the DRAM projection of \(M_i\).
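
The two projections can be written down directly. The Python sketch below is a minimal illustration under assumed encodings: scratchpads and memories are dictionaries, banks are the strings `"D"`, `"E"`, or an ORAM bank name, and `STAR` stands for \( \star \). The function names are hypothetical, and `low_equivalent` simply takes the characterization of \( \sim_L \) stated above as its definition.

```python
STAR = "*"  # stands for the special block whose content is ignored

def project_scratchpad(S):
    """S_D: keep DRAM blocks, mask ERAM block contents, drop ORAM blocks."""
    S_D = {}
    for k, (b, (l, n)) in S.items():
        if l == "D":
            S_D[k] = (b, (l, n))
        elif l == "E":
            S_D[k] = (STAR, (l, n))
    return S_D

def project_memory(M):
    """M_D: keep only the DRAM part of the memory."""
    return {(l, n): b for (l, n), b in M.items() if l == "D"}

def low_equivalent(M1, M2):
    """Low-equivalence, modelled as equality of the DRAM projections."""
    return project_memory(M1) == project_memory(M2)

S = {
    "k0": ("pub", ("D", 3)),
    "k1": ("secret", ("E", 7)),
    "k2": ("hidden", ("o1", 1)),
}
assert project_scratchpad(S) == {
    "k0": ("pub", ("D", 3)),
    "k1": (STAR, ("E", 7)),   # content masked, index kept
}

M1 = {("D", 0): "a", ("E", 0): "x", ("o1", 0): "p"}
M2 = {("D", 0): "a", ("E", 0): "y", ("o1", 0): "q"}
M3 = {("D", 0): "b", ("E", 0): "x", ("o1", 0): "p"}
assert low_equivalent(M1, M2)
assert not low_equivalent(M1, M3)
```

Note that the ERAM index `7` survives the projection while the ERAM content does not, matching the remark that only the index information is treated as public.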

Based on these rules, we can prove the following lemmas.

**Lemma 17.** Given \(S, M, S', M', S'', M'', T_1, T_2\), if \(T_1 \equiv T_2\), and

\(\langle S, M, T_1 \rangle \rightarrow_{t_1} \langle S', M' \rangle\)

\( \langle S, M, T_2 \rangle \rightarrow_{t_2} \langle S'', M'' \rangle \)

then we have

\[
\begin{align*}
t_1 & \equiv t_2 \\
S' & = S'' \\
M' & = M''
\end{align*}
\]

**Proof.** We consider how \( T_1 \equiv T_2 \) is derived.

**Case 1.** \( T_1 = \text{read}(l, k, sv_1) \) and \( T_2 = \text{read}(l, k, sv_2) \). Then we know \( \vdash_{\text{safe}} sv_1 \) and \( sv_1 = sv_2 \). Therefore we know if \( (S, sv_1) \Downarrow n_1 \) and \( (S, sv_2) \Downarrow n_2 \), then \( n_1 = n_2 \). If \( l = D \), then we know \( b = M(l, n_1) = M(l, n_2) \), and thus \( t_1 = \text{read}(n_1, b) = \text{read}(n_2, b) = t_2 \). Further \( S' = S[k \mapsto (b, (l, n_1))] = S[k \mapsto (b, (l, n_2))] = S'' \), and \( M' = M = M'' \). If \( l = E \), then \( t_1 = \text{eread}(n_1) = \text{eread}(n_2) = t_2 \). Further \( S' = S[k \mapsto (\star, (l, n_1))] = S[k \mapsto (\star, (l, n_2))] = S'' \) and \( M' = M'' \) is trivial. It is impossible for \( l \) to be an ORAM bank.

**Case 2.** \( T_1 = \text{write}(l, k, sv_1) \) and \( T_2 = \text{write}(l, k, sv_2) \). If \( l = D \) or \( l = E \), then \( S(k) = (b, (l, n)) \) (where \( b = \star \) if \( l = E \)), and \( t_1 = \text{write}(n, b) = t_2 \). Further \( M' = M[(l, n) \mapsto b] = M'' \), and \( S' = S = S'' \). It is impossible for \( l \) to be an ORAM bank.

**Case 3.** \( T_1 = T_2 = o \) or \( T_1 = T_2 = F \). Then the conclusion is trivial.

**Case 5.** \( T_1 = T_x @ (T_y @ T_z) \) and \( T_2 = (T_x @ T_y) @ T_z \). In this case, we know

\[
\langle S, M, T_x \rangle \rightarrow_{t_x} \langle S^2, M^2 \rangle
\]
\[ \langle S^2, M^2, T_y @ T_z \rangle \rightarrow_{t_y @ t_z} \langle S', M' \rangle \]

Further, we have

\[ \langle S^2, M^2, T_y \rangle \rightarrow_{t_y} \langle S^3, M^3 \rangle \]

\[ \langle S^3, M^3, T_z \rangle \rightarrow_{t_z} \langle S', M' \rangle \]

Therefore we have \( t_1 = t_x @ (t_y @ t_z) \). Further, we know

\[ \langle S, M, T_x \rangle \rightarrow_{t'_x} \langle S^*, M^* \rangle \]

\[ \langle S^*, M^*, T_y \rangle \rightarrow_{t'_y} \langle S^{**}, M^{**} \rangle \]

\[ \langle S^{**}, M^{**}, T_z \rangle \rightarrow_{t'_z} \langle S'', M'' \rangle \]

By induction assumption, we know \( S^* = S^2, M^* = M^2, t_x \equiv t'_x \). Further, by induction assumption again, we have \( S^{**} = S^3, M^{**} = M^3, t_y \equiv t'_y \). Finally, by applying induction assumption once more, we have \( S' = S'', M' = M'' \), and \( t_z \equiv t'_z \). Therefore we know

\[ t_1 = t_x @ (t_y @ t_z) \equiv (t_x @ t_y) @ t_z \equiv (t_x' @ t_y') @ t_z' = t_2 \]

**Case 6.** \( T_1 = T_x @ T_y \) and \( T_2 = T_x' @ T_y' \), where \( T_x \equiv T'_x \) and \( T_y \equiv T'_y \). Then we know

\[ \langle S, M, T_x \rangle \rightarrow_{t_x} \langle S^1, M^1 \rangle \]
\[ \langle S^1, M^1, T_y \rangle \rightarrow_{t_y} \langle S', M' \rangle \]

and

\[ \langle S, M, T'_x \rangle \rightarrow_{t'_x} \langle S^2, M^2 \rangle \]

\[ \langle S^2, M^2, T'_y \rangle \rightarrow_{t'_y} \langle S'', M'' \rangle \]

Since \( T_x \equiv T'_x \), we know by induction assumption, \( t_x \equiv t'_x \), \( S^1 = S^2 \), and \( M^1 = M^2 \).

Then, since \( T_y \equiv T'_y \), applying induction assumption again, we have \( t_y \equiv t'_y \), \( S' = S'' \), and \( M' = M'' \).

\[ \square \]
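
The reassociation case of this proof can be replayed on a small interpreter for trace patterns. The rules of Figure B.2 are not reproduced in this section, so the Python sketch below reconstructs them from the cases of the proof and should be read as an assumption, not as the figure itself. In particular, symbolic indices are simplified to integer constants, ERAM writes are assumed to emit `ewrite(n)` as in the T-STORE case, and all names (`run`, `STAR`, the tuple encodings) are hypothetical.

```python
STAR = "*"  # placeholder block for ERAM content

def run(S, M, T):
    """Execute trace pattern T from (S, M); return (trace, S', M').

    Assumed pattern encoding: ("read", l, k, n), ("write", l, k),
    ("bank", o), ("F",), and ("seq", T1, T2) for concatenation.
    """
    tag = T[0]
    if tag == "seq":
        t1, S1, M1 = run(S, M, T[1])
        t2, S2, M2 = run(S1, M1, T[2])
        return t1 + t2, S2, M2
    if tag == "read":
        _, l, k, n = T
        if l == "D":
            b = M[(l, n)]
            event = ("read", n, b)
        else:                      # l == "E": content is masked
            b = STAR
            event = ("eread", n)
        S2 = dict(S)
        S2[k] = (b, (l, n))
        return [event], S2, M
    if tag == "write":
        _, l, k = T
        b, (_, n) = S[k]
        event = ("write", n, b) if l == "D" else ("ewrite", n)
        M2 = dict(M)
        M2[(l, n)] = b
        return [event], S, M2
    if tag in ("F", "bank"):
        return [T], S, M
    raise ValueError(f"unknown pattern: {T!r}")

M = {("D", 0): "b0", ("D", 1): "b1"}
S = {}
Tx = ("read", "D", "k0", 0)
Ty = ("read", "E", "k1", 5)
Tz = ("write", "D", "k0")

left = run(S, M, ("seq", Tx, ("seq", Ty, Tz)))
right = run(S, M, ("seq", ("seq", Tx, Ty), Tz))
assert left == right   # same trace, same final scratchpad and memory
```

Because traces are modelled as flat lists here, the two groupings produce identical traces, which is a special case of \( t_1 \equiv t_2 \).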

**Lemma 18.** Given a program \( I \), register mappings \( R, R' \), scratchpads \( S, S' \), and memories \( M, M' \), if

\[ \ell \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T \]

where \( \ell = H \), \( T \equiv T' \) (for some \( T' \)), \( R, R', S, S', M, M' \) satisfy assumptions 1-7 in Theorem 6 and

\[ I \vdash \langle R, S, M, 0 \rangle \rightarrow_t \langle R', S', M', |I| \rangle \]

then we have

\[ \langle S_D, M_D, T \rangle \rightarrow_t \langle S'_D, M'_D \rangle \]

**Proof.** We prove by induction on how \( I \) is typed.

**Case T-SEQ.** We know \( I = I_1; I_2 \), and

\[ H \vdash I_1 : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon'', \text{Sym}'' \rangle; T_1 \]
\[
H \vdash I_2 : \langle \Upsilon'', \text{Sym}'' \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle; T_2
\]

where \( T = T_1 \odot T_2 \) Based on the discussion above, we know

\[
I_1 \vdash \langle R, S, M, 0 \rangle \rightarrow_{t_1} \langle R'', S'', M'', |I_1| \rangle
\]

\[
I_2 \vdash \langle R'', S'', M'', 0 \rangle \rightarrow_{t_2} \langle R', S', M', |I_2| \rangle
\]

where \( t = t_1 \odot t_2 \). Therefore, by induction assumption, we know

\[
\langle S_D, M_D, T_1 \rangle \rightarrow_{t_1} \langle S''_{D}, M''_D \rangle
\]

\[
\langle S''_D, M''_D, T_2 \rangle \rightarrow_{t_2} \langle S'_D, M'_D \rangle
\]

Therefore, we have

\[
\langle S_D, M_D, T_1 \odot T_2 \rangle \rightarrow_{t_1 \odot t_2} \langle S'_D, M'_D \rangle
\]

**Case T-LOAD.** \( I = \text{ldb} \ k \leftarrow l[r] \). If \( l = D \), then \( T = \text{read}(l, k, sv) \), where \( sv = \text{Sym}(r) \). Suppose \( b = M(l, n) \), where \( n = R(r) \). By assumption 2, we know \((S, sv) \Downarrow n\), therefore \((S_D, sv) \Downarrow n\) holds as well. Further, \( S' = S[k \mapsto (b, (l, n))] \), and thus \( S'_D = S_D[k \mapsto (b, (l, n))] \). We also know \( M'_D = M_D \), and \( t = \text{read}(n, b) \). In conclusion, we have

\[
\langle S_D, M_D, T \rangle \rightarrow_{t} \langle S'_D, M'_D \rangle
\]

If \( l = E \), then we know \( T = \text{read}(l, k, sv) \) as well, where \( sv = \text{Sym}(r) \). By assumption 2, we have \((S_D, sv) \Downarrow n\), where \(n = R(r)\). Combining with \(t = \text{eread}(n)\), \(M'_D = M_D\), and \(S'_D = S_D[k \mapsto (\star, (l, n))]\), we get our conclusion.

If \(l \in \text{ORAMBanks}\), then we know \(T = l\), and \(t = l\). Combining with 
\(M'_D = M_D\), and \(S'_D = S_D\), the conclusion is trivial.

**Case T-STORE.** \(I = \text{stb } k\). Suppose \(S(k) = (b, (l, n))\). If \(l = D\) or \(l = E\), 
then \(S_D(k) = (b, (l, n))\) (where \(b = \star\) if \(l = E\)), \(T = \text{write}(l, k, sv)\), \(t = \text{write}(n, b)\), 
and \(M'_D = M_D[(l, n) \mapsto b]\). Therefore, we can get our conclusion.

If \(l \in \text{ORAMBanks}\), similar to the T-LOAD case, we can get our conclusion.
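The T-LOAD and T-STORE cases describe how a trace pattern is evaluated over the projected scratchpad and memory. The following is a minimal sketch of that evaluation, assuming a simplified model: the scratchpad and memory are Python dictionaries, a symbolic value `sv` is a hypothetical callable that maps the scratchpad to an address, and the write pattern carries only the block id. The function `step` and the tuple encodings are illustrative assumptions, not the formal semantics.

```python
STAR = "*"  # stands for an encrypted block whose content is hidden


def step(S, M, pattern):
    """Evaluate one trace pattern over (S, M); return (event, S', M')."""
    kind = pattern[0]
    if kind == "F":                      # fetch pattern emits the fetch event
        return "f", S, M
    if kind == "oram":                   # ORAM access reveals only the bank
        return pattern[1], S, M
    if kind == "read":
        _, l, k, sv = pattern
        n = sv(S)                        # assumed: sv evaluates to an address
        if l == "D":
            b = M[(l, n)]
            event = ("read", n, b)
        else:                            # l == "E": content is not observable
            b = STAR
            event = ("eread", n)
        S2 = dict(S)
        S2[k] = (b, (l, n))              # block k now caches address (l, n)
        return event, S2, M
    if kind == "write":
        _, k = pattern
        b, (l, n) = S[k]
        M2 = dict(M)
        M2[(l, n)] = b                   # write the cached block back
        return ("write", n, b), S, M2
    raise ValueError(f"unknown pattern: {pattern!r}")


S0 = {"k0": ("old", ("D", 0))}
M0 = {("D", 0): "old", ("D", 7): "blk7"}
ev1, S1, M1 = step(S0, M0, ("read", "D", "k0", lambda S: 7))
ev2, S2, M2 = step(S1, M1, ("write", "k0"))
ev3, S3, M3 = step(S2, M2, ("read", "E", "k0", lambda S: 4))
```

A load from `D` exposes both the address and the block, a load from `E` exposes only the address, and the fetch pattern always yields the event `f`.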

**Case T-STOREW.** \(I = \text{stw } r_1 \rightarrow k[r_2]\). Since \(\ell = H\), we know \(\text{slab}(k) = H\), 
which implies \(k = E\) or \(k \in \text{ORAMBanks}\). Therefore \(S'_D = S_D\) and \(M'_D = M_D\).

Further, since \(T = F\) and \(t = f\), we know

\[
\langle S_D, M_D, T \rangle \rightarrow_t \langle S'_D, M'_D \rangle
\]

**Case T-LOADW, T-IDB, T-BOP, T-ASSIGN and T-NOP.** In all these rules, \(T = F\), and \(t = f\). Further, it is easy to see that \(S' = S\) and \(M' = M\) in all these rules. Therefore, the conclusion

\[
\langle S_D, M_D, T \rangle \rightarrow_t \langle S'_D, M'_D \rangle
\]

holds true.

**Case T-UP.** The conclusion is trivial by induction assumption.

**Case T-LOOP.** This is impossible, since T-LOOP requires \(\ell = L\).

**Case T-IF.** $I = \iota_1; I_t; \iota_2; I_f$, where $\iota_1 = \text{br } r_1 \text{ rop } r_2 \rightarrow n_1$ and $\iota_2 = \text{jmp } n_2$.

Depending on the values of $R(r_1)$ and $R(r_2)$, execution jumps to one of the two branches.

If the true branch is taken, then we know $t = f@t_1@f$, where

$$I_t \vdash (R, S, M, 0) \rightarrow_{t_1} (R', S', M', |I_t|)$$

Since $I_t$ is typable, we know, by induction assumption,

$$\langle S_D, M_D, T_t \rangle \rightarrow_{t_1} \langle S'_D, M'_D \rangle$$

Since

$$\langle S_D, M_D, F \rangle \rightarrow_{f} \langle S_D, M_D \rangle$$

$$\langle S'_D, M'_D, F \rangle \rightarrow_{f} \langle S'_D, M'_D \rangle$$

and $T = F@t_1@F$, we can derive our conclusion.

If the false branch is taken, then we know $t = f@t_2$, where

$$I_f \vdash (R, S, M, 0) \rightarrow_{t_2} (R', S', M', |I_f|)$$

Therefore, we know

$$\langle S_D, M_D, T_f \rangle \rightarrow_{t_2} \langle S'_D, M'_D \rangle$$
Further, since $T_1 @ F \equiv T_2$, by Lemma 17, we have $t_1 @ f \equiv t_2$, which implies

$$f @ t_1 @ f \equiv f @ t_2$$

Therefore, we have

$$\langle S_D, M_D, T \rangle \rightarrow_t \langle S'_D, M'_D \rangle$$

□

We now return to the proof of Theorem 2 for the T-IF rule. Recall that $I = \iota_1; I_t; \iota_2; I_f$.

W.L.O.G, we suppose

$$I \vdash (R_1, S_1, M_1, 0) \rightarrow_{f} (R_1, S_1, M_1, 1)$$

$$I_t \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_t} (R'_1, S'_1, M'_1, |I_t|)$$

$$I \vdash (R'_1, S'_1, M'_1, |I_t| + 1) \rightarrow_{f} (R'_1, S'_1, M'_1, |I|)$$

and

$$I \vdash (R_2, S_2, M_2, 0) \rightarrow_{f} (R_2, S_2, M_2, |I_t| + 2)$$

$$I_f \vdash (R_2, S_2, M_2, 0) \rightarrow_{t_f} (R'_2, S'_2, M'_2, |I_f|)$$

By Lemma 18, we know

$$\langle S_{1D}, M_{1D}, T_1 \rangle \rightarrow_{t_t} \langle S'_{1D}, M'_{1D} \rangle$$

\[ \langle S_{2D}, M_{2D}, T_2 \rangle \rightarrow_{t_f} \langle S'_{2D}, M'_{2D} \rangle \]

Since \( \langle S'_{1D}, M'_{1D}, F \rangle \rightarrow_{f} \langle S'_{1D}, M'_{1D} \rangle \), we have

\[ \langle S_{1D}, M_{1D}, T_1 \circ F \rangle \rightarrow_{t_t \circ f} \langle S'_{1D}, M'_{1D} \rangle \]

Further, by assumptions 1 and 2, we know \( S_{1D} = S_{2D} \) and \( M_{1D} = M_{2D} \). By Lemma 17, we know

\[ t_t \circ f \equiv t_f \]

\[ S'_{1D} = S'_{2D} \]

\[ M'_{1D} = M'_{2D} \]

Therefore, we have \( t_1 = f \circ t_t \circ f \equiv f \circ t_f = t_2 \), which is conclusion 6. Further, the last two assertions show that conclusions 1 and 3 are true. So we only need to prove conclusions 2, 4, and 5.

The first step is to prove conclusion 4. If \( \vdash_{safe} Sym(r) \) is not true, then conclusion 4 holds vacuously. Otherwise, since \( \ell' = H \), we know either \( \ell = \text{H} \) or \( \vdash_{const} Sym(r) \). In either case, by assumption 4, we know \( (S_i, Sym(r)) \Downarrow R_i(r) \) for \( i = 1, 2 \).

Then, by induction, we conclude that conclusion 4 holds true: if \( (\ell' = H \lor \vdash_{const} Sym'(r)) \land \vdash_{safe} Sym'(r) \), then \( (S'_i, Sym'(r)) \Downarrow R'_i(r) \) (\( i = 1, 2 \)). Since \( (\ell = H \lor \vdash_{const} Sym'(r)) \land \vdash_{safe} Sym'(r) \) implies \( (\ell' = H \lor \vdash_{const} Sym'(r)) \land \vdash_{safe} Sym'(r) \), we know conclusion 4 is true. We can prove conclusion 5 similarly.

The next step is to prove conclusion 2. Suppose \( \Upsilon(r) = L \). Then by T-IF (since $\ell' = H$), we know $\vdash_{\text{safe}} Sym(r)$ and $\vdash_{\text{const}} Sym(k)$. Therefore, by conclusion 4, we know $(S'_{1D}, Sym(r)) \Downarrow R_1(r)$, and $(S'_{2D}, Sym(r)) \Downarrow R_2(r)$. However, since $S'_{1D} = S'_{2D}$, we know $R_1(r) = R_2(r)$. This means conclusion 4 implies conclusion 2.

\[ \square \]

**Theorem 7.** Given a program $I$ in $\mathcal{L}_T$, such that $\ell \vdash I : (\Upsilon, Sym) \rightarrow (\Upsilon', Sym'); T$, two memories $M_1, M_2$, two register mappings $R_1, R_2$, and two scratchpad mappings $S_1, S_2$, if the following assumptions are satisfied:

1. $M_1 \sim_L M_2$
2. $\Upsilon \vdash R_1 \sim R_2$
3. $\Upsilon \vdash S_1 \sim S_2$
4. $\forall r \in \text{Registers}, i \in \{1, 2\}.(\ell = H \lor \vdash_{\text{const}} Sym(r)) \land \vdash_{\text{safe}} Sym(r) \Rightarrow (S_i, Sym(r)) \Downarrow R_i(r)$;
5. $\forall k \in \text{BlockIDs}, i \in \{1, 2\}.(\ell = H \lor \vdash_{\text{const}} Sym(k)) \land \vdash_{\text{safe}} Sym(k) \Rightarrow \exists n.(S_i, Sym(k)) \Downarrow n \land |n \bmod \text{size}(\Upsilon(k))| = idx(S_i(k))$.

and

\[ I \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1} (R'_{1}, S'_1, M'_1, pc') \]

\[ I \vdash (R_2, S_2, M_2, 0) \rightarrow_{t_2} (R'_{2}, S'_2, M'_2, pc'') \]

If $|t_1| = |t_2|$, then we have the following conclusions:

1. $M'_1 \sim_L M'_2$
2. $\Upsilon' \vdash R'_1 \sim R'_2$

3. $\Upsilon' \vdash S'_1 \sim S'_2$

4. $\forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym'(r)) \land \vdash_{\text{safe}} Sym'(r) \Rightarrow (S'_i, Sym'(r)) \Downarrow R'_i(r)$;

5. $\forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} Sym'(k)) \land \vdash_{\text{safe}} Sym'(k) \Rightarrow \exists n. (S'_i, Sym'(k)) \Downarrow n \land |n \bmod \text{size}(\Upsilon'(k))| = idx(S'_i(k))$.

6. $t_1 \equiv t_2$

**Proof.** Next, we prove the non-terminating case. Again, we suppose

$$\ell \vdash I: \langle \Upsilon, Sym \rangle \to \langle \Upsilon', Sym' \rangle; T$$

and

$I \vdash \langle R_1, S_1, M_1, 0 \rangle \rightarrow_{t_1} \langle R'_1, S'_1, M'_1, pc_1 \rangle$

$I \vdash \langle R_2, S_2, M_2, 0 \rangle \rightarrow_{t_2} \langle R'_2, S'_2, M'_2, pc_2 \rangle$

where $|t_1| = |t_2|$. We proceed by induction on the derivation of $\ell \vdash I: \langle \Upsilon, Sym \rangle \to \langle \Upsilon', Sym' \rangle; T$. If it is derived by applying one of the rules T-LOAD, T-STORE, T-LOADW, T-STOREW, T-BOP, T-ASSIGN, T-NOP, then we know $|t_1| = |t_2| = 1$, and the conclusion follows directly from Theorem 6. If it is derived by applying rule T-UP, then the conclusion is trivial. So we only need to consider rules T-IF, T-LOOP, and T-SEQ.

**Case T-SEQ.** Suppose $I = I_1; I_2$. We prove by contradiction. Without loss of generality, we suppose $|t_1| = |t_2|$ is minimal such that $t_1 \not \equiv t_2$ or $M'_1 \not \sim_L M'_2$.

There are two sub cases:

$I_1 \vdash (R_1, S_1, M_1, 0) \rightarrow_{t'_1} (R''_1, S''_1, M''_1, |I_1|)$

$I_2 \vdash (R''_1, S''_1, M''_1, 0) \rightarrow_{t''_1} (R'_1, S'_1, M'_1, pc_1 - |I_1|)$

where $pc_1 > |I_1|$ and $t_1 = t'_1@t''_1$; or

$I_1 \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1} (R''_1, S''_1, M''_1, pc_1)$

In the first case, since $|t_1|$ is assumed to be minimal, we know

$I_1 \vdash (R_2, S_2, M_2, 0) \rightarrow_{t'_2} (R''_2, S''_2, M''_2, |I_1|)$

$I_2 \vdash (R''_2, S''_2, M''_2, 0) \rightarrow_{t''_2} (R'_2, S'_2, M'_2, pc_2 - |I_1|)$

where $pc_2 > |I_1|$. Then by induction assumption, we can prove that $t'_1 \equiv t'_2$ and $t''_1 \equiv t''_2$, and thus $t_1 \equiv t_2$. In the second case, by induction assumption, we directly prove that $t_1 \equiv t_2$.

**Case T-LOOP.** Suppose $I = I_c; \iota_1; I_b; \iota_2$. Since $I_c$ is typable, by Lemma 16, we know that one of the following two cases happens:

$I_c \vdash (R_1, S_1, M_1, 0) \rightarrow_{t'_1} (R''_1, S''_1, M''_1, |I_c|)$
\[ I \vdash (R''_1, S''_1, M''_1, |I_c|) \rightarrow_{t_1} (R'_1, S'_1, M'_1, pc') \]

or

\[ I_c \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1} (R_1, S_1, M_1', pc') \]

where \( pc' \leq |I_c| \).

In the latter case, the conclusion follows by induction assumption. For the former case, we know

\[ I_c \vdash (R_1, S_1, M_1, 0) \rightarrow_{t_1^1} (R''_1, S''_1, M''_1, |I_c|) \]

\[ I \vdash (R''_1, S''_1, M''_1, |I_c|) \rightarrow_{t_1^2} (R'_1, S'_1, M'_1, pc') \]

Similarly, we have

\[ I_c \vdash (R_2, S_2, M_2, 0) \rightarrow_{t_2^1} (R''_2, S''_2, M''_2, |I_c|) \]

\[ I \vdash (R''_2, S''_2, M''_2, |I_c|) \rightarrow_{t_2^2} (R'_2, S'_2, M'_2, pc'') \]

By Theorem 6, we have

1. \( M''_1 \sim_L M''_2 \);
2. \( \Upsilon \vdash R''_1 \sim R''_2 \);
3. \( \Upsilon \vdash S''_1 \sim S''_2 \);
4. \( \forall r \in \text{Registers}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}''(r)) \land \vdash_{\text{safe}} \text{Sym}''(r) \Rightarrow (S''_i, \text{Sym}''(r)) \Downarrow R''_i(r) \);
5. \( \forall k \in \text{BlockIDs}, i \in \{1, 2\}. (\ell = H \lor \vdash_{\text{const}} \text{Sym}''(k)) \land \vdash_{\text{safe}} \text{Sym}''(k) \Rightarrow \exists n. (S''_i, \text{Sym}''(k)) \Downarrow n \land |n \bmod \text{size}(\Upsilon(k))| = \text{idx}(S''_i(k)) \);
6. \(t_1^1 \equiv t_2^1\).

Further, we know \(I(|I_c|) = \iota_1 = \text{br } r_1 \text{ rop } r_2 \rightarrow n_1,\) where \(\Upsilon(r_1) \sqcup \Upsilon(r_2) \sqsubseteq L,\) which implies \(\Upsilon(r_1) = \Upsilon(r_2) = L.\) Therefore, we know \(R''_1(r_1) = R''_2(r_1)\) and \(R''_1(r_2) = R''_2(r_2),\) which implies if

\[I \vdash (R''_i, S''_i, M''_i, |I_c|) \rightarrow_{f} (R''_i, S''_i, M''_i, pc^*_i)\]

then \(pc^*_1 = pc^*_2,\) which is either \(|I_c| + 1\) or \(|I_c| + n_1 = |I|.\) In the latter case, we know \(t_1 = t_1^1@f \equiv t_2^1@f = t_2,\) and \(R''_i = R_i, S''_i = S_i, M''_i = M_i (i = 1, 2).\) In this case, conclusions 1-6 all hold true.

In the former case, there are still two sub cases: (1)

\[I_b \vdash (R''_1, S''_1, M''_1, 0) \rightarrow_{t_1^2} (R'_1, S'_1, M'_1, pc_1)\]

\[I_b \vdash (R''_2, S''_2, M''_2, 0) \rightarrow_{t_2^2} (R'_2, S'_2, M'_2, pc_2)\]

Then by induction, we know \(t_1^2 \equiv t_2^2,\) and therefore \(t_1 \equiv t_2,\) and all conclusions 1-5 are true.

(2)

\[I_b \vdash (R''_1, S''_1, M''_1, 0) \rightarrow_{t_2^2} (R''_1, S''_1, M''_1, |I_b|)\]
\[ I_b \vdash (R''_2, S''_2, M''_2, 0) \rightarrow_{\bar{t}'} (R'_2, S'_2, M'_2, I_b) \]
\[ I \vdash (R'_3, S'_3, M'_3, 0) \rightarrow_{t'_i} (R'_{i_1}, S'_{i_1}, M'_{i_1}, |I_b|) \]
\[ I \vdash (R''_3, S''_3, M''_3, 0) \rightarrow_{\bar{t}''} (R'_3, S'_3, M'_3, |I_b|) \]

where \( t_i = t'_i \oplus \bar{f} \oplus \bar{t}'' \oplus \bar{f} \oplus t'_i \) (\( i = 1, 2 \)). Similar to Theorem 6, we can show the conclusions 1-6 hold true for this case.

**Case T-IF.** Suppose \( I = \iota_1; I_t; \iota_2; I_f \), where \( \iota_1 = \text{br} \ r_1 \text{ rop } r_2 \rightarrow n_1 \). We need the following lemma.

**Lemma 19.** If \( H \vdash I : \langle \Upsilon, \text{Sym} \rangle \rightarrow \langle \Upsilon', \text{Sym}' \rangle ; T \), and

\[ I \vdash (R, S, M, pc) \rightarrow_t (R', S', M', pc') \]

then \( pc' > pc \).

**Proof (sketch).** A while-statement cannot be typed in a high security context, and the while-statement is the only construct that jumps or branches backward. Hence, in a high security context, the program counter only increases. \( \square \)
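The intuition behind Lemma 19 can be illustrated with a toy machine. The sketch below is a simplified model and not the formal semantics of $\mathcal{L}_T$: the instruction encodings `("br", reg, offset)` and `("jmp", offset)`, and the helper names `run` and `forward_only`, are assumptions made for illustration. When every offset is positive, the sequence of visited program counters is strictly increasing.

```python
def run(program, regs):
    """Execute a toy program; return the list of visited program counters."""
    pc, visited = 0, []
    while pc < len(program):
        visited.append(pc)
        op = program[pc]
        if op[0] == "br":                # ("br", reg, offset): jump if reg != 0
            _, reg, offset = op
            pc += offset if regs[reg] != 0 else 1
        elif op[0] == "jmp":             # ("jmp", offset): unconditional jump
            pc += op[1]
        else:                            # any other instruction falls through
            pc += 1
    return visited


def forward_only(program):
    """True if every br/jmp offset is positive, i.e. no backward edge exists."""
    return all(op[-1] > 0 for op in program if op[0] in ("br", "jmp"))


def strictly_increasing(values):
    """True if each element is larger than its predecessor."""
    return all(a < b for a, b in zip(values, values[1:]))


# Layout br; I_t; jmp; I_f with |I_t| = 2 and |I_f| = 1.
prog = [("br", "r1", 4), ("nop",), ("nop",), ("jmp", 2), ("nop",)]
fall_through = run(prog, {"r1": 0})      # executes I_t, then skips I_f
jump_taken = run(prog, {"r1": 1})        # jumps straight to I_f
```

Both executions of the conditional visit increasing program counters, which is the property the lemma relies on.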

If \( \Upsilon(r_1) = \Upsilon(r_2) = L \), then we know

\[ I \vdash (R_i, S_i, M_i, 0) \rightarrow_f (R_i, S_i, M_i, pc) \]

\[ I \vdash (R_i, S_i, M_i, pc) \rightarrow_{t'_i} (R'_{i}, S'_{i}, M'_{i}, pc_i) \]

where \( t_i = f @ t'_i \) for \( i = 1, 2 \). In this case, we can prove the conclusions by the induction assumption.

If \( \ell \sqcup \Upsilon(r_1) \sqcup \Upsilon(r_2) = H \), then by Lemma 19, we know

\[
I \vdash (R_i, S_i, M_i, 0) \rightarrow_{t'_i} (R_i, S_i, M_i, |I|)
\]

for \( i = 1, 2 \), where \( t_1 \) and \( t_2 \) are prefixes of \( t'_1 \) and \( t'_2 \). By Theorem 6, we know \( t'_1 \equiv t'_2 \).

Therefore, we know \( t_1 \equiv t_2 \). Conclusions 1-5 can be proven similarly to Theorem 6.

\[\square\]

**Proof of Theorem 2.** Suppose \( I \) is a program, \( \Upsilon \) and \( Sym \) satisfy: (1) \( \forall r. Sym(r) = ? \land \Upsilon(r) = L \); and (2) \( \forall k. Sym(k) = ? \land \Upsilon(k) = D \). There are \( \Upsilon' \) and \( Sym' \), such that

\[
\ell \vdash I : \langle \Upsilon, Sym \rangle \rightarrow \langle \Upsilon', Sym' \rangle; T
\]

where \( \ell = L \), and

\[
I \vdash (R_0, S_0, M_1, 0) \rightarrow_{t_1} (R_1, S_1, M'_1, pc_1)
\]

\[
I \vdash (R_0, S_0, M_2, 0) \rightarrow_{t_2} (R_2, S_2, M'_2, pc_2)
\]

where \( M_1 \sim_L M_2, \forall r. R_0(r) = 0, \forall k. S_0(k) = (b_0, (D, 0)), \) and \( |t_1| = |t_2| \). Clearly, we have

1. \( M_1 \sim_L M_2 \)

2. \( \Upsilon \vdash R_0 \sim R_0; \)

3. \( \Upsilon \vdash S_0 \sim S_0; \)
4. $\forall r \in \text{Registers}, k \in \text{BlockIDs}.\ \nvdash_{\text{safe}} Sym(r) \land \nvdash_{\text{safe}} Sym(k)$;

Item 4 makes assumptions 4 and 5 of Theorem 7 hold vacuously. Therefore, by applying Theorem 7, we have $t_1 \equiv t_2$, and $M_1' \sim_L M_2'$.
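The conclusion $t_1 \equiv t_2$ can be made concrete with a small experiment. The sketch below is only an illustration of the property, not of the type system or the target machine: `lookup_direct` and `lookup_scan` are hypothetical functions that record the addresses they touch, and traces are plain Python lists. The sketch models memory traces only; a real oblivious implementation would also have to avoid the secret-dependent comparison showing up in other channels.

```python
def lookup_direct(array, secret_index):
    """Read array[secret_index]; the address itself appears in the trace."""
    trace = [("read", secret_index)]
    return array[secret_index], trace


def lookup_scan(array, secret_index):
    """Touch every address and keep the wanted element."""
    trace, result = [], None
    for address, value in enumerate(array):
        trace.append(("read", address))  # same addresses for every secret
        if address == secret_index:      # selection happens on data only
            result = value
    return result, trace


public = [10, 20, 30, 40]
_, direct_a = lookup_direct(public, 1)
_, direct_b = lookup_direct(public, 3)
value_a, scan_a = lookup_scan(public, 1)
value_b, scan_b = lookup_scan(public, 3)
```

The direct lookup produces different traces for different secret indices, while the linear scan produces identical traces and still returns the requested element.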
Appendix C: Proof of Theorem 3

We begin by discussing how to construct $sim_A$; the simulator $sim_B$ is constructed similarly.

Since Alice cannot see Bob’s local data or the data that is secret-shared between the two parties, we introduce a special symbol $\bullet$ to denote values that are not observable to Alice. We define the operations on $\bullet$ as follows:

\[
\bullet \ op \ v = \bullet \quad v \ op \bullet = \bullet \quad \bullet \ (v) = \bullet \quad m(\bullet) = \bullet
\]
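The propagation rules for $\bullet$ can be sketched in a few lines. This is a minimal illustration under assumptions: `Bullet`, `BULLET`, `apply_op`, and `get` are hypothetical names, and arrays are modelled as Python lists.

```python
import operator


class Bullet:
    """Singleton standing for a value that Alice cannot observe."""

    def __repr__(self):
        return "•"


BULLET = Bullet()


def apply_op(op, left, right):
    """Binary operation that propagates BULLET from either operand."""
    if left is BULLET or right is BULLET:
        return BULLET
    return op(left, right)


def get(array, index):
    """Array lookup that yields BULLET for a hidden array or a hidden index."""
    if array is BULLET or index is BULLET:
        return BULLET
    return array[index]


hidden_sum = apply_op(operator.add, 3, BULLET)
visible_sum = apply_op(operator.add, 3, 4)
hidden_read = get([10, 20, 30], BULLET)
```

Any computation that depends on an unobservable operand is itself unobservable, which is what allows the simulator to run without Bob's data.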

We define the following auxiliary functions accordingly:

\[
(select_A(l, t, t'), select_B(l, t, t')) := select(l, t, t')
\]

\[
read_A(l, v) := \begin{cases} v & \text{if } l \sqsubseteq A \\ \bullet & \text{otherwise} \end{cases}
\]

\[
val(v, l) := v
\]

\[
val(m, l) := m
\]

\[
lab(v, l) := l
\]

\[
lab(m, l) := l
\]
We then define Alice’s snapshot of a memory $M$, denoted as $M \downarrow A$, in the following:

**Definition 11.** Given a memory $M$, Alice’s snapshot of $M$, denoted as $M \downarrow A$, is defined as a memory such that

$$M \downarrow A(x) = \begin{cases} 
M(x) & \text{if } M(x) = (v, l) \text{ where } l \sqsubseteq A \\
\bullet & \text{otherwise}
\end{cases}$$

We further define the Alice-similarity property of two memories as follows:

**Definition 12.** We say two memories $M_1$ and $M_2$ are Alice-similar, denoted as $M_1 \sim_A M_2$, if and only if $M_1 \downarrow A = M_2 \downarrow A$.
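Definitions 11 and 12 translate directly into executable form. The sketch below assumes that memories are dictionaries mapping variables to `(value, label)` pairs and that the labels below $A$ are exactly $P$ and $A$, which is how the proofs in this appendix use them. The names `snapshot_A` and `alice_similar` are hypothetical.

```python
BULLET = "•"                     # placeholder for values hidden from Alice
VISIBLE_TO_ALICE = {"P", "A"}    # assumed: the labels l with l ⊑ A


def snapshot_A(memory):
    """Alice's snapshot: keep entries labelled below A, hide the rest."""
    return {
        x: (entry if entry[1] in VISIBLE_TO_ALICE else BULLET)
        for x, entry in memory.items()
    }


def alice_similar(m1, m2):
    """Two memories are Alice-similar iff their snapshots coincide."""
    return snapshot_A(m1) == snapshot_A(m2)


M1 = {"n": (4, "P"), "a": (7, "A"), "b": (1, "B"), "s": (9, "O")}
M2 = {"n": (4, "P"), "a": (7, "A"), "b": (5, "B"), "s": (2, "O")}
M3 = {"n": (4, "P"), "a": (8, "A"), "b": (1, "B"), "s": (9, "O")}
```

Here `M1` and `M2` differ only in entries labelled `B` or `O`, so they are Alice-similar, whereas `M3` differs from `M1` in a value that Alice can see.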

Figure C.1 defines how $\text{sim}_A$ evaluates an expression. The judgement $l \vdash_A \langle M, e \rangle \Downarrow_t v$ says that, given a memory $M$, the simulator $\text{sim}_A$ evaluates an expression $e$ to value $v$, producing memory trace $t$.

Figure C.2 and Figure C.3 define how $\text{sim}_A$ simulates the instruction and memory traces until the next declassification. The judgement $\langle M_i, S_i \rangle \xrightarrow{(i,t)}_A \langle M'_i, S'_i \rangle$ says that, given a statement $S_i$ and a memory $M_i$, $\text{sim}_A$ evaluates the program $S_i$ over memory $M_i$ and reduces to program $S'_i$ and memory $M'_i$, emitting Alice’s instruction trace $i$ and memory trace $t$.

The judgement $\langle M_i, S_i \rangle \xrightarrow{(i,t)}^*_A \langle M'_i, S'_i \rangle$ is similar to $\langle M_i, S_i \rangle \xrightarrow{(i,t)}_A \langle M'_i, S'_i \rangle$, but requires that the last statement evaluated be a declassification statement. We emphasize that our rules enforce that the memory over which the program is evaluated must be $\Gamma$-compatible.

Figure C.1: Operational semantics for $\text{sim}_A$

The simulator $\text{sim}_A(M, S, D_1, ..., D_n)$ runs as follows. Initially, set $M_1$ to be $M$ and $S_1$ to be $S$. For each $i = 1, ..., n$, $\text{sim}_A$ evaluates $\langle M_i, S_i \rangle \xrightarrow{(i,t)}_A \langle M'_i, S'_i \rangle$. If $D_i = \epsilon$, then set $M_{i+1}$ to be $M'_i$; otherwise, $D_i = (x, v)$, and $M_{i+1}$ is set to $M'_i[x \mapsto v]$. Finally, $\text{sim}_A$ evaluates $\langle M_n, S_n \rangle \xrightarrow{(i,t)}_A \langle M', S' \rangle$, and returns $(i, t)$.
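The driver loop of the simulator can be sketched as follows. This is a hedged illustration, not the definition of $\text{sim}_A$: the two steppers are passed in as hypothetical callables, `None` stands for $D_i = \epsilon$, the toy steppers treat statements as opaque tuples, and concatenating the per-segment traces into the returned pair is an assumption about how $(i, t)$ is accumulated.

```python
def sim_A(M, S, declassifications, run_until_declass, run_rest):
    """Replay the program segment by segment, patching in declassified values."""
    instr_trace, mem_trace = [], []
    for D in declassifications:
        M, S, i, t = run_until_declass(M, S)
        instr_trace += i
        mem_trace += t
        if D is not None:                # D = (x, v): value revealed to Alice
            x, v = D
            M = {**M, x: v}
    M, S, i, t = run_rest(M, S)
    return instr_trace + i, mem_trace + t


def toy_until_declass(M, S):
    """Toy stepper: consume statements up to and including the first declass."""
    for pos, stmt in enumerate(S):
        if stmt[0] == "declass":
            _, x, y = stmt
            consumed = S[: pos + 1]
            return M, S[pos + 1:], [s[0] for s in consumed], [y]
    raise ValueError("no declassification left")


def toy_rest(M, S):
    """Toy stepper: run the remaining statements, which touch no memory."""
    return M, [], [s[0] for s in S], []


program = [("skip",), ("declass", "x", "y"), ("skip",)]
traces = sim_A({"y": "•"}, program, [("x", 5)], toy_until_declass, toy_rest)
```

The simulator never needs the hidden value of `y`: the declassified value arrives through the list of declassification events and is written into the memory between segments.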

The following lemma shows that the semantics for $\text{sim}_A$ generates the same memory trace as the semantics for SCVM.

**Lemma 20.** If $l \vdash \langle M, e \rangle \Downarrow_{t_a,t_b} v$ and $\Gamma \vdash e : \text{Nat} l'$ and $l \vdash_A \langle M', e \rangle \Downarrow_t v'$ and $M \sim_A M'$, and $l' \sqsubseteq l$, and $M$ and $M'$ are $\Gamma$-compatible, then $t_a \equiv t$ and if $l \sqsubseteq A$, then $v = v'$. Otherwise $v' = \bullet$.

**Proof.** We prove by structural induction on $e$. If $e = x$, then $\Gamma(x) = \text{Nat } l'$. If $l \sqsubseteq A$, then \( v' = \text{val}(M'(x)) = \text{val}(M(x)) = v \), therefore \( v = v' \). Further \( t = \text{read}(x, v') = \text{read}(x, v) = t_a \) if \( l \sqsubseteq A \). If \( l = B \), then \( v' = \bullet \), and \( t = \epsilon = t_a \). If \( l = O \), then \( v' = \bullet \), and \( t = x = t_a \).

\[
\langle M, S \rangle \xrightarrow{(i,t)}^*_A \langle M', S' \rangle
\]

**Sim-Declass**

\[
\frac{t = y \quad i = \text{declass}(x, y)}{\langle M, O : x := \text{declass}(y) \rangle \xrightarrow{(i,t)}^*_A \langle M, O : \text{skip} \rangle}
\]

**Sim-Seq**

\[
\frac{\langle M, S_1 \rangle \xrightarrow{(i,t)}^*_A \langle M', S'_1 \rangle}{\langle M, S_1; S_2 \rangle \xrightarrow{(i,t)}^*_A \langle M', S'_1; S_2 \rangle}
\]

**Sim-Concat**

\[
\frac{\langle M', S' \rangle \xrightarrow{(i,t)}^*_A \langle M'', S'' \rangle}{\langle M, S \rangle \xrightarrow{(i,t)}^*_A \langle M'', S'' \rangle}
\]

Figure C.2: Operational semantics for statements in \( \text{sim}_A \) (part 1)

If \( e = n \), then \( t = \epsilon = t_a \), and \( v' = n = v \), and \( l = P \sqsubseteq A \).

If \( e = x_1 \text{ op } x_2 \), then we know \( l \vdash_A \langle M', x_i \rangle \Downarrow_{t_i} v_i' \), and \( l \vdash \langle M, x_i \rangle \Downarrow_{(t_a^i, t_b^i)} v_i \) for \( i = 1, 2 \). By the induction assumption, we know \( t_i \equiv t_a^i \), and thus \( t = t_1 @ t_2 \equiv t_a^1 @ t_a^2 = t_a \). For its value, suppose \( \Gamma(x_i) = \text{Nat} \ l_i \) for \( i = 1, 2 \). If \( l \sqsubseteq A \), then \( l_i \sqsubseteq A \) holds true, and by the induction assumption, we know \( v_i = v_i' \) for \( i = 1, 2 \), and thus \( v = v_1 \text{ op } v_2 = v_1' \text{ op } v_2' = v' \). Otherwise, at least one of \( v_1' \) and \( v_2' \) is \( \bullet \), and thus we know \( v' = \bullet \).

If \( e = x[y] \), we first reason about the value. If \( l \sqsubseteq A \), then suppose \( \Gamma(y) = \text{Nat} \ l'' \), so that \( l'' \sqsubseteq l' \sqsubseteq l \sqsubseteq A \) according to \( \Gamma \vdash x[y] : \text{Nat} \ l' \). Then we know \( v_1' = \text{val}(M'(y)) = \text{val}(M(y)) = v_1 \). Further, we know \( (m', l) = M'(x) = M(x) = (m, l) \), and thus \( v' = \text{get}(m', v_1') = \text{get}(m, v_1) = v \). If \( l \not\sqsubseteq A \), then \( v' = \bullet \).

Then we reason about the trace. If \( l \sqsubseteq A \), then

\[
t = \text{read}(y, v_1)@\text{readarr}(x, v_1, v) \equiv \text{read}(y, v'_1)@\text{readarr}(x, v_1, v') = t_a
\]

If \( l = B \), we have \( t \equiv \epsilon \equiv t_a \). If \( l = O \), we have \( t \equiv y@x \equiv t_a \).

For \( e = \text{mux}(x_1, x_2, x_3) \), based on a very similar argument as for \( x_1 \ op \ x_2 \), we can get the conclusion. \( \square \)

The following lemma further claims that if an expression has a type \( B \), then simulating it will generate no observable instruction traces and memory traces to Alice.

**Lemma 21.** If \( \Gamma \vdash e : \text{Nat} \ l' \), and \( B \vdash \langle M, e \rangle \Downarrow t \), and \( M \) is \( \Gamma \)-compatible then \( t \equiv \epsilon \).

**Proof.** We prove by structural induction on \( e \). If \( e = x \), then \( t = \epsilon \) by rule Sim-E-Var.

If \( e = x_1 \ op \ x_2 \). Suppose \( \Gamma \vdash x_i : \text{Nat} \ l_i \) for \( i = 1, 2 \), then we know \( l_i \sqsubseteq B \).

Therefore \( t_i \equiv \epsilon \), and thus \( t \equiv \epsilon \).

If \( e = x[y] \), the conclusion follows the fact that \( B \vdash \langle M, y \rangle \Downarrow \epsilon \), and

\[
\text{select}_A(B, \text{readarr}(x, v_1, v), x) = \epsilon.
\]

If \( e = \text{mux}(x_1, x_2, x_3) \), similar to binary operation, we know \( t \equiv \epsilon \). \( \square \)
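The role of `select` in these proofs can be illustrated with a small sketch. Its definition is not restated in this appendix, so the version below is reconstructed from how it is used here: a label below $A$ gives Alice the full event, label $B$ gives Alice the empty trace, and label $O$ gives both parties only the second, address-only argument. The behaviour on Bob's side and the tuple encoding of traces are assumptions made by symmetry.

```python
EPS = ()  # the empty trace


def select(label, t_visible, t_oblivious):
    """Return (Alice's trace, Bob's trace) for an event at the given label."""
    if label == "P":                     # public: both parties see the event
        return t_visible, t_visible
    if label == "A":                     # Alice-local: Bob sees nothing
        return t_visible, EPS
    if label == "B":                     # Bob-local: Alice sees nothing
        return EPS, t_visible
    if label == "O":                     # secret-shared: only the address-level trace
        return t_oblivious, t_oblivious
    raise ValueError(f"unknown label: {label!r}")


def select_A(label, t_visible, t_oblivious):
    """Alice's component of select."""
    return select(label, t_visible, t_oblivious)[0]


event = ("readarr", "x", 1, 42)
alice_sees_for_B = select_A("B", event, ("x",))
alice_sees_for_O = select_A("O", event, ("x",))
```

With this reading, an array read at label `B` contributes nothing to Alice's trace, which is the fact used in the $e = x[y]$ case of Lemma 21.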

**Lemma 22.** If \( B \vdash S \), and \( \langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', S' \rangle \), where \( M \) is \( \Gamma \)-compatible, then \( i_a \equiv \epsilon, t_a \equiv \epsilon, \) and \( M \sim_A M' \).

**Proof.** We prove by induction on the structure of \( S \). If \( S = l : \text{skip} \), then the conclusion is trivial.
If $S = l : x := e$, then we know $l = B$ and $\Gamma(x) = \text{Nat } B$. Then $i_a = \epsilon$ and $M' \sim_A M$ follow trivially. According to Lemma 21, we can prove $t_a \equiv \epsilon$.

If $S = O : x := \text{oram}(y)$ or $S = O : x := \text{declass}_l(y)$, then $pc$ is required to be $P$, so the conclusion holds vacuously.

If $S = l : y[x_1] := x_2$, then $l = B$, and thus $i_a = \epsilon$. Moreover, $t_a = t_{1a} \circ t_{2a} \circ t_a'$, where $B \vdash \langle M, x_i \rangle \Downarrow_{(t_{ia}, t_{ib})} v_i$ for $i = 1, 2$, and $t_a' = \text{select}_A(B, \text{writearr}(y, v_1, v_2), y) = \epsilon$. Therefore $t_{1a} \equiv t_{2a} \equiv \epsilon$ according to Lemma 21. In conclusion, we have $t_a = t_{1a} \circ t_{2a} \circ t_a' \equiv \epsilon$. Finally, for memory, $M' = M[y \mapsto m'] \sim_A M$.

If $S = l : \text{if}(x)\text{then } S_1 \text{ else } S_2$, then $l = B$, and $\Gamma, B \vdash S_i$ for $i = 1, 2$. Suppose $M(x) = (v, B)$, then $\langle M, S \rangle \xrightarrow{(\epsilon, \epsilon, i_b', t_b')} \langle M, S_c \rangle$, where $v = 1 \Rightarrow c = 1$ and $v \neq 1 \Rightarrow c = 2$. There are two cases: (1) $M' = M$ and $S' = S_c$, then the conclusion is trivial; (2) $\langle M, S_c \rangle \xrightarrow{(i_a'', t_a'', i_b'', t_b'')} \langle M', S' \rangle$. In this case, by the induction assumption, we have $M' \sim_A M$, $i_a'' \equiv \epsilon$ and $t_a'' \equiv \epsilon$, so that $i_a = \epsilon \circ i_a'' \equiv \epsilon$ and $t_a = \epsilon \circ t_a'' \equiv \epsilon$.

For $S = l : \text{while}(x)\text{do } S'$, the conclusion can be proven similarly to the if-case.

Finally, for $S = S_1 ; S_2$, we know either (1) $\langle M, S_1 \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', S'_1 \rangle$; or (2) $\langle M, S_1 \rangle \xrightarrow{(i_a', t_a', i_b', t_b')} \langle M'', l : \text{skip} \rangle$ and $\langle M'', S_2 \rangle \xrightarrow{(i_a'', t_a'', i_b'', t_b'')} \langle M', S' \rangle$, where $i_a = i_a' \circ i_a''$, and $t_a = t_a' \circ t_a''$. In both cases, the conclusions can be proven easily.

The following lemma is the main lemma saying that when evaluating over Alice-similar memories, $\text{sim}_A$ and $\text{SCVM}$ will generate the same instruction traces and memory traces, and produce Alice-similar memory profiles.

**Lemma 23.** If $\langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M_1, S' \rangle : D$ where $\Gamma, P \vdash S$, and $M \sim_A M'$, and \( \langle M', S \rangle \xrightarrow{(i,t)}_A \langle M'_1, S'' \rangle \) (for \( D = \epsilon \)) or \( \langle M', S \rangle \xrightarrow{(i,t)}^*_A \langle M'_1, S'' \rangle \) (for \( D \neq \epsilon \)), then \( S' = S'' \), \( i_a \equiv i \) and \( t_a \equiv t \). If \( D = \epsilon \), then \( M_1 \sim_A M'_1 \); otherwise, suppose \( D = (x,v) \), then \( M_1 \sim_A M'_1[x \mapsto v] \).

**Proof.** The conclusion \( S' = S'' \) follows directly by examining the correspondence between the E- and S- rules and the Sim- rules. Therefore, we only prove (1) \( M_1 \sim_A M'_1 \), (2) \( i_a \equiv i \), and (3) \( t_a \equiv t \).

We prove by induction on the number of steps \( L \) taken before the declassification event \( D \) is generated. If \( L = 0 \), then we know \( S = O : x := \text{declass}_l(y); S_2 \) (or \( O : x := \text{declass}_l(y) \)). Since we assume \( \Gamma, P \vdash S \), by typing rule T-Declass, we have \( l \neq O \) and \( \Gamma(x) = \text{Nat} \ l \). If \( l \sqsubseteq A \), then \( D[A] = (x,v) \), and thus \( M'_1[x \mapsto v] \sim_A M[x \mapsto v] = M_1 \). Further, we know \( i_a = \text{declass}(x,y) = i \), and \( t_a = y = t \). Otherwise, \( l = B \), and then \( M'_1 = M' \sim_A M = M_1 \), \( i_a = \text{declass}(x,y) = i \), and \( t_a = y = t \).

We next consider \( L > 0 \), in which case \( S = S_1; S_2 \). Note that \( (S_a; S_b); S_c \) is equivalent to \( S_a; (S_b; S_c) \) in the sense that if \( \langle M,(S_a; S_b); S_c \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', S' \rangle : D \), then \( \langle M, S_a; (S_b; S_c) \rangle \xrightarrow{(i'_a,t'_a,i'_b,t'_b)} \langle M', S' \rangle : D \), where \( i_a \equiv i'_a \), \( i_b \equiv i'_b \), \( t_a \equiv t'_a \), and \( t_b \equiv t'_b \).

Therefore we may assume that \( S_1 \) is not a Seq statement, so \( S_1 = l : s_1 \). It suffices to prove claims (1)-(3) for a single step; the conclusion then follows by the induction assumption. In the following, we consider how this step is executed.

**Case** \( l : \text{skip} \). If \( S_1 = l : \text{skip} \), the conclusion is trivial, i.e. \( i_a = \epsilon = i \) and \( t_a = \epsilon = t \) and \( M'_1 = M' \sim_A M = M_1 \).

**Case** \( l : x := e \). If \( S_1 = l : x := e \), then \( i_a = l : x := e = i \). We next show \( t \equiv t_a \). If \( l \sqsubseteq A \), \( t \equiv t_a \) follows directly from Lemma 20. If \( l = B \), then by Lemma 21, we have \( t \equiv \epsilon \equiv t_a \). If \( l = O \), then we consider each form of \( e \) separately. If \( e = y \), then \( t = y@x = t_a \). If \( e = y[z] \), then \( t = z@y@x = t_a \). If \( e = n \), then \( t = x = t_a \). If \( e = y \text{ op } z \), then \( t = y@z@x = t_a \). Finally, if \( e = \text{mux}(x_1, x_2, x_3) \), then \( t = x_1@x_2@x_3@x = t_a \).

Finally, we prove the memory equivalence. If \( l \subseteq A \), then according to Lemma 20, \( e \) evaluates to the same value \( v \) in the semantics, and in the simulator. Therefore \( M'_1 = M'[x \mapsto v] \sim_A M[x \mapsto v] = M_1 \). If \( B \subseteq l \), then \( M'_1 = M' \sim_A M \sim_A M[x \mapsto v] = M_1 \). Therefore, the conclusion is true.

**Case** \( O : x := \text{oram}(y) \). It is easy to see that \( M'_1 = M' \sim_A M \sim_A M[x \mapsto m] = M_1 \), and \( i = O : \text{init}(x, y) = i_a \). Suppose \( \Gamma(y) = \text{Nat } l \), then we know \( l \neq O \). If \( l \sqsubseteq A \), then \( t = y@x = t_a \). Otherwise, \( l = B \), and then we know \( t = x = t_a \).

**Case** \( l : y[x_1] := x_2 \). By typing rule T-ArrAss, we know \( \Gamma(y) = \text{Array } l \), \( \Gamma(x_1) = \text{Nat } l_1 \), \( \Gamma(x_2) = \text{Nat } l_2 \), where \( l_1 \sqsubseteq A \) and \( l_2 \sqsubseteq A \). If \( l \sqsubseteq A \), then we have \( t_a = \text{read}(x_1, v_1)@\text{read}(x_2, v_2)@\text{writearr}(y, v_1, v_2) = t \), and \( i_a = l : y[x_1] := x_2 = i \).

For memory, \( M'_1 = M'[y \mapsto \text{set}(m, v_1, v_2)] \sim_A M[y \mapsto \text{set}(m, v_1, v_2)] = M_1, \) where \( (m, l) = M'(y) = M(y), (v_1, l_1) = M(x_1), \) and \( (v_2, l_2) = M(x_2) \).

If \( l = B \), then \( M'_1 = M' \sim_A M \sim_A M[y \mapsto m'] = M_1, i = \epsilon = i_a, t = \epsilon = t_a \).

**Case** \( l : \text{if}(x)\text{then } S_1\text{else } S_2 \). If \( l = B \), then according to Lemma 22, \( M'_1 = M' \sim_A M \sim_A M_1 \), \( t \equiv \epsilon \equiv t_a \), and \( i \equiv \epsilon \equiv i_a \). If \( l \sqsubseteq A \), then \( i = l : \text{if}(x) = i_a \), and \( t = \text{read}(x, v) = t_a \). Further, \( M'_1 = M' \sim_A M = M_1 \). Therefore, the conclusion is also true.

**Case** \( l : \text{while}(x)\text{do } S' \). For \( S_1 = \text{while}(x)\text{do } S_b \), the proof is very similar to the if-statement. \( \square \)

Lemma 23 immediately shows that $\text{sim}_A$ can simulate the correct traces. Therefore Theorem 3 holds true. Q.E.D.

C.1 Proof of Theorem 4

Theorem 4 is a corollary of the following theorem:

**Theorem 8.** If \( \Gamma, pc \vdash S \), then either \( S \) is \( l : \texttt{skip} \), or for any \( \Gamma \)-compatible memory \( M \), there exist \( i_a, t_a, i_b, t_b, M', S' \), \( D \) such that \( \langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', S' \rangle : D \), \( M' \) is \( \Gamma \)-compatible, and \( \Gamma, pc \vdash S' \).

**Proof.** We prove by induction on the structure of \( S \). If \( S = l : \texttt{skip} \), then the conclusion is trivial.

If \( S = l : x := e \), then \( \Gamma(x) = \text{Nat} \ l \), \( \Gamma \vdash e : \text{Nat} \ l' \), and \( pc \sqcup l' \sqsubseteq l \). We proceed by cases on the form of \( e \). If \( e = x' \), then we know \( \Gamma(x') = \text{Nat} \ l' \). Since \( M \) is \( \Gamma \)-compatible, we know \( M(x') = (v, l') \), where \( v \in \text{Nat} \). Therefore, \( l \vdash \langle M, x' \rangle \Downarrow_{(t_a, t_b)} v \), where \( (t_a, t_b) = \text{select}(l, \text{read}(x', v), x') \), and thus \( \langle M, S \rangle \xrightarrow{(i_a,t'_a,i_b,t'_b)} \langle M', l : \texttt{skip} \rangle : \epsilon \), where \( (i_a, i_b) = \text{inst}(l, x := e) \), \( t'_a = t_a@t''_a \), and \( t'_b = t_b@t''_b \), where \( (t''_a, t''_b) = \text{select}(l, \text{write}(x, v), x) \).

Further, \( M' = M[x \mapsto (v, l)] \). Therefore, \( M' \) is also \( \Gamma \)-compatible, and the conclusion holds true. Similarly, we can prove that if \( \Gamma \vdash e : \tau \) is derived using T-Const, T-Op, T-Array, or T-Mux, then the conclusion is also true.
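The preservation of $\Gamma$-compatibility by an assignment can be checked on a small example. The definition of $\Gamma$-compatibility is not restated in this appendix, so the sketch below assumes the reading suggested by its use in the proof: every variable declared in $\Gamma$ is stored in $M$ with the declared label and a value of the declared kind. The names `is_compatible` and `assign_var`, and the encoding of types as `(kind, label)` pairs, are hypothetical.

```python
def is_compatible(gamma, memory):
    """Check that memory stores each declared variable with its declared label."""
    for x, (kind, label) in gamma.items():
        if x not in memory:
            return False
        value, stored_label = memory[x]
        if stored_label != label:
            return False
        if kind == "Nat" and not (isinstance(value, int) and value >= 0):
            return False
        if kind == "Array" and not isinstance(value, list):
            return False
    return True


def assign_var(gamma, memory, x, source):
    """x := source: copy the value and tag it with the declared label of x."""
    value, _ = memory[source]
    _, label = gamma[x]
    updated = dict(memory)
    updated[x] = (value, label)
    return updated


gamma = {"x": ("Nat", "A"), "y": ("Nat", "P")}
M = {"x": (0, "A"), "y": (3, "P")}
M_after = assign_var(gamma, M, "x", "y")
M_bad = {"x": (0, "B"), "y": (3, "P")}   # wrong label for x
```

The updated memory stores the copied value under the label that $\Gamma$ assigns to `x`, so compatibility is preserved, while a memory with a mismatched label is rejected.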

If \( S = O : x := \texttt{declass}_l(y) \), then \( \Gamma(y) = \text{Nat} \ O \), \( \Gamma(x) = \text{Nat} \ l \) where \( l \neq O \), and \( pc = P \). Since \( M \) is \( \Gamma \)-compatible, we know \( M(y) = (v, O) \) and \( M' = M[x \mapsto (v, l)] \). Therefore \( M' \) is \( \Gamma \)-compatible. Further, \( t_a = y = t_b \), \( i_a = i_b = O : \texttt{declass}(x, y) \), \( D = \texttt{select}(l, (x, v), \epsilon) \), and \( \langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', O : \texttt{skip} \rangle : D \), and we know that \( \Gamma, P \vdash O : \text{skip} \). Therefore the conclusion is true.

Similarly, we can prove the conclusion is true for \( S = O : x := \text{oram}(y) \).

For \( S = l : y[x_1] := x_2 \), we have \( \Gamma(y) = \text{Array} \ l \), \( \Gamma(x_1) = \text{Nat} \ l_1 \), \( \Gamma(x_2) = \text{Nat} \ l_2 \), and \( pc \sqcup l_1 \sqcup l_2 \sqsubseteq l \). Since \( M \) is \( \Gamma \)-compatible, we know \( M(y) = (m, l) \), \( M(x_1) = (v_1, l_1) \), and \( M(x_2) = (v_2, l_2) \). Therefore \( M' = M[y \mapsto (\text{set}(m, v_1, v_2), l)] \) is also \( \Gamma \)-compatible. Further, \( (t'_a, t'_b) = \text{select}(l, \text{writearr}(y, v_1, v_2), y) \), \( t_a = t_{1a} @ t_{2a} @ t'_a \), \( t_b = t_{1b} @ t_{2b} @ t'_b \), and \( (i_a, i_b) = \text{inst}(l, y[x_1] := x_2) \). Therefore, \( \langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', l : \texttt{skip} \rangle \), and \( \Gamma, pc \vdash l : \texttt{skip} \) follows easily. Therefore, the conclusion is true.

For \( S = l : \text{if}(x) \ \text{then} \ S_1 \ \text{else} \ S_2 \), we have \( \Gamma(x) = \text{Nat} \ l \). Therefore \( M(x) = (v, l) \). If \( v = 1 \), then \( \langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M, S_1 \rangle \), where \( (i_a, i_b) = \text{inst}(l, \text{if}(x)) \) and \( (t_a, t_b) = \text{select}(l, \text{read}(x, v), x) \). Further, we know \( \Gamma, l \vdash S_1 \). Since \( pc \sqsubseteq l \), it is easy to prove by induction that \( \Gamma, pc \vdash S_1 \) holds as well. Therefore, the conclusion is true. On the other hand, if \( v \neq 1 \), then \( \langle M, S \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M, S_2 \rangle \), and the conclusion follows by the same argument.

The proof for \( S = l : \text{while}(x) \ \text{do} \ S' \) is similar to that for the branching statement, using rules S-While-True and S-While-False.

For \( S = S_1 ; S_2 \), we know \( \Gamma, pc \vdash S_1 \). The conclusion follows directly from the induction hypothesis by applying rules S-Seq and S-Skip. \( \square \)
Appendix D: The hybrid protocol and the proof of Theorem 5

In this section, we present the hybrid protocol, and show it emulates the ideal world functionality $\mathcal{F}$. To start with, we present smaller ideal functionalities in $\mathcal{G}$ used by the hybrid world protocol.

1. $\mathcal{F}_{op}^{(l_1,l_2)}$ are the ideal functionalities for a binary operation $op$. They are parameterized by two type labels $l_1$ and $l_2$ from $\{P, A, B, O\}$, indicating which party provides the data to the functionality. Suppose the operation is $x \text{ op } y$; then $l_1$ and $l_2$ correspond to $x$ and $y$ respectively. If $l_1$ is $P$, then both Alice and Bob hand in the value of $x$, and the functionality verifies that these two values are the same. If $l_1$ is $A$ (or $B$), then Alice (or Bob) hands in the value of $x$ to the functionality. If $l_1$ is $O$, then Alice and Bob each hand in their secret share of $x$. The value of $l_2$ has the same meaning, but for the data source of $y$. These ideal functionalities output secret shares of the result to Alice and Bob respectively. For example, $\mathcal{F}_{op}^{(P,A)}$ accepts inputs $x, y$ from Alice and $x$ from Bob, and returns the result shares $[v]_a$ to Alice and $[v]_b$ to Bob. We denote this as $([v]_a, [v]_b) = \mathcal{F}_{op}^{(P,A)}(x@y, x)$.

2. $\mathcal{F}_{mux}^{(l_1,l_2,l_3)}$ are the ideal functionalities for the multiplex operations. The three parameters $l_1$, $l_2$, and $l_3$ have the same meaning as above, but correspond to the three inputs of the multiplex operation. These functionalities also return secret shares to Alice and Bob.

3. $\mathcal{F}_{\text{oram}}^x$, for each array $x$, is an interactive Oblivious RAM functionality. It supports three operations:

   - **init**: to initialize the ORAM with a given array. It is parameterized by a label $l$ from $\{P, A, B\}$. If $l$ is $P$ or $A$, then Alice hands in her array. If $l$ is $B$, then Bob hands in his array.

   - **read**: to read the content for a given index. The index is provided as a pair of secret shares from Alice and Bob. The output is also a pair of secret shares, which are returned to Alice and Bob respectively.

   - **write**: to write a value at a given index. It takes four inputs: Alice's and Bob's secret shares of the index, and their secret shares of the value.

4. $\mathcal{F}_{\text{declass}}^l$ is the declassification functionality, which takes secret shares from Alice and Bob as its input, and returns the revealed value to the party corresponding to $l$.
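To make the input conventions of these functionalities concrete, here is a minimal Python sketch of $\mathcal{F}_{op}^{(l_1,l_2)}$ and the declassification functionality. It is an illustration only: the text does not fix a secret-sharing scheme, so the sketch assumes additive shares modulo $2^{32}$, and the names `share`, `reconstruct`, `f_op`, and `f_declass` are hypothetical. Revealing to both parties when $l = P$ is likewise an assumption.

```python
import operator
import secrets

MOD = 2**32  # assumed share modulus; the text does not fix a sharing scheme


def share(v):
    """Split v into two additive shares modulo MOD."""
    r = secrets.randbelow(MOD)
    return r, (v - r) % MOD


def reconstruct(share_a, share_b):
    return (share_a + share_b) % MOD


def _take(label, alice_in, bob_in):
    """Fetch one operand according to its label, consuming the parties' inputs."""
    if label == "P":
        a, b = next(alice_in), next(bob_in)
        if a != b:
            raise ValueError("parties disagree on a public value")
        return a
    if label == "A":
        return next(alice_in)
    if label == "B":
        return next(bob_in)
    if label == "O":
        return reconstruct(next(alice_in), next(bob_in))
    raise ValueError(f"unknown label: {label!r}")


def f_op(op, l1, l2, alice_inputs, bob_inputs):
    """Ideal functionality for `x op y`; returns (Alice's share, Bob's share)."""
    a_it, b_it = iter(alice_inputs), iter(bob_inputs)
    x = _take(l1, a_it, b_it)
    y = _take(l2, a_it, b_it)
    return share(op(x, y) % MOD)


def f_declass(label, share_a, share_b):
    """Reveal the shared value to the party (or parties) named by `label`."""
    v = reconstruct(share_a, share_b)
    return (v if label in ("P", "A") else None,
            v if label in ("P", "B") else None)


# The example from the text: F_op^(P,A)(x@y, x) with x = 3 (public), y = 4 (Alice's).
v_a, v_b = f_op(operator.add, "P", "A", [3, 4], [3])
```

Each share on its own is a uniformly random value; only the two shares together determine the result.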

The protocol $\pi^G$ is then presented in Figure D.1 and Figure D.2. During the protocol’s execution, Alice and Bob consume their instruction traces and memory traces. Since the memory traces contain all information about the public memory and their local memories, Alice and Bob store locally only their secret shares $[M]_A$ and $[M]_B$ and the instruction and memory traces.

Figure D.1 presents the rules for local execution. Since all local and public data to be used in secure computation are contained in memory traces, Alice and Bob do not maintain their local data and public data. The rules are in the form of \((i,t) \rightarrow (\epsilon,\epsilon)\), which means the instruction trace \(i\) and memory trace \(t\) will be consumed. Each rule consumes exactly one local instruction, i.e. one whose security label \(l \neq O\), together with its corresponding memory trace. It is not hard to verify the following proposition:

**Proposition 1.** Assume \(\langle M, S \rangle \xrightarrow{(i_a,t_a,i_b,t_b)} \langle M', S' \rangle : \epsilon \), and let \(s\) be a statement in the set \(\{x := e, x[x] := x, \text{if}(x), \text{while}(x)\}\). If \(i_a = l : s\), where \(l \neq O\), then \((i_a,t_a) \rightarrow (\epsilon,\epsilon)\); if \(i_b = l : s\), where \(l \neq O\), then \((i_b,t_b) \rightarrow (\epsilon,\epsilon)\).

Note that the local execution rules handle only a single instruction. Sequences of multiple instructions are handled using rules H-LocalA, H-LocalB, and H-Seq, explained later.

Figure D.2 presents two parts. The first part consists of the rules, in the form of \(\langle [M]_A, t_a, [M]_B, t_b, e \rangle \Downarrow ([v]_a, [v]_b)\), which securely evaluate an expression \(e\). \([M]_A\) and \([M]_B\) are the mappings from variables to their secret shares, and \(t_a\) and \(t_b\) are the memory traces from Alice and Bob respectively. All information needed to evaluate \(e\) is contained in \([M]_A\), \([M]_B\), \(t_a\), and \(t_b\). The result is in the format of \(([v]_a, [v]_b)\), where \([v]_a\) and \([v]_b\) are the secret shares of the result for Alice and Bob respectively. The rules require that \(t_a\) and \(t_b\) be the memory traces generated by the ideal functionality \(\mathcal{F}\) when evaluating \(e\).

Rule SE-Const deals with a constant expression $n$. $([v]_a, [v]_b)$ can be acquired by secret-sharing $n$, which is implemented using $\mathcal{F}_{+}^{(A,B)}(n, 0)$. Rule SE-Var secret-shares the value of a variable expression $x$. The value of $x$ can be extracted from $[M]_A$ and $[M]_B$, $t_a$, or $t_b$ according to $\Gamma(x)$. If $\Gamma(x)$ is $P$ or $A$, then $t_a = \text{read}(x, v)$, and $[v]_a$ and $[v]_b$ can be computed using $\mathcal{F}_{+}^{(A,B)}(v, 0)$. If $\Gamma(x)$ is $B$, the computation is similar, but Bob hands in his value $v$. If $\Gamma(x)$ is $O$, then $[M]_A(x)$ and $[M]_B(x)$ can be directly returned.

Rule SE-Op handles a binary operation $x \text{ op } y$, which is computed directly using the binary operation functionality $\mathcal{F}_{\text{op}}^{(\Gamma(x), \Gamma(y))}$. The inputs to the functionality are $[M]_A\langle t_a \rangle$ and $[M]_B\langle t_b \rangle$, defined as follows. Suppose $[M]$ is a mapping from variables to secret shares, and $t$ is a trace. Then $[M]\langle t \rangle$ is defined inductively as

$$
[M]\langle \text{read}(x, v) \rangle = v \qquad [M]\langle x \rangle = [M](x) \qquad [M]\langle t_1 @ t_2 \rangle = [M]\langle t_1 \rangle @ [M]\langle t_2 \rangle
$$

Notice that $[M]\langle t \rangle$ is defined only over $\text{read}(x, v)$, $x$, and concatenations of them. This is because the notion is used for binary and multiplex operations, where array read events and write events do not occur. The rule SE-Mux for the multiplex operation is similar.
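The inductive definition above amounts to a single pass over the trace. The following Python sketch illustrates it under an assumed encoding that is not part of the formalism: a trace is a list of events, `("read", x, v)` stands for $\text{read}(x, v)$, `("var", x)` stands for the bare event $x$, and the share mapping is a dictionary. The function name is hypothetical.

```python
def inputs_from_trace(shares, trace):
    """Compute the functionality inputs that one party derives from its trace.

    A read event carries the cleartext value the party already knows;
    a bare variable event means the value is secret, so the party
    contributes its own share of that variable instead.
    """
    inputs = []
    for event in trace:
        if event[0] == "read":
            _, _name, value = event
            inputs.append(value)
        elif event[0] == "var":
            inputs.append(shares[event[1]])
        else:
            # Array events cannot occur for binary and multiplex operations.
            raise ValueError(f"unexpected event: {event!r}")
    return inputs


# Alice evaluating `x op y` where x is visible to her and y is secret-shared.
alice_shares = {"y": 17}
alice_inputs = inputs_from_trace(alice_shares, [("read", "x", 5), ("var", "y")])
```

Concatenation of traces corresponds to list concatenation of the resulting inputs, which matches the third clause of the definition.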

For an array expression $y[x]$, there are two rules, SE-ArrVar and SE-L-ArrVar. If $\Gamma(y) = O$, then evaluating $y[x]$ is an ORAM read operation. Rule SE-ArrVar calls the ORAM functionality $\mathcal{F}_{\text{oram}}^{y}$ to get the secret shares $([v]_a, [v]_b)$. Otherwise, $y[x]$ can be computed locally, and rule SE-L-ArrVar handles this case.

The second part of the rules (Figure D.2) defines the hybrid protocol. These rules are in the form of \( \langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M']_A, i'_a, t'_a \rangle, \langle [M']_B, i'_b, t'_b \rangle : D \), meaning that Alice and Bob, holding their shares of the secret variables (\([M]_A \) and \([M]_B \) respectively), execute over their simulated traces (\( i_a \) and \( t_a \) for Alice, and \( i_b \) and \( t_b \) for Bob), producing new shares \([M']_A \) and \([M']_B \), new traces \( i'_a \), \( t'_a \), \( i'_b \), and \( t'_b \), and a declassification \( D \), which is either \( \epsilon \) or \((d_a, d_b)\).

Rule H-Assign deals with the instruction \( O : x := e \). The trace must be in the format of \( t_a = t'_a@x \) and \( t_b = t'_b@x \), where \((t'_a, t'_b)\) are the memory traces for Alice and Bob to evaluate \( e \). This rule first evaluates the expression \( e \) to get \([v]_a\) and \([v]_b\). Then it substitutes the mapping for \( x \) in \([M]_A \) and \([M]_B \) accordingly.

Rule H-ORAM handles the ORAM initialization instruction \( O : \text{init}(x, y) \). Either Alice’s or Bob’s memory trace must be \( \text{readarr}(y, 0, m(0))@...@\text{readarr}(y, l, m(l))@x \), where \( l = |m| - 1 \). From this trace, that party is able to reconstruct the array \( m \), which is then fed into the ORAM functionality \( \mathcal{F}^x_{\text{oram}} \) to initialize it.
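The reconstruction step in rule H-ORAM can be sketched in Python as follows. The event encoding is an assumption made for illustration: `("readarr", y, k, v)` stands for \( \text{readarr}(y, k, v) \) and `("var", x)` for the trailing event \( x \). The function name is hypothetical.

```python
def array_from_init_trace(trace, y, x):
    """Rebuild the array m from readarr(y,0,m(0)) @ ... @ readarr(y,l,m(l)) @ x.

    Raises ValueError if the trace does not have exactly this shape.
    """
    *reads, last = trace  # fails with ValueError on an empty trace
    if last != ("var", x):
        raise ValueError("trace must end with the ORAM variable event")
    m = []
    for k, event in enumerate(reads):
        if len(event) != 4:
            raise ValueError(f"unexpected event: {event!r}")
        tag, array_name, index, value = event
        # Indices must run 0, 1, ..., l in order, all on the same array y.
        if tag != "readarr" or array_name != y or index != k:
            raise ValueError(f"unexpected event: {event!r}")
        m.append(value)
    return m


trace = [("readarr", "y", 0, 10), ("readarr", "y", 1, 20),
         ("readarr", "y", 2, 30), ("var", "x")]
m = array_from_init_trace(trace, "y", "x")
```

The party whose trace contains the read events already knows the array in the clear, so handing `m` to the ORAM functionality reveals nothing beyond what that party's trace already exposes.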

Rule H-ArrAss handles the instruction \( O : y[x_1] := x_2 \). First, evaluating \( x_i \) yields the secret shares \([v]_{ia}\) and \([v]_{ib}\), for \( i = 1, 2 \). These shares are then fed into the ORAM functionality \( \mathcal{F}^y_{\text{oram}} \) to perform a write operation.
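The interface that rules SE-ArrVar and H-ArrAss expect from the ORAM functionality can be sketched as a small Python class. This models only the ideal functionality (a trusted party holding the array), not an actual ORAM construction. Additive secret sharing modulo $2^{32}$ and all names below are assumptions made for the sake of the example.

```python
import secrets

MOD = 2**32  # assumed share modulus; the text does not fix a sharing scheme


def share(v):
    """Split v into two additive shares modulo MOD."""
    r = secrets.randbelow(MOD)
    return r, (v - r) % MOD


def reconstruct(share_a, share_b):
    return (share_a + share_b) % MOD


class OramFunctionality:
    """Ideal ORAM functionality for one array: init, read, write."""

    def __init__(self):
        self._cells = []

    def init(self, array):
        # The array is handed in by whichever party owns it.
        self._cells = list(array)

    def read(self, idx_a, idx_b):
        # The index arrives as a pair of shares; the result leaves as shares.
        idx = reconstruct(idx_a, idx_b)
        return share(self._cells[idx])

    def write(self, idx_a, idx_b, val_a, val_b):
        # Four inputs: both parties' shares of the index and of the value.
        self._cells[reconstruct(idx_a, idx_b)] = reconstruct(val_a, val_b)


oram_y = OramFunctionality()
oram_y.init([10, 20, 30])
i_a, i_b = share(1)
v_a, v_b = share(99)
oram_y.write(i_a, i_b, v_a, v_b)   # y[1] := 99
r_a, r_b = oram_y.read(i_a, i_b)   # shares of y[1]
```

Neither party learns the index or the value from its own inputs and outputs, since each sees only uniformly random shares.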

Rule H-Cond-While handles \( O : \text{if}(x) \) and \( O : \text{while}(x) \). It only consumes the corresponding memory trace \( x \), and does not modify \([M]_A \) and \([M]_B \).

Rule H-Declass handles the instruction \( O : \text{declass}(x, y) \), which is the only instruction generating a non-empty declassification. According to rule S-Declass, both memory traces are \( y \). The rule calls the declassification functionality \( \mathcal{F}^{\Gamma(x)}_{\text{declass}}([M]_A(y), [M]_B(y)) \) to release the value \( v \) to the party corresponding to \( \Gamma(x) \).

The rules discussed above each handle only one instruction. A proposition similar to Proposition 1 holds for the hybrid rules. We start by introducing the concept of consistency of a secret-sharing mapping with a memory:

**Definition 13.** Given a type environment $\Gamma$, we say a pair of secret share mappings $[M]_A$ and $[M]_B$ is consistent with a $\Gamma$-compatible memory $M$ if and only if for all $x$ such that $\Gamma(x) = O$, $M(x) = \mathcal{F}_{\text{declass}}([M]_A(x), [M]_B(x))$.
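Definition 13 can be read as a simple check over the secret variables. The Python sketch below illustrates it under stated assumptions: additive shares modulo $2^{32}$ (the text does not fix a sharing scheme), a type environment represented as a dictionary from variable names to labels, and a memory represented as a dictionary from names to cleartext values. All names are hypothetical.

```python
MOD = 2**32  # assumed share modulus


def reconstruct(share_a, share_b):
    """Stand-in for F_declass: recombine two additive shares."""
    return (share_a + share_b) % MOD


def is_consistent(gamma, memory, shares_a, shares_b):
    """Check that the share mappings agree with memory on every O-labelled variable."""
    return all(
        memory[x] == reconstruct(shares_a[x], shares_b[x])
        for x, label in gamma.items()
        if label == "O"
    )


gamma = {"p": "P", "s": "O"}
memory = {"p": 7, "s": 42}
shares_a = {"s": 1000}
shares_b = {"s": (42 - 1000) % MOD}
consistent = is_consistent(gamma, memory, shares_a, shares_b)
```

Variables with labels other than $O$ are ignored by the check, which matches the protocol's design: their values live in the traces, not in the share mappings.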

Now we are ready to present the following proposition.

**Proposition 2.** Assume $\langle M, P \rangle \xrightarrow{(i_a, t_a, i_b, t_b)} \langle M', P' \rangle : \epsilon$, and let $s$ denote an element of the set $\{x := e, x[x] := x, \text{if}(x), \text{while}(x), \text{init}(x, y), \text{declass}(x, y)\}$. If $i_a = i_b = O : s$, and $[M]_A$ and $[M]_B$ are consistent with $M$, then

$$\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M']_A, i'_a, t'_a \rangle, \langle [M']_B, i'_b, t'_b \rangle : D,$$

and $[M']_A$ and $[M']_B$ are consistent with $M'$.

The remaining four rules deal with multiple instructions. H-Seq and H-Concat are similar to S-Seq and S-Concat respectively. H-LocalA and H-LocalB are used to execute local and public instructions.

We first show that the hybrid protocol $\pi^G$ generates the same declassification events as the ideal world functionality $\mathcal{F}$. This can be easily proved by induction, leveraging Proposition 2.

We then show that the hybrid protocol $\pi^G$ securely emulates the ideal world functionality $\mathcal{F}$ (Theorem 5). We suppose Alice is the semi-honest adversary; Bob’s case is symmetric. The adversary of $\pi^G$ can learn $i_a$, $t_a$, a sequence of secret share mappings $[M]_A, [M']_A, \ldots$, and the declassification events $D^1_A, D^2_A, \ldots$. In the ideal world, an adversary can learn all the declassification events $D^1_A, \ldots$, and it can run the simulation to obtain $i_a$ and $t_a$. Further, the secret share mappings $[M]_A, [M']_A, \ldots$ are indistinguishable from random bits. Therefore, the adversary in the ideal world can simulate the view of the hybrid world’s adversary.
Figure C.3: Operational semantics for statements in $\text{sim}_A$ (part 2)
\[
\langle [M]_A, t_a, [M]_B, t_b, e \rangle \Downarrow ([v]_a, [v]_b)
\]

**SE-Const**

\[
\dfrac{([v]_a, [v]_b) = \mathcal{F}^{(A,B)}_{+}(n, 0)}
{\langle [M]_A, \epsilon, [M]_B, \epsilon, n \rangle \Downarrow ([v]_a, [v]_b)}
\]

**SE-Op**

\[
\dfrac{([v]_a, [v]_b) = \mathcal{F}^{(\Gamma(x), \Gamma(y))}_{\text{op}}([M]_A\langle t_a \rangle, [M]_B\langle t_b \rangle)}
{\langle [M]_A, t_a, [M]_B, t_b, x \text{ op } y \rangle \Downarrow ([v]_a, [v]_b)}
\]

**SE-Var**

\[
\dfrac{
\begin{array}{c}
(t_a, t_b) = \text{select}(\Gamma(x), \text{read}(x, v), x) \\
\Gamma(x) \sqsubseteq A \Rightarrow ([v]_a, [v]_b) = \mathcal{F}^{(A,B)}_{+}(v, 0) \\
\Gamma(x) = B \Rightarrow ([v]_a, [v]_b) = \mathcal{F}^{(A,B)}_{+}(0, v) \\
\Gamma(x) = O \Rightarrow ([v]_a, [v]_b) = ([M]_A(x), [M]_B(x))
\end{array}
}
{\langle [M]_A, t_a, [M]_B, t_b, x \rangle \Downarrow ([v]_a, [v]_b)}
\]

**SE-Mux**

\[
\dfrac{([v]_a, [v]_b) = \mathcal{F}^{(\Gamma(x), \Gamma(y), \Gamma(z))}_{\text{mux}}([M]_A\langle t_a \rangle, [M]_B\langle t_b \rangle)}
{\langle [M]_A, t_a, [M]_B, t_b, \text{mux}(x, y, z) \rangle \Downarrow ([v]_a, [v]_b)}
\]

**SE-ArrVar**

\[
\dfrac{
\begin{array}{c}
t_a = t'_a @ y \qquad t_b = t'_b @ y \\
([v]_a, [v]_b) = \mathcal{F}^{y}_{\text{oram}}(\text{read}, [v']_a, [v']_b)
\end{array}
}
{\langle [M]_A, t_a, [M]_B, t_b, y[x] \rangle \Downarrow ([v]_a, [v]_b)}
\]

**SE-L-ArrVar**

\[
\dfrac{
\begin{array}{c}
t_a = t'_a @ t''_a \qquad t_b = t'_b @ t''_b \qquad \Gamma(y) \neq O \\
(t'_a, t'_b) = \text{select}(\Gamma(y), \text{readarr}(y, v_1, v), y) \\
\Gamma(y) \sqsubseteq A \Rightarrow ([v]_a, [v]_b) = \mathcal{F}^{(A,B)}_{+}(v, 0) \\
\Gamma(y) = B \Rightarrow ([v]_a, [v]_b) = \mathcal{F}^{(A,B)}_{+}(0, v)
\end{array}
}
{\langle [M]_A, t_a, [M]_B, t_b, y[x] \rangle \Downarrow ([v]_a, [v]_b)}
\]
Figure D.1: Hybrid Protocol $\pi^G$ (Part I)

\[
\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M']_A, i'_a, t'_a \rangle, \langle [M']_B, i'_b, t'_b \rangle : D
\]

**H-ORAM**

\[
\dfrac{
\begin{array}{c}
i = O : \text{init}(x, y) \qquad t_a = t'_a @ x \qquad t_b = t'_b @ x \\
(t'_a, t'_b) = \text{select}(\Gamma(y), \text{arr}(y, m)) \qquad \mathcal{F}^{x}_{\text{oram}}(\text{init}(y), m)
\end{array}
}
{\langle [M]_A, i, t_a \rangle, \langle [M]_B, i, t_b \rangle \leadsto \langle [M]_A, \epsilon, \epsilon \rangle, \langle [M]_B, \epsilon, \epsilon \rangle : \epsilon}
\]

**H-Assign**

\[
\dfrac{
\begin{array}{c}
i = O : x := e \qquad \langle [M]_A, t'_a, [M]_B, t'_b, e \rangle \Downarrow ([v]_a, [v]_b) \\
t_a = t'_a @ x \qquad t_b = t'_b @ x \\
[M']_A = [M]_A[x \mapsto [v]_a] \qquad [M']_B = [M]_B[x \mapsto [v]_b]
\end{array}
}
{\langle [M]_A, i, t_a \rangle, \langle [M]_B, i, t_b \rangle \leadsto \langle [M']_A, \epsilon, \epsilon \rangle, \langle [M']_B, \epsilon, \epsilon \rangle : \epsilon}
\]

**H-ArrAss**

\[
\dfrac{
\begin{array}{c}
i = O : y[x_1] := x_2 \qquad t_a = t'_a @ y \qquad t_b = t'_b @ y \\
(t_{ja}, t_{jb}) = \text{select}(\Gamma(x_j), \text{read}(x_j, v_j), x_j) \\
\langle [M]_A, t_{ja}, [M]_B, t_{jb}, x_j \rangle \Downarrow ([v]_{ja}, [v]_{jb}) \qquad j = 1, 2 \\
\mathcal{F}^{y}_{\text{oram}}(\text{write}, [v]_{1a}, [v]_{1b}, [v]_{2a}, [v]_{2b})
\end{array}
}
{\langle [M]_A, i, t_a \rangle, \langle [M]_B, i, t_b \rangle \leadsto \langle [M]_A, \epsilon, \epsilon \rangle, \langle [M]_B, \epsilon, \epsilon \rangle : \epsilon}
\]

**H-Cond-While**

\[
\dfrac{i = O : \text{if}(x) \quad \text{or} \quad i = O : \text{while}(x) \qquad t_a = t_b = x}
{\langle [M]_A, i, t_a \rangle, \langle [M]_B, i, t_b \rangle \leadsto \langle [M]_A, \epsilon, \epsilon \rangle, \langle [M]_B, \epsilon, \epsilon \rangle : \epsilon}
\]

**H-Declass**

\[
\dfrac{
\begin{array}{c}
i = O : \text{declass}(x, y) \qquad t_a = t_b = y \\
v = \mathcal{F}_{\text{declass}}([M]_A(y), [M]_B(y)) \qquad D = \text{select}(\Gamma(x), (x, v), (x, v))
\end{array}
}
{\langle [M]_A, i, t_a \rangle, \langle [M]_B, i, t_b \rangle \leadsto \langle [M]_A, \epsilon, \epsilon \rangle, \langle [M]_B, \epsilon, \epsilon \rangle : D}
\]

**H-Seq**

\[
\dfrac{
\begin{array}{c}
\langle [M]_A, i'_a, t'_a \rangle, \langle [M]_B, i'_b, t'_b \rangle \leadsto \langle [M']_A, \epsilon, \epsilon \rangle, \langle [M']_B, \epsilon, \epsilon \rangle : D \\
i_a = i'_a @ i''_a \qquad t_a = t'_a @ t''_a \qquad i_b = i'_b @ i''_b \qquad t_b = t'_b @ t''_b
\end{array}
}
{\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M']_A, i''_a, t''_a \rangle, \langle [M']_B, i''_b, t''_b \rangle : D}
\]

**H-Concat**

\[
\dfrac{
\begin{array}{c}
\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M']_A, i'_a, t'_a \rangle, \langle [M']_B, i'_b, t'_b \rangle : D \\
\langle [M']_A, i'_a, t'_a \rangle, \langle [M']_B, i'_b, t'_b \rangle \leadsto \langle [M'']_A, i''_a, t''_a \rangle, \langle [M'']_B, i''_b, t''_b \rangle : D'
\end{array}
}
{\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M'']_A, i''_a, t''_a \rangle, \langle [M'']_B, i''_b, t''_b \rangle : D @ D'}
\]

**H-LocalA**

\[
\dfrac{(i, t) \rightarrow (\epsilon, \epsilon) \qquad i_a = i @ i'_a \qquad t_a = t @ t'_a}
{\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M]_A, i'_a, t'_a \rangle, \langle [M]_B, i_b, t_b \rangle : \epsilon}
\]

**H-LocalB**

\[
\dfrac{(i, t) \rightarrow (\epsilon, \epsilon) \qquad i_b = i @ i'_b \qquad t_b = t @ t'_b}
{\langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i_b, t_b \rangle \leadsto \langle [M]_A, i_a, t_a \rangle, \langle [M]_B, i'_b, t'_b \rangle : \epsilon}
\]
Figure D.2: Hybrid Protocol $\pi^G$ (Part II)