Political Artifacts and Personal Privacy: The Yenta Multi-Agent Distributed Matchmaking System by Leonard Newton Foner The following people served as readers for this thesis: Reader Peter G. Neumann Principal Scientist Computer Science Lab SRI International Reader Deborah Hurley Director, Harvard Information Infrastructure Project Kennedy School of Government Harvard University Reader Henry Jenkins Professor of Literature Director, Film and Media Studies Program MIT Literature Department Acknowledgments This work could never have happened without the support and assistance of many people. First and foremost, I thank my advisor, Pattie Maes, for her invaluable advice and encouragement in the years we have worked together. I also thank the rest of my committee -- Peter Neumann, Deborah Hurley, and Henry Jenkins -- for their attention and advice. I am forever grateful to Lisa Kamm for her unflagging friendship and support, and for her invaluable legal and political acumen. I am also deeply indebted to Michele Evard for her friendship and encouragement, and for helping to pass on the oral tradition that is so much a part of a Media Lab dissertation. A large number of people contributed in one way or another to the development of Yenta and its ancillary systems. I thank Brad Rhodes for his friendship, for important feedback on certain aspects of Yenta's design, and for his development of the Remembrance Agent, whose document comparison engine has been passed back and forth, rewritten, and rearranged innumerable times between us and among several of our UROPs, whom I also thank. Barry Crabtree, of British Telecom, was enthusiastic about Yenta from the beginning, not only contributing to an early prototype, but also in arranging for gorgeous animations from simulations of Yenta's network behavior. Undergraduates, as part of MIT's UROP program, contribute mightily to many research projects and help make MIT what it is. 
Daniel Barkalow and Aaron Ucko have spent untold hours doing first-rate work on Yenta's code. Without their help, Yenta may never have been finished. They have my highest commendation and my most heartfelt thanks. In addition, Sofya Raskhodnikova, Edward Kogan, Bayard Wenzel, Aditya Prabhakar, and Katie King have made important contributions to one part or another of Yenta. I thank also Abhay Saxena, Peter Davis, Brian Sniffen, and Pamela Mukerji. Tomoko Akiba created Yenta's wonderful surrealistic icons, and Maggie Oh made its logo. Ray Lee wrote an excellent original prototype for Yvette, and Ivan Nestlerode upgraded and polished it until it was ready for prime time. I also thank the authors of SSLeay, SCM, autoconf, automake, and gcc, without which this project could not even have been contemplated. Finally, I would like to thank the many people not already mentioned above who have reviewed copies of this manuscript and provided comments on it, including David Anderson, Judy Anderson, Marlena Erdos, and David Bridgham. Political Artifacts and Personal Privacy: The Yenta Multi-Agent Distributed Matchmaking System by Leonard Newton Foner SB Electrical Engineering and Computer Science Massachusetts Institute of Technology June 1986 SM Media Arts and Sciences Massachusetts Institute of Technology June 1994 Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in Partial Fulfillment of the requirements of the degree of DOCTOR OF PHILOSOPHY at the Massachusetts Institute of Technology June 1999 © Massachusetts Institute of Technology, 1999 All Rights Reserved Signature of Author Program in Media Arts and Sciences April 30, 1999 Certified By Pattie Maes Associate Professor of Media Arts and Sciences Program in Media Arts and Sciences Accepted by Stephen A. 
Benton Chairperson Departmental Committee on Graduate Students Program in Media Arts and Sciences

Political Artifacts and Personal Privacy: The Yenta Multi-Agent Distributed Matchmaking System by Leonard Newton Foner Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on April 30, 1999 in Partial Fulfillment of the requirements of the degree of DOCTOR OF PHILOSOPHY at the Massachusetts Institute of Technology

Abstract

Technology does not exist in a social vacuum. The design and patterns of use of any particular technological artifact have implications both for the direct users of the technology, and for society at large. Decisions made by technology designers and implementors thus have political implications that are often ignored. If these implications are not made a part of the design process, the resulting effects on society can be quite undesirable.

The research advanced here therefore begins with a political decision: It is almost always a greater social good to protect personal information against unauthorized disclosure than it is to allow such disclosure. This decision is expressly in conflict with those of many businesses and government entities. Starting from this premise, a multi-agent architecture was designed that uses both strong cryptography and decentralization to enable a broad class of Internet-based software applications to handle personal information in a way that is highly resistant to disclosure. Further, the design is robust in ways that can enable users to trust it more easily: They can trust it to keep private information private, and they can trust that no single entity can take the system away from them. Thus, by starting with the explicit political goal of encouraging well-placed user trust, the research described here not only makes its social choices clear, it also demonstrates certain technical advantages over more traditional approaches.
We discuss the political and technical background of this research, and explain what sorts of applications are enabled by the multi-agent architecture proposed. We then describe a representative example of this architecture---the Yenta matchmaking system. Yenta uses the coordinated interaction of large numbers of agents to form coalitions of users across the Internet who share common interests, and then enables both one-to-one and group conversations among them. It does so with a high degree of privacy, security, and robustness, without requiring its users to place unwarranted trust in any single point in the system.

Thesis Supervisor: Pattie Maes
Title: Associate Professor, Program in Media Arts and Sciences

This work was supported in part by British Telecom and Telecom Italia.

Table of Contents

Chapter 1: Introduction
1.1 The fundamental premise
1.2 What's ahead?
1.3 What are we protecting?
1.4 The right to privacy
1.5 The problems with centralized solutions
1.6 Advantages of a decentralized solution
1.7 A brief summary of this research
1.7.1 The architecture and its sample application
1.7.2 Evaluation
1.8 Summary
Chapter 2: System Architecture
2.1 Introduction
2.2 Application traits
2.3 Application traits we are not considering
2.4 Yenta -- the sample application
2.5 The overall architecture
2.6 Determining one user's characteristics
2.7 Bootstrapping
2.8 Forming groups of users -- clustering
2.8.1 Data structures used in finding referrals and clusters
2.8.2 Referrals and clustering
2.8.3 Privacy of the information exchanged
2.9 What exactly is a cluster?
2.10 Using the resulting clusters
2.10.1 One-to-one communication
2.10.2 Broadcasting to all agents in a cluster
2.10.3 Hiding identities
2.11 Reputations
2.12 Running multiple agents on one host
2.13 Evaluation hooks
2.14 Summary
Chapter 3: Privacy and Security
3.1 Introduction
3.2 The problem
3.2.1 The threat model: what attacks may we expect?
3.2.2 How private is private?
3.2.3 Security design desiderata
3.2.4 Problems not addressed
3.3 Cryptographic techniques
3.3.1 Symmetric encryption
3.3.2 Public-key encryption
3.3.3 Cryptographic hashes
3.3.4 Key distribution
3.4 Structure of the solutions
3.4.1 The nature of identity
3.4.2 Eavesdropping
3.4.3 Malicious agents
3.4.4 Protecting the distribution
3.5 Selected additional topics
3.6 Summary
Chapter 4: The Sample Application: Yenta
4.1 Introduction
4.2 Yenta's purpose
4.3 Sample scenarios
4.4 Affordances
4.4.1 User interface
4.4.2 Yenta runs forever
4.4.3 Handles
4.4.4 Determining user interests
4.4.5 Messaging
4.4.6 Introductions
4.4.7 Reputations
4.4.8 Bookmarks
4.4.9 News
4.4.10 Help
4.4.11 Configuration
4.4.12 Other operations
4.5 Politics
4.6 Implementation details
4.6.1 The C code
4.6.2 The Scheme code
4.6.3 Dumping
4.6.4 Architectures
4.7 Determining user interests
4.7.1 Producing word vectors
4.7.2 Clustering
4.8 Security considerations
4.8.1 Encrypting connections
4.8.2 Protecting persistent state
4.8.3 Random numbers
4.9 Summary
Chapter 5: Evaluation
5.1 Introduction
5.2 Simulation results
5.3 Collecting data from Yenta
5.4 What data is collected?
5.5 A sample of results
5.5.1 Qualitative results
5.5.2 Quantitative results
5.6 Security
5.7 Risk analysis
5.7.1 Denial of service
5.7.2 Integrity and confidentiality -- protocols
5.7.3 Integrity and confidentiality -- spies
5.7.4 Contagion
5.7.5 Central servers
5.7.6 Nontechnical risks
5.8 Other applications of this architecture
5.9 Motivating adoption of the technology
5.10 Future work
5.10.1 Sociological study
5.10.2 Political evaluation
5.11 Summary
Chapter 6: Related Work
6.1 Introduction
6.2 Matchmakers
6.3 Decentralized systems
6.4 Political software and systems
6.5 Summary
Chapter 7: Conclusions
References

List of Figures

Figure 1: Yentas talk to each other and to their users' web browsers
Figure 2: Clusters and overlaps
Figure 3: Degrees of anonymity
Figure 4: Showing the user how to submit an evaluation.
Figure 5: A typical evaluation. The small bars on the left of each source line are color-coded.
Figure 6: A sampling of interests. Real users tend to have many more than shown here.
Figure 7: Recent messages received by this Yenta, and options for dealing with them.
Figure 8: A typical message, and how to reply.
Figure 9: Replying to a message.
Figure 10: Manipulating attestations.
Figure 11: Recent news about this particular Yenta.
Figure 12: A sampling of the help.
Figure 13: Adjusting internal parameters, for those who demand knobs.
Figure 14: Some infrequently-used operations
Figure 15: If Yenta is manually shut down, this is the last page it shows.
Figure 16: Some selected statistics from fielded Yentas.
Figure 17: Simulation results. See text for details.

CHAPTER 1 Introduction

1.1 The fundamental premise

Technology does not exist in a social vacuum.
The design and patterns of use of any particular technological artifact have implications both for the direct users of the technology, and for society at large. Decisions made by technology designers and implementors thus have political implications that are often ignored. If these implications are not made a part of the design process, the resulting effects on society can be quite undesirable.

The research advanced here therefore begins with a political decision: It is almost always a greater social good to protect personal information against unauthorized disclosure than it is to allow such disclosure. This decision is expressly in conflict with those of many businesses and government entities. Starting from this premise, a multi-agent architecture was designed that uses both strong cryptography and decentralization to enable a broad class of Internet-based software applications to handle personal information in a way that is highly resistant to disclosure. Further, the design is robust in ways that can enable users to trust it more easily: They can trust it to keep private information private, and they can trust that no single entity can take the system away from them. Thus, by starting with the explicit political goal of encouraging well-placed user trust, the research described here not only makes its social choices clear, it also demonstrates certain technical advantages over more traditional approaches.

We discuss the political and technical background of this research, and explain what sorts of applications are enabled by the multi-agent architecture proposed. We then describe a representative example of this architecture---the Yenta matchmaking system. Yenta uses the coordinated interaction of large numbers of agents to form coalitions of users across the Internet who share common interests, and then enables both one-to-one and group conversations among them.
It does so with a high degree of privacy, security, and robustness, without requiring its users to place unwarranted trust in any single point in the system.

The research advanced here attempts to break a false dichotomy, in which systems designers force their users to sacrifice some part of a fundamental right -- their privacy -- in order to gain some utility -- the use of the application. We demonstrate that, for a broad class of applications, which we carefully describe, this dichotomy is indeed false -- that there is no reason for users to have to make such a decision, and no reason for systems designers to force it upon them.

If systems architects understand that there is not necessarily a dichotomy between privacy and functionality, then they will no longer state a policy decision -- whether to ask users to give up a right -- as a technical decision -- one required by the nature of the technology. Casting decisions of corporate or government policy as technical decisions has confused public debate about a number of technologies. This work attempts to undo some of this confusion.

The research presented here is thus intended to serve as an exemplar. The techniques presented here, and the sample application which demonstrates them, are intended to serve as examples for other systems architects who design systems that must manipulate large quantities of personal information.

1.2 What's ahead?
In this chapter, we shall:
o Section 1.3: Describe which type of privacy we are most interested in protecting
o Section 1.4: Discuss the concept of privacy as a right, not a privilege
o Section 1.5: Show some of the technical, social, and political problems with centralized manipulation of personal information
o Section 1.6: Show some of the advantages of a decentralized solution
o Section 1.7: Discuss the components of the work presented here, specifically its architecture, the sample application of that architecture, the implementation of that application, and issues of deployment and evaluation
o Section 1.7: Briefly summarize the remaining chapters of this dissertation

Later chapters will:
o Chapter 2: Discuss the system architecture for the general case
o Chapter 3: Analyze user privacy and system security
o Chapter 4: Detail the sample application -- the matchmaking system Yenta
o Chapter 5: Discuss the evaluation of the architecture and of Yenta
o Chapter 6: Examine some related work
o Chapter 7: Draw some general conclusions

1.3 What are we protecting?

Privacy means different things to different people, and can be invoked in many contexts. We define privacy here as the protection of identifiable, personal information about a particular person from disclosure to third parties who are not the intended recipients of this information. Each term in this definition deserves explanation, which we give below. We shall also touch upon some related concepts, such as trust and anonymity, which are required in this explanation.

Protection

Protecting a piece of information means keeping it from being transmitted to certain parties. Which parties are not supposed to have the information is dependent upon the wishes of the information's owner. This process is transitive -- if party A willingly transmits some information about itself to party B, but party B then transmits this information to some party C, which A did not wish to know it, then the information has not been protected.
Such issues of transitivity thus lead to issues of trust (see below) and issues of assignment of blame -- whether the fault is in A (who trusted B not to disclose the information, and had this trust violated) or in B (who disclosed the information without authorization to C), or in both, depends on our goal in asking the question.

Identifiability / unlinkability

In many cases, disclosure of information is acceptable if the information cannot be traced to the individual to whom the information refers -- we refer to this as unlinkability. This is obvious in, for example, the United States Census, which, ideally, asks a number of questions about every citizen in the country. The answers to these questions are often considered by those who answer them to be private information, but they are willing to answer them for two reasons: The collection of the information is deemed to have utility for the country as a whole, and the collectors of the information make assurances that the information will not be identifiable, meaning that it will not be possible to know which individual answered any given question in any particular way -- the respondents are anonymous. Because the Census data is gathered in a centralized fashion, it leads to a concentration of value which makes trust an important issue: central concentrations of data are more subject to institutional abuse, and make more tempting targets for outsiders to compromise.

Particular person

Whether or not the information is about a particular person -- someone who is identifiable and is linkable to the information -- or is instead about an aggregate can make a large difference in its sensitivity to disclosure.
Aggregate information is usually considered less sensitive -- although cross-correlation between separate databases which talk about the same individuals can often be extremely effective at revealing individuals again in the data, and represents a serious threat to systems which depend for their security solely on aggregation of data [169].

Personal information

When we use the term personal information, we mean information that is known by some particular individual about himself, or which is known to some set of parties whom that individual considers to be authorized to know it. If no one else knows this information yet, the individual is said to control this information, since its disclosure to anyone else is presumably, at this moment, completely up to the individual himself. We are not referring to the situation whereby party A knows something about party B that B does not know about himself. Such situations might arise, for example, in the context of medical data which is known to a physician but has not yet been (or perhaps never will be) revealed to the patient. In this case, B cannot possibly protect this information from disclosure, for two reasons: B does not have it, and the information is known by someone who may or may not be under B's control.

Disclosure

If personal information about someone is not disclosed, then it is known only to the originator of that information. In this case, the information is still private. One of the central problems addressed by this dissertation is how to disclose certain information so that it may be used in an application, while still giving the subject control over it.

Third parties

Many existing applications which handle personal information do so by surrendering it, in one way or another, to a third party. This work attempts to demonstrate that this is not always required.
In many instances, there is no need to know -- knowledge of this information by the third party will not benefit the person whom this information is about. We usually use the term third party to mean some other entity which does not have a compelling need to know.

Intended recipients

The intended recipient of some information is the party which the subject desires to have some piece of personal information. If the set of intended recipients is empty, then the information is totally private, and, barring involuntary disclosures such as search and seizure, the information will stay private. The work presented here concerns cases where, for whatever reason, the set of intended recipients is nonempty.

Trust

Whenever private information is surrendered to an intended recipient, the subject trusts the recipient, to one degree or another, not to disclose this information to third parties. (If the subject has no trust in the recipient at all, but discloses anyway, either the subject is acting against his own best interests, or the information was not actually private to begin with -- in other words, if the information is public and it does not matter who knows it, then there is no issue of trust.) Trust can be misplaced. A robust solution in any system, social or technological, that handles private information generally specifies that trust be extended to as few entities, in as minimal a way as possible to each one. This minimizes the probability of disclosure and the degree of damage that can be done by disclosure due to a violation of the trust extended by the subject.

Anonymity and pseudonymity

In discussing unlinkability of information, such as that expected by respondents to the US Census, we mentioned that the respondents trust that they are anonymous.
To be fully anonymous is to know that information about oneself cannot be associated with one's physical extension -- the actual individual's body -- or with any other anonymous individual -- all anonymous individuals, to a first approximation, might as well be the same person. This also means that the individual's real-world personal reputation, and any identities in the virtual world (such as electronic mail identification), are similarly dissociated from the information. Full anonymity is not always possible, or desired, in all applications -- for example, most participants in a MUD are pseudonymous [20][33][49][59][60][116]. This means that they possess one or more identities, which may be distinguished from other identities in the MUD (hence are not fully anonymous), but which may not be associated with the individual's true physical extension. The remailer operated at penet.fi.net [77], for example, also used pseudonyms. There are even works of fiction whose primary focus is the mapping between pseudonyms and so-called true names in a virtual environment [176].

Reputations

The reason why the distinction between anonymity, pseudonymity, and true names matters has to do with reputations. In a loose sense, one's reputation is some collection of personally-identifiable information that is associated, across long timespans, with one's identity, and is known to a possibly-large number of others. In the absence of any sort of pseudonymous or anonymous identities, such reputations are directly associated with one's physical extension. This provides some degree of accountability for one's behavior, and can be either an advantage or a disadvantage, depending on that behavior -- those with good reputations in their community are generally afforded greater access to resources, be they social or physical capital, than those with poor reputations.
Pseudonymous and anonymous identities provide a degree of decoupling between the actions of their owners and the public identity. Such decoupling can be invaluable in cases where one wishes to take an action that might land the physical extension in trouble. This decoupling has a cost: because a pseudonym, and, particularly, an anonym, is easier to throw away than one's real name or one's body, they are often afforded a lower degree of trust by others.

A legal definition

Another way to look at the question of what we are protecting is to examine legal definitions. For a US-centric perspective, consider this definition from Black's Law Dictionary [14]:

Privacy, Right of: The right to be let alone; the right of a person to be free from unwanted publicity; and right to live without unwarranted interference by the public in matters with which the public is not necessarily concerned. Term 'right of privacy' is generic term encompassing various rights recognized to be inherent in concept of ordered liberty, and such right prevents governmental interference in intimate personal relationships or activities, freedoms of individual to make fundamental choices involving himself, his family, and his relationship with others. Industrial Foundation of the South v. Texas Indus. Acc. Bd., Tex., 540 S.W.2d 668, 679. The right of an individual (or corporation) to withhold himself and his property from public scrutiny, if he so chooses. It is said to exist only so far as its assertion is consistent with law or public policy and in a proper case equity will interfere, if there is no remedy at law, to prevent an injury threatened by the invasion of, or infringement upon, this right from motives of curiosity, gain, or malice. Federal Trade Commission v. American Tobacco Co., 264 U.S. 298, 44 S.Ct. 336, 68 L.Ed. 696.
While there is no right of privacy found in any specific guarantees of the Constitution, the Supreme Court has recognized that zones of privacy may be created by more specific constitutional guarantees and thereby impose limits on governmental power. Paul v. Davis 424 U.S. 693, 712, 96 S.Ct. 1155, 1166, 47 L.Ed.2d 405; Whalen v. Roe, 429 U.S. 589, 97 S.Ct. 869, 51 L.Ed.2d 64. See also Warren and Brandeis, The Right to Privacy, 4 Harv.L.Rev. 193. Tort actions for invasion of privacy fall into four general classes: Appropriation, consisting of appropriation, for the defendant's benefit or advantage, of the plaintiff's name or likeness. Carlisle v. Fawcett Publications, 201 Cal. App. 2d 733, 20 Cal. Rptr. 405. Intrusion [ . . . ] Public disclosure of private facts, consisting of a cause of action in publicity, of a highly objectionable kind, given to private information about the plaintiff, even though it is true and no action would lie for defamation. Melvin v. Reid 112 Cal. App. 285, 297 P. 91. [ . . . ] False light in the public eye [ . . . ]

1.4 The right to privacy

Why is personal privacy worth protecting? Is it a right, which cannot be taken away, or a privilege, to be granted or rescinded based on governmental authority?

Constitutional arguments

In the United States, there is a substantial legal basis for considering personal privacy a right, not a privilege. Consider the Fourth Amendment to the US Constitution, which reads:

The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

While this passage is the most obvious such instance in the Bill of Rights, it does not explicitly proclaim that privacy itself is a right.
There are ample other examples from Constitutional law, however, which have extended the rights granted implicitly by passages such as the Fourth Amendment above. Supreme Court Justice Brandeis, for example, writing in the 1890's and later, virtually created the concept of a Constitutional right to privacy [180]. For example, consider this quote from Olmstead v. United States [130], concerning the then-new technology of telephone wiretapping:

The evil incident to invasion of the privacy of the telephone is far greater than that involved in tampering with the mails. Whenever a telephone line is tapped, the privacy of the persons at both ends of the line is invaded, and all conversations between them upon any subject, and although proper, confidential, and privileged, may be overheard. Moreover, the tapping of one man's telephone line involves the tapping of the telephone of every other person whom he may call, or who may call him. As a means of espionage, writs of assistance and general warrants are but puny instruments of tyranny and oppression when compared with wire tapping.

Later examples supporting this view include Griswold v. Connecticut [71], in which the Supreme Court struck down a Connecticut statute making it a crime to use or counsel anyone in the use of contraceptives; and Roe v. Wade [147], which specified that there is a Constitutionally-guaranteed right to a personal sphere of privacy, which may not be breached by government intervention.

Moral and functional arguments

But the laws of the United States are not the only basis upon which one may justify a right to privacy -- for one thing, they are only valid in regions in which the United States government is sovereign.
It is the author's contention that there is a moral right to privacy, even in the absence of law to that effect, and furthermore that, even in the absence of such a right, it is a social good that personal privacy exists and is protected -- in other words, that personal privacy has a functional benefit. That is, even if one were to state that there is no legal or moral reason to be supportive of personal privacy, society functions in a more productive manner if its members are assured that personal privacy can exist. For example, there are spheres of privacy surrounding doctor/patient and attorney/client information which are viewed as so important that they are codified into the legal system of many countries. Without such assurances of confidentiality, certain information might not be exchanged, which would lead to an impairment of the utility of the consultation. One might also argue that the fear of surveillance is itself destructive, and that privacy is a requirement for many sorts of social relations. For example, consider Fried [64]:

Privacy is not just one possible means among others to insure some other value, but . . . it is necessarily related to ends and relations of the most fundamental sort: respect, love, friendship and trust. Privacy is not merely a good technique for furthering these fundamental relations; rather without privacy they are simply inconceivable.

For the purposes of this work, we shall take such moral and social-good assertions as axioms, i.e., as not requiring further justification.

Implications for systems architects

Those who design systems which handle personal information therefore have a special duty: They must not design systems which unnecessarily require, induce, persuade, or coerce individuals into giving up personal privacy in order to avail themselves of the benefits of the system being designed.
In other words, system architects have a moral, ethical, and perhaps even -- in certain European countries, which have stronger data privacy laws than the US -- legal obligation to design such systems from a standpoint that is protective of individual privacy when it is possible to do so.

There may be strong motives not to design systems in such a fashion that they are protective of personal privacy. We shall investigate some of the motives, with examples, in the next section (see Section 1.5), but overall themes include:
o It is often conceptually far simpler to design a system which centralizes information, yet such systems are often easily compromised, either through accident, malice, or subpoena.
o The architects of many systems often have an incentive to violate users' privacy, often on a large scale. The business models of many commercial entities, especially in the United States, depend on the collection of personal information in order to obtain marketing or demographic data, and many entities, such as credit bureaus, exist solely to disseminate this information to third parties. The European Union has data-protection laws forbidding this [47].
o Government intervention may dictate that users' privacy be compromised on a large scale. CALEA [21] is a single, well-known example; it requires that US telephone switch manufacturers make their switches so-called tap-ready.

Hiding policy decisions under a veil of technological necessity

An example from the Intelligent Transportation System infrastructure

In many instances, the underlying motives which lead to a system design that is likely to compromise users' privacy are hidden from view. Instead of being clearly articulated as decisions of policy, they are presented as requirements of the particular technological implementation of the system. For example, consider most Intelligent Transportation Systems [18], such as automated tollbooths which collect fees for use of roads.
These systems mount a transponder in the car, and a similar unit in the tollbooth. It is possible, using essentially the same hardware in both the cars and the tollbooths, to have either a cash-based system or a credit-based system. A cash-based system works like Metrocards in many subways -- users fill up the card with cash (in this case, cryptographically-based electronic cash in the memory of the car's transponder), and tollbooths instruct the card to debit itself, possibly using a cryptographic protocol to ensure that neither the tollbooth nor the car can easily cheat. A credit-based system, on the other hand, assigns a unique identifier to each car, linked to a driver's name and address, and the car's transponder then sends this ID to the tollbooth. Bills are sent to the user's home at the end of the month.

Cash vs credit
In other words, a cash-based system works like real, physical cash, and can be easily anonymous -- users simply go somewhere to fill up their transponders, and do not need to identify themselves if they hand over physical cash as their part of the transaction. Even if they use a telephone link and a credit card to refill their transponders at home, a particular user is not necessarily linked to a particular transponder if the cryptography is done right. And even if there is such a linkage between users and transponders, there is no need for the system as a whole to know where any particular transponder has been -- once the tollbooth decides to clear the car, there is no reason for any part of the system to remember that fact. On the other hand, a credit-based system works like a credit card -- each tollbooth must report back to some central authority that a particular transponder went through it, and it is extremely likely that which tollbooth made this report will be recorded as well.
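The contrast between the two designs can be made concrete with a small sketch. The following Python is illustrative only: every class and method name is hypothetical, and the cryptographic protocol that would protect the cash balance is omitted entirely. It aims to show just one thing, namely which data each design ends up holding.

```python
from dataclasses import dataclass, field


class InsufficientFunds(Exception):
    """Raised when a cash-based transponder cannot cover a toll."""


@dataclass
class CashTransponder:
    """Cash-based design: a prepaid balance and no identity at all."""
    balance_cents: int = 0

    def refill(self, amount_cents: int) -> None:
        self.balance_cents += amount_cents

    def debit(self, toll_cents: int) -> None:
        if toll_cents > self.balance_cents:
            raise InsufficientFunds
        self.balance_cents -= toll_cents


@dataclass
class CashTollbooth:
    toll_cents: int
    collected_cents: int = 0  # an aggregate total; no per-car record is kept

    def admit(self, transponder: CashTransponder) -> None:
        transponder.debit(self.toll_cents)
        self.collected_cents += self.toll_cents


@dataclass
class CentralBilling:
    """Credit-based design: every passage is reported to one database."""
    records: list = field(default_factory=list)

    def report(self, car_id: str, booth_id: str, when: str, cents: int) -> None:
        self.records.append((car_id, booth_id, when, cents))

    def amount_due(self, car_id: str) -> int:
        return sum(c for (i, _, _, c) in self.records if i == car_id)

    def itinerary(self, car_id: str) -> list:
        # The same records needed for billing also reveal where the car went.
        return [(b, w) for (i, b, w, _) in self.records if i == car_id]


@dataclass
class CreditTollbooth:
    booth_id: str
    toll_cents: int
    billing: CentralBilling

    def admit(self, car_id: str, when: str) -> None:
        self.billing.report(car_id, self.booth_id, when, self.toll_cents)
```

In this sketch the cash-based tollbooth never learns who passed, while the credit-based billing database can reconstruct an itinerary for any identifier as a side effect of computing the bill.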
Same hardware either way; cash is actually simpler
Both cash- and credit-based systems can use the same hardware at both the car and the tollbooth; the difference is simply one of software. In fact, the cash-based system is simpler, because each tollbooth need not communicate in real time with a central database somewhere. (Tollbooths in either system must have a way of either detaining cars with empty or missing transponders, or logging license plates for later enforcement, but the latter need not require a real-time connection for the tollbooth to function.) Furthermore, a cash-based system obviates the need for printing and mailing bills, processing collections, and so forth.

ITS RFP's implicitly assume that drivers should be tracked
Yet it is almost invariably the case that requests for proposals, issued when such systems are in the preliminary planning stages, simply assume a credit-based system, and often disallow proposals which can enable a cash-based system. This means that such systems, from the very beginning, are implicitly designed to enable tracking the movements of all drivers who use them, since, after all, each tollbooth must remember this information for billing purposes. Furthermore, drivers are likely to demand itemized bills, so they can verify the accuracy of the data. (After all, it is no longer the case that they need worry only about the contents of their local transponder -- they must worry about the central database, too.) Yet such a system can easily be used, either by someone with access to the bill mailed to an individual, or via subpoena or compromise at the central database, to stalk someone or to misuse knowledge about where the individual has been, and when.
Large-scale data mining of such systems can infringe on people's freedom of assembly, by making particular driving patterns inherently suspicious -- imagine the case whereby anyone taking an uncommon exit on a particular day and time is implicitly assumed to have been going to the nearby political rally. And even the lack of a record of a particular transit has already been used in court proceedings [18].

ITS RFP's are setting policy, not responding to technological necessity
The aim of the work presented in this dissertation is the demonstration that many, if not most, of these systems can be technically realized in forms that are as protective of users' individual privacy as one might wish. Therefore, designers of systems who fail to ensure their users' privacy are making a policy decision, not a technical one: they have decided that their users are not entitled to as much personal privacy as it is possible to provide, and are implementing this decision by virtue of the architecture of the system.

Unnecessary polarization of the terms of the debate
While it is the author's contention that most such decisions are, at best, misguided, and at worst unethical, the fact that they are often disguised as purely technical issues polarizes the debate unnecessarily and is not a social good. If some system, whose capabilities would improve the lives of its users, is falsely presented as necessarily requiring them to give up some part of a fundamental right in order to be used, then debate about whether or not to implement or use the system is likewise directed into a false dichotomy. By allowing debate to be thus polarized, and by requiring users to trade off capabilities against rights, it is the author's contention that the designers and implementors of such a system are engaging in unethical behavior.

Legitimate reasons against absolute personal privacy
There may be many legitimate reasons why absolute privacy for a system's users is undesirable.
It is not the aim of this work to assert that there are no circumstances under which personal privacy may be violated; indeed, the moral and legal framework of the vast majority of countries presupposes that there must be a balancing between the interests of the individual in complete personal privacy, and those of the state or sovereign state in revealing certain information about an individual to third parties.

This work aims to decouple technical necessity from decisions of policy
However, we should be clear about the nature of this balancing. It should be dictated by a decision-making process which is one of policy. In other words, what is the desired outcome? It should not instead be falsely driven by assertions about what the technology forces us to do. The aim of this research is to decouple these two issues, for a broad class of potential applications, and to demonstrate by example that technological issues need not force our hand when it comes to policy issues. Such a demonstration by example, it is hoped, will also make clearer the ethical implications of designing a system which is insufficiently protective of the personal privacy of its users.

1.5 The problems with centralized solutions

Why centralized solutions are handy
It is often the case that applications which must handle information from many sources choose a centralized system architecture to accomplish the computation. Using a single, central accumulation point for information can have a number of advantages for the developer:

o It is easy to know where the information is
o Many algorithms are easy to express when one may trivially map over all the data in a single operation
o There is no problem of coordination of resources -- all clients simply know where the central server is, and go there

Unfortunately, such a centralized organization has two important limitations, namely reliability and trust.
Reliability is an issue in almost any system, regardless of the kind of information it handles, whereas trust is a more serious concern in systems which must handle confidential information.

Reliability
A single, central point also implies a single point of failure. If the central point goes down, so does the entire system. Further, central points can suffer overload, which means that all clients experience slowdown at best, or failure at worst. And in systems where, for example, answering any query involves mapping over all or most of the database in a linear fashion, increasing the number of clients tends to cause load on the server to grow as O(n^2).

Because of issues like this, actual large systems, be they software, business models, or political organizations, are often divided into a hierarchical arrangement, where substantial processing is done at nodes far from any center -- if there even is a center to the entire system. For example, while typical banks are highly centralized, single entities -- there is one master database of the value of each account-holder's assets -- there is not a single central bank for the entire world. Similarly, the Internet gets a great deal of its robustness from its lack of centralization -- for example, there is not a single, central packet router somewhere that routes all packets in the entire network.

Trust
Of greater importance for this work, however, is the issue of trust. We use the definition of trust advanced in Section 1.3, namely, trust that private information will not be disclosed.

It is here that centralized systems are at their most vulnerable. By definition, they require that the subject of the information surrender it to an entity not under the subject's direct control. The recipient of this information often makes a promise not to disclose this information to unauthorized parties, but this promise is rarely completely trustworthy.
How might trust be violated?
A simple taxonomy of ways in which the subject's trust in the recipient might be misplaced includes:

o Deception by the recipient. It is often the case that the recipient of the information is simply dishonest about the uses to which the information will be put.
o Mission creep. Information is often collected for one purpose, but then used later for another, unforeseen purpose. In many instances, there is no notification to the original subjects that such repurposing has taken place, nor methods for the subjects to refuse such repurposing. For example, the US Postal Service sells address information to direct marketers and other junk-mailers -- it gets this information when people file change-of-address forms, and it neither mentions this on the form, nor provides any mechanism for users to opt out. Often, the organization itself fails to realize the extent of such creep, since it may take place slowly, or only in combination with other, seemingly-separate data-collection efforts that do not lead to creep except when combined. Indeed, the US Federal Privacy Act of 1974 [175] recognizes that such mission creep can and does take place, and explicitly forbids the US government from using information collected for one purpose for a different purpose -- how the Postal Service is allowed to sell change-of-address orders to advertisers is thus an interesting question. Note, of course, that this Act only forbids the government from doing this -- private corporations and individuals are not so enjoined.
o Accidental disclosure. Accidents happen all the time. Paper that should have been shredded is thrown away unshredded, where it is then extracted from the trash and read. Laptops are sold at auction with private information still on their disks. Computers get stolen.
In one famous case in March 1998, it was revealed that GTE had inadvertently disclosed at least 50,000 unlisted telephone numbers in the southern California area -- an area in which half of all subscribers pay to have unlisted numbers. The disclosure occurred in over 9000 phonebooks leased to telemarketing firms, and GTE then attempted to conceal the mistake from its customers while it attempted to retrieve the books. The California Public Utilities Commission had the authority to fine GTE $20,000 per name disclosed, an enormous $1B penalty that was not actually imposed [9]. In March of 1999, AT&T accidentally disclosed 1800 email addresses to each other as part of an unsolicited electronic commercial mailing; Nissan did likewise with 24,000 [26].
o Disclosure by malicious intent. Information can be stolen from those authorized to have it by those intent on disseminating it elsewhere. Examples from popular media reports include IRS employees poking through the files of famous people, and occasionally making the information public outside of the IRS [173]. Crackers, who break into others' computer systems, may also reveal information that the recipient tried to keep private. There is often significant commercial value in the deliberate disclosure of other companies' data; industrial espionage and related activities can involve determined, well-funded, skilled adversaries whose intent is to compromise corporate secrets -- perhaps to do some stock manipulation or trading based on this -- or to reveal information about executives which may be deemed damaging enough to be used for blackmail or to force a resignation. Intelligence agencies may extract information by a variety of means, and entities which fail to exercise due diligence in strongly encrypting information -- or which are prevented from using strong-enough encryption by rule of law -- may have information disclosed while it is being transmitted or stored.
o Subpoenas.
Even though an entity may take extravagant care to protect information in its possession, it may still be legally required to surrender this information via a subpoena. For example, Federal Express receives several hundred subpoenas a day for its shipping records [178] -- an unfortunate situation which is not generally advertised to its customers. This leads to a very powerful general principle: If you don't want to be subpoenaed for something, don't collect it in the first place. Many corporations have growing concerns about the archiving of electronic mail, for example, and are increasingly adopting policies dictating its deletion after a certain interval. The Microsoft antitrust action conducted by the US Department of Justice, for example, entered a great many electronic mail messages into evidence in late 1998, and these are serving as excellent examples of when too much institutional memory can be a danger to the institution.

This is hardly a complete list, and many more citations could be provided to demonstrate that these sorts of things happen all the time. The point here is not a complete itemization of all possible privacy violations -- such a list would be immense, and far beyond the scope of this work -- but simply to demonstrate that the issue of trusting third parties with private information can be fraught with peril.

Is this software, or a business model?
Note that the discussion above is not limited to software systems. Replace algorithm with business practice, client with customer, and central server with vendor, and you have the system architecture of most customer/vendor arrangements. However, we shall not further investigate these structural similarities, except to point out that business models themselves often have a profound impact on the architecture of an application.

1.6 Advantages of a decentralized solution

Decentralized solutions can assist with both reliability and trust.
Let us briefly examine reliability, and consider a system which does not contain a single, central, physical point whose destruction results in the destruction of the system. By definition, therefore, a single, physical point of failure cannot destroy this system. This says nothing about the system's ability to survive multiple points of failure, nor about its ability to survive a single architectural failure (which may have been replicated into every part of the resulting system), but it does tend to imply that particular, common failure modes of single physical objects -- theft, fire, breakdown, accidents -- are much less likely to lead to failure of the system as a whole. This is nothing new; it is simply good engineering common sense.

The issue of trust takes more examination. If we can build a system in which personal data is distributed, and in which, therefore, no single point in the system possesses all of the personal data being handled, then we limit the amount of damage -- disclosure -- that can be accomplished by any single entity, which presumably cannot control all elements of the system simultaneously. Systems which are physically distributed, for example, multiply the work factor required to accomplish a physical compromise of their security by the number of distinct locations involved. Similarly, systems which distribute their data across multiple administrative boundaries multiply the work factor required by an adversary to compromise all of the data stored. In the extreme case, for example, a system which distributes data across multiple sovereigns (e.g., governments) can help ensure that no single subpoena, no matter how broad, can compromise all data -- instead, multiple governments must collude to gain lawful access to the data.

Cypherpunk remailer chains
Cypherpunk remailer chains [10][23][66] are an example of using multiple sovereigns.
A remailer chain operates by encrypting a message to its final recipient, but then handing it off to a series of intermediate nodes, ideally requiring transmission across multiple country boundaries. In one common implementation, each hop's address is only decodable by the hop immediately before it, so it is not possible to determine, either before or after the fact, the chain of hops that the message went through. Properly implemented, no single government could thereby compromise the privacy of even a single message in the system, because not all hops would be within the zone of authority of any single government.

Costs of a decentralized solution
Of course, as applied to the applications we examine in this dissertation, the advantages of a decentralized solution do not come for free. They require pushing intelligence to the leaves -- in other words, that the users whose information we are trying to protect have access to their own computers, under their own control. Decentralized systems are also somewhat more technically complicated than centralized solutions, particularly when it comes to coordination of multiple entities -- for example, how are the entities supposed to find each other in the first place? And such solutions may not work for all applications formerly handled by centralized solutions, but only for those that share particular characteristics. We will investigate each of these issues in later chapters.

1.7 A brief summary of this research

The purpose of the work in this dissertation is to demonstrate that, for a class of similar applications, useful work that requires knowledge of others' private information may nevertheless be accomplished without requiring any trust in a central point, and without requiring very much trust in any single point of the system. In short, such a system is robust against violations of trust, unlike most centralized systems.
The work is therefore divided into several aspects, which will be discussed more fully in the chapters that follow, and which are summarized in this section:

o Chapters 2 and 3: An architecture which specifies the general class of applications for which we are proposing a solution -- what characteristics are common to those applications which we claim to assist? This architecture also includes our threat model -- what types of attacks against user privacy we expect, which of those attacks we propose to address, and how we will address them.
o Chapter 4: A sample implementation of this architecture -- the matchmaking system Yenta.
o Chapter 5: Evaluation of the sample application as deployed, an analysis of the risks that remain in the design and implementation, and some speculations on how certain other applications could be implemented using the architecture we describe.
o Chapter 6: An examination of related work, both with regard to privacy protection via architecture, and the sample application's domain of matchmaking.

1.7.1 The architecture and its sample application

We present a general architecture for a broad class of applications. The architecture is designed to avoid centralizing information in any particular place, while allowing multiple agents to collaborate using information that each of them possesses. This collaboration is designed to form groups of agents whose users all share some set of characteristics. The architecture we describe is particularly useful for protecting personal information from unauthorized disclosure, but it also has advantages in terms of robustness and avoidance of single points of physical failure. In the description below, the architecture and the sample application described in this dissertation -- Yenta -- are described together.
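As a rough illustration of how agents might form groups with no central server, the sketch below has a seeking agent start at any peer and repeatedly ask for a referral, moving on only while each referral is a strictly better match. This is a toy under stated assumptions and not Yenta's actual code: the Jaccard overlap of interest sets stands in for the similarity metric, all names are hypothetical, and interests are compared in the clear, which a privacy-protective system would not do.

```python
from dataclasses import dataclass, field
from typing import Optional


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap of two interest sets (a stand-in similarity metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Agent:
    name: str
    interests: frozenset
    known_peers: list = field(default_factory=list)  # other Agent objects

    def refer(self, seeker: "Agent") -> Optional["Agent"]:
        """Return the peer this agent knows that best matches the seeker."""
        candidates = [p for p in self.known_peers if p is not seeker]
        if not candidates:
            return None
        return max(candidates,
                   key=lambda p: similarity(p.interests, seeker.interests))


def find_match(seeker: Agent, start: Agent, max_hops: int = 10) -> Agent:
    """Follow referrals while each one is a strictly better match."""
    current = start
    for _ in range(max_hops):
        referral = current.refer(seeker)
        if referral is None:
            break
        better = (similarity(referral.interests, seeker.interests)
                  > similarity(current.interests, seeker.interests))
        if not better:
            break  # a local maximum: no known peer improves on this one
        current = referral
    return current
```

Because each step must strictly improve the match, the walk terminates even without the hop limit; like any hill-climbing search, it can also stop at a local maximum when no peer known to the current agent is a better match.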
Such an architecture assumes several traits shared by applications which make use of it, of which the most important are the existence of a peer application for each user who wishes to participate, running on the user's own workstation; the availability of a network; the availability of good cryptography; and a similarity metric which can be used to compare some characteristic of users to each other and which enables a partial ordering of similarity. The architecture derives much of its strength from its completely decentralized nature -- no part of it need reside on a central server. Users are pseudonymous by default, and agents are assumed to be long-lasting, with permanent state that survives crashes and shutdowns. Individual agents participate in a hill-climbing, word-of-mouth exchange, in which they exchange messages between pairs of themselves -- with no central server participating in such exchanges. Agents which find themselves to be closely matched form clusters of similar other agents. An agent which is not well-matched to a peer can ask the peer for a referral to some other agent which is a better match, hence using word-of-mouth, based on the above partial ordering of similarities, to aid in the search for a compatible group of other agents.

Once clusters have been formed, agents may send messages into the clusters, communicating either one-to-one or one-to-many. Yenta uses this capability to enable users to have both private and public conversations with each other. Particularly close matches can cause one of the participating agents to suggest that the two users be introduced, even if the users have not previously exchanged messages -- this helps those who never send public messages to participate.

We carefully discuss the threat model facing the architecture and the sample application, discussing which attacks are expected and the measures taken to defend against them.
We also discuss what sorts of attacks are considered outside the scope of this research and for which we offer no solution. Strong cryptography is used in many places in the design, both to enable confidentiality and authenticity of communications, and as the infrastructure for a system designed to enable persistent personal reputations. Because public evaluation can make systems significantly more robust and more secure, a separate system, named Yvette, was created to make it easier for multiple programmers to publicly evaluate Yenta's implementation; Yvette is not specialized to Yenta and may be used to evaluate any system whose source code is public.

1.7.2 Evaluation

The architecture and the sample application have been evaluated in several ways, including via simulation and via a pilot deployment to real users. The qualitative and quantitative results obtained demonstrate that the system performs well and meets its design goals. In addition, several other applications which might make use of the underlying architecture are possible and speculations on how they might be implemented are briefly described. We also perform a risk analysis of Yenta and describe potential security risks, including some which are explicitly outside of our threat model.

Finally, we describe related work, which includes other types of matchmaking systems, other decentralized systems, and other systems and software that have been designed for explicitly political purposes. We then draw some general conclusions.

1.8 Summary

This chapter has presented the social and political motivations for this work, namely the protection of certain civil liberties, such as privacy, by starting with such motivations and then designing technology that can help. We have described what personal privacy and its protection means, demonstrated some of the social, political, and technical problems with centralized solutions, and touched upon some of the advantages of decentralized solutions.
We have then summarized, very briefly, the work that will be presented in later chapters.

CHAPTER 2
System Architecture

2.1 Introduction

In this chapter, we present a general architecture for a broad class of applications. As discussed in Chapter 1, the architecture is designed to avoid centralizing information in any particular place, while allowing programs run by multiple users to collaborate by using information that each of them possesses. Such an architecture is particularly useful for protecting personal information from unauthorized disclosure, but it also has some advantages in terms of robustness against various types of failure, including single points of physical failure.

This chapter will describe the architecture by addressing the following topics:

o Section 2.2: The traits shared by the applications we are considering
o Section 2.3: The problems we are not addressing in the space of possible applications
o Section 2.4: For concreteness, our sample application
o Section 2.5: The overall architecture proposed
o Section 2.6: Determining one user's characteristics
o Section 2.7: Bootstrapping
o Section 2.8: Forming groups of agents, including:
  o Section 2.8.1: Data structures used in clustering
  o Section 2.8.2: Getting referrals
  o Section 2.8.3: Privacy of the information exchanged
o Section 2.9: Further clarification on the exact nature of a cluster
o Section 2.10: Some uses for the resulting groups
o Section 2.11: Reputation systems
o Section 2.12: Running more than one copy of the application on a single host
o Section 2.13: Hooks for collecting evaluation data

As discussed in Chapter 1, we obtain a large amount of our privacy and security protection from a decentralized architecture; that architecture is discussed in this chapter. We obtain other elements of protection from the techniques and principles advanced in Chapter 3; that chapter is heavily dependent upon this one.
Some technical privacy issues are explained in this chapter
In a few sections of this chapter, we delve into particular aspects of privacy and security in advance of Chapter 3's coverage. We do so because certain strategies for protecting user privacy are more easily explained near the description of some architectural feature than they are in a separate chapter.

Several issues are deferred
This architectural description defers several topics to later chapters. Some of the design decisions made here will be clearer when the entire picture has been presented. In particular, later chapters will specify:

o Chapter 3: How the privacy and security of the architecture really work
o Chapter 4: Details of how the sample application, Yenta, makes use of this architecture
o Chapter 5: How to evaluate how the system as a whole is performing
o Chapter 5: Other applications besides the sample application

2.2 Application traits

In the discussion that follows, we take user to be some individual person, application to be some particular user task which is implemented by running a program, and system to be a set of interconnected users, all running copies of some piece of code that implements the application. A familiar example of such a definition would be the Internet mail system, which consists of users all running applications (mail readers) which all do the same task, even though the applications themselves are not all identical -- they run on different computers, come from different vendors, and have a different set of features which they implement. Note that the Internet mail system does not quite fit the definition given below of the applications we support; it serves only to make clear what we mean by user, application, and system.

Systems, applications, users, instances, and agents
For clarity, let us distinguish between the concepts of an application and an instance of an application.
The application itself is the body of code that users may run; it is the same for all users who run the same version. The instance of that application is the individual copy that any given user is running on some machine, and includes whatever personalized state may exist for the user. In the discussion that follows, we refer to an individual instance of some running application as an agent. (Some examples and definitions of agents may be found in [16][27][30][31][45][46][59][60][88][98][101][106][112][113][114][143][159][160][162][163][164] -- and many which are not listed there are mentioned at appropriate points elsewhere in this dissertation.) We define an agent here to be a semiautonomous piece of software running on a particular computer, which may be personalized and has long-term state. We do not consider anthropomorphism or the ability to move the thread of control to another machine (e.g., process migration) to be a part of the definition we use here. The application is implemented by users running a distributed system of agents.

Let us turn to the traits which are shared by all the applications we are considering. Later sections will justify some of the assumptions and limitations.

o More than one user exists in the system. If there is only one user running the application, then we do not consider it a system.
o The users, and the agents they run, are all peers of each other. There is no distinguished user or agent, and no pre-established hierarchy.
o The application requires that some of its users wish to interact with some of the other users, by sharing some information between them.
o Not every user, nor his or her agent, need know about every other user or agent, nor does any user or agent require complete information about all other users or agents.
o It is appropriate to group users, on the basis of some attribute, into clusters which all share, to some extent, that attribute.
Any given user might be in more than one cluster simultaneously, depending on the user's attributes.
o It is possible to form a partial order among user characteristics, such that we can say that some characteristic of user A is more like that of user B than that of user C.
o It is likely that at least some of the information in the system should be protected from disclosure to others, either inside the system or outside of it.
o Each of the users of the system can run their own copy of the application, on some computer at least nominally under their own control.
o The users are connected via a high-availability network, such as the Internet.

If there is no way to compare user characteristics, and no way to group users into even approximate clusters based on similarity of those characteristics, then many of the assumptions of our architectural model are violated. In particular, the architecture assumes that it can climb a gradient in order to form clusters (see Section 2.8), and that many operations are restricted to users in a particular cluster. If these are not true, then the architecture may not work very well. (Whether it works well enough even if some assumptions are violated is dependent upon exactly what the application is; we shall not further investigate what the properties of such an application might be.)

Because we are assuming that there exists information in the system that should be protected against others, and because of the arguments advanced in Chapter 1, particularly in Section 1.5, about the problems of trust when it comes to centralized systems, we assume that users must have the ability to do local processing of information they consider to be confidential. This requires that users have access to a computer that can run the application, and which they may be reasonably assured is under their administrative control, not that of some third party.
Systems in which users must do computation in environments they do not control are explicitly not addressed by this work.

The applications we are considering are based around the controlled sharing of information between users. To this end, we assume that there is some way for the users' agents to actually communicate with each other, such that we define the set of agents as a system. For simplicity of discussion, we assume that this requires a network linking all agents in close to real time, e.g., the Internet. Generalizations of the fundamental architecture can certainly be made for store-and-forward networks, such as are usually assumed for mail transfer systems, and for systems in which users are only infrequently connected -- such as home users who only occasionally dial up to talk to the network -- but we shall not explicitly address those considerations here. Most of the architecture we present is still usable in such a system, albeit with much greater delays in transactions between agents. Such delays may make the applications inconvenient to use in practice, even if they are theoretically still functional.

2.3 Application traits we are not considering

It is clear that the criteria above do not apply to all possible applications. For example, if there is only one user running the application, then we do not consider it a system at all. And if no user needs any information from any other user, then again it is not a system, because the individual copies of the application do not interact with each other, and are running standalone, in a disconnected configuration.

By the same token, we assume that, even though users must communicate with each other, we never have 1-to-n or n-to-n interactions, where n is the number of users or agents in the system. There are two reasons to disallow such scenarios:
o Robustness.
Systems in which any entity, or all entities, must see every other entity in the system tend to become extremely fragile as the number of entities grows. One way to see this, in a distributed system, is to take as given some probability p that any single entity will be offline for some reason -- such as crashes, network disconnections, and so forth. We assume that there is no redundancy (all entities must be online and known), that failures are independent of each other, and that there are n entities in the system. This means that the chance that the system as a whole is working is (1 - p)^n, which scales exponentially poorly with n. Clearly, such a system will almost never function if n is large and p is not very close to zero.
o Security. Implementing the system as a non-distributed, i.e., centralized, system can help with performance -- if the central node is up, then presumably all information about all entities is known at that time and may be used. However, this still has unfortunate implications for security, since we have now established a single point of failure at which all entities' information may be compromised. If the system is instead decentralized, but all entities must still know all other entities' information, then the number of points where all entities' privacy may be compromised has now risen to n, the number of entities in the system. The situation is now worse, not better. We shall have much more to say about the security implications of our assumptions in Chapter 3.

2.4 Yenta -- the sample application

For concreteness, let us mention here the sample application -- Yenta -- that has been developed. Yenta was developed both to test the architecture, and to serve as advertisement and role model for the technique. (Recall, from Chapter 1, that the purpose here is to encourage other developers and systems architects to use these techniques to avoid depriving users of their privacy in those other applications.)
We will give much more information about Yenta's operation in Chapter 4 -- this is only a very brief summary.

Yenta is a matchmaking system. Yenta is not necessarily a romantic matchmaker. Instead, it is designed to facilitate serendipitous introductions of people who may or may not know each other, and to support group interaction among users who share common interests. Two possible scenarios of Yenta's use are:
o Inside a company. Many organizations have the problem that people who should know what one another are doing do not. This commonly happens when two people are working on a similar problem, but report to different managers. In this case, it may be that the common point in their reporting structure is sufficiently high in the hierarchy that it fails to allow either of the two individuals to know about the other's work. While one might hope that the two individuals might meet accidentally and happen to mention their work to each other, such an event is not assured. (Even if the two do meet, they may fail to mention their common interest -- it is rare that people regale each other with a list of everything they are working on at the moment.) Yenta aims to help, by serving as an introducer for these two, based on this common interest.
o Among people who have never met. Here, the problem is one of attention and interactional bandwidth. Even if we assume, for instance, that people who share a similar interest happen to both be on the same mailing list or Usenet newsgroup, not everyone posts. Indeed, if everyone did post, traffic volume might be so high that keeping up with the discussion might prove impossible. Yenta aims to help introduce lurkers -- those who rarely or never post -- to others who share their interests, without forcing them to speak publicly, and without subjecting everyone to the resulting traffic.

Each user runs his or her own copy of Yenta.
Each Yenta determines its user's interests by scanning his or her electronic mail and files -- this is one of many reasons why Chapter 3's discussion of privacy and security is so important. Agents join clusters of others, whose users share one or more interests, and users may send messages to individuals in the cluster or to the entire cluster as a whole. Users are pseudonymous, and their identities are never revealed by Yenta itself. (If a user sends a message to another that explicitly states his or her identity, that is not Yenta's concern.) Because pseudonyms are the norm, Yenta also makes available a reputation system to aid in determining whether to accept an introduction to another user, to help provide some context in interpreting another user's messages, or to enable automatic rejection of messages from users whose reputations are not good enough.

2.5 The overall architecture

The overall system architecture is a distributed, multi-agent system. Each user runs his or her own copy of the application -- an agent. The agent has access to persistent storage on the user's computer, e.g., a filesystem. This filesystem is used to store state across crashes and shutdowns. It may also be used for other purposes -- for example, in Yenta, it is used as the source of the user's interests.

The agent is assumed to run for long periods of time -- effectively indefinitely -- rather than being started up and shut down soon thereafter. It is thus assumed to be available to the user, and the rest of the network, most or all of the time.

All communications and on-disk storage are assumed to be encrypted; Chapter 3 has much more to say about this requirement.

Agents communicate with each other by opening connections to each other across the network (using TCP [135] except in certain unusual circumstances, as below).
Since not all copies of the given application should be assumed to be the same version, agents should identify themselves early in any given communication by specifying their current version information, a list of protocols or operations handled, or both -- this aids in interoperability, allowing newer agents to be backwards-compatible with older agents where feasible.

Each agent must also be able to communicate with its user. We assume, for simplicity, that the user possesses a web browser, and the agent speaks HTTP [12][52] to that browser. This greatly simplifies design of the application, since emitting HTTP is a much easier implementation challenge than the engineering that goes into the typical browser.

A diagram of the basic structure appears below.

Figure 1: Yentas talk to each other and to their users' web browsers

2.6 Determining one user's characteristics

The architecture assumes that users have particular characteristics that make them suitable candidates for clustering into groups. Members of the group share at least one characteristic, to some degree, in common. How these characteristics are determined is in large part application-specific; we discuss the case for Yenta in Section 4.7.

An example from Yenta
We assume that these characteristics are comparable in some algorithmic fashion. We specified this in Section 2.2 when we said that we must have a partial order available in comparing one user's characteristics to another. In the case of Yenta (see Section 4.4.4), these characteristics are sets of weighted vectors of keywords, and the comparison is performed by dotting vectors together.

Any given user may have several characteristics. For example, in Yenta, any given user is presumed to have several interests at the same time.
These characteristics are assumed to be sufficiently different from each other that our comparison function considers them dissimilar from each other -- if this were not the case, then at least two of these characteristics should be merged into a single characteristic.

2.7 Bootstrapping

When an agent is starting up for the very first time, it may not know, a priori, of any other agents for the application. In this case, it may use a bootstrapping phase in which it undergoes a discovery process that finds at least one other instance of the application. After this bootstrapping phase is accomplished, it need not be repeated.

This bootstrapping process can take many forms. Examples include:
o Broadcasting on the local network segment, for networks that support broadcasts
o Asking the user for any other machines known to be running the application
o Having existing agents periodically register their existence with a central server -- the bootserver -- and having newly-created agents ask this server for possibilities

Security of the bootserver
Yenta uses all three of these strategies. We shall have more to say about the security implications of this in Chapter 3; however, note for the moment that the only relevant aspect of this bootstrapping phase is that the agent find any other instance of itself with which to communicate. That instance need not share any of the user's characteristics. This makes design of the bootstrap server both simple and secure, since it need not maintain any identifiable user information, except the IP address at which some agent was found recently -- for most applications, this is not a serious infringement upon user privacy. If the database is accidentally destroyed, it will be regenerated as running agents periodically register. The central server may also, of course, be specific to a particular organization if desired, rather than there being a single such server on the entire Internet.
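As an illustration of how little state such a bootserver needs, the following Python sketch keeps only the address at which an agent last registered and a timestamp. It is not Yenta's implementation: the class name `BootServer`, its methods, and the one-hour expiry are assumptions made for the example.

```python
import random
import time


class BootServer:
    """Hypothetical registry: remembers only where agents were seen recently."""

    def __init__(self, max_age_seconds=3600.0):  # expiry period is an assumption
        self.max_age_seconds = max_age_seconds
        self._last_seen = {}  # (host, port) -> time of most recent registration

    def register(self, address, now=None):
        """Record or refresh an agent's address; no user data is stored."""
        self._last_seen[address] = time.time() if now is None else now

    def _expire(self, now):
        """Forget agents that have not registered within the allowed age."""
        cutoff = now - self.max_age_seconds
        self._last_seen = {a: t for a, t in self._last_seen.items() if t >= cutoff}

    def candidates(self, count, now=None, rng=random):
        """Return up to `count` recently seen addresses, chosen at random."""
        now = time.time() if now is None else now
        self._expire(now)
        known = sorted(self._last_seen)
        return rng.sample(known, min(count, len(known)))
```

Because agents re-register periodically, losing the table only delays bootstrapping until the next round of registrations, which matches the regeneration property described above.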
Note that if the application being considered is so ubiquitously deployed that the chances are very high of another one of its agents existing on the local broadcast network segment, or of a new user already knowing of another agent, the central server becomes redundant.

Bootstrap broadcasts are very different from cluster broadcasts
Be aware that agent broadcasts, used in the sense we mean here for bootstrapping, are not the same sort of mechanism that we specify in Section 2.10, when we talk about communicating with a group of other agents. This is an important distinction:
o Cluster broadcasts, as described in Section 2.10, use encrypted, point-to-point transmission of messages, which are then recursively flooded to neighboring agents using the same mechanism. The flooding algorithm is designed to prevent loops by detecting graph cycles. Messages are transmitted via TCP [135].
o Bootstrap broadcasts, as described here, use cleartext, broadcast-medium transmission. On IP networks, this is accomplished via UDP [134], since UDP supports broadcast, whereas TCP does not. Since we are not transmitting any personal information in a bootstrap broadcast -- indeed, since the broadcasting agent may not have any yet -- and since the message is intended for maximum reception, we do not encrypt its contents.

Broadcast responders must wait a random time before responding!
For broadcasting to work, all agents must be prepared to listen for, and respond to, bootstrap broadcasts. In general, both broadcast requests and replies should include information about the application -- to enable multiple applications to share the same port -- and its version -- to enable backwards-compatibility with older applications. In addition, listeners on Ethernet-like [82] networks must implement random delay in their responses, so as to avoid a packet storm due to collisions on the wire caused by many agents responding at exactly the same time.
Ethernet implementations are generally designed to incorporate random exponential backoff, such that collisions cause all transmitters to wait a random, exponentially-increasing amount of time before each retransmission, but such packet storms can still last tens of seconds on a network segment with many responders. In the case of Yenta, for example, agents responding to a broadcast wait a random time, continuously and uniformly distributed between 0 and 2 seconds, before responding to any request. Since transmitting a packet takes between 10 and 100 microseconds, the chances of many responses colliding are negligible.

2.8 Forming groups of users -- clustering

We now come to the core idea which makes our distributed system function, namely how agents are supposed to find each other and how they organize into clusters.

Any given agent starts knowing at least one other agent, via the bootstrapping mechanisms described in Section 2.7 above. Agents then use one-to-one communication of their characteristics, and a referral algorithm, to find suitable clusters.

2.8.1 Data structures used in finding referrals and clusters

For concreteness, assume that we have two agents, named A and B, which each have a few characteristics associated with them, e.g., CA0, CA1, etc. Each of these characteristics describes something about the agent's user. Each agent also contains several other data structures:
o A cluster cache, CC, which contains, for each characteristic, the names of all other agents currently known by some particular agent as being in the same cluster for that characteristic. Thus, if agent A knows that its characteristic 1 is similar to characteristic 3 of agent B, then CCA contains an entry linking CA1 to CB3.
There are two important limits to the storage consumed by such caches: the number of local characteristics, cl, that any given agent is willing to remember about itself; and the number of remote characteristics, cr, that this agent is willing to remember about other agents. The total size of CC is hence bounded by cl times cr. In an implementation that wishes to save space, limiting cr before limiting cl makes the most sense, as this limits the total number of other agents that will be remembered by the local agent, while not limiting the total number of disparate characteristics belonging to the user that may be remembered by the local agent.
o A rumor cache, RC, which contains the names and other information, as described below, from the last r agents that this agent has communicated with. Implementations should bound this number, since otherwise any given agent will remember all of the agents it has ever encountered on the net and its storage consumption will grow monotonically. Reasonable values for bounds are application-specific; Yenta uses values of 20 to 100.
o A pending-contact list, PC, which is a priority-ordered list of other agents that have been discovered but which the local agent has not yet contacted.

The rumor cache contains more than just the names of other agents encountered on the network. It also contains some subset, perhaps complete, of the value of each characteristic corresponding to those agents. Exactly how much of each characteristic is stored is application-specific.

2.8.2 Referrals and clustering

Now that we have all this mechanism in place, performing referrals and clustering is relatively uncomplicated.

Comparing one agent with another
The process starts when some agent (call it A) has ascertained its user's characteristics, and has found at least one other agent (call it B) via bootstrapping. The two agents exchange characteristics. Agent A then performs a comparison of its local characteristics with those of agent B.
Agent A builds an upper-triangular matrix describing the similarities between each of its local characteristics and those locally held by B. Then it finds the highest score(s) -- i.e., closest similarity -- between any given characteristic (say, CA1) and B's characteristics. If there is no such value above a particular threshold, then the local characteristic under consideration does not match any of B's characteristics, although some other local characteristic, e.g., CA2, might match.

Note that this inter-agent similarity metric cannot, in general, assume that it knows about all or even most of the other agents on the network. Hence, algorithms which assume that they can take means or compute standard deviations to decide whether this is a particularly good match do not have the data to make this determination. Instead, the application must either use fixed thresholds, or attempt to refine its criteria after seeing some number of other agents' characteristics -- which implies that the comparison metric is nonmonotonic, i.e., that it may behave differently for different inputs based on its prior history. In the sample application -- Yenta -- a simple thresholding scheme is used.

When we are done comparing characteristics from A with characteristics from B, agent A may have found some acceptably close matches. Such matches are entered, one pair of characteristics at a time, in A's cluster cache. B is likewise doing a comparison of its characteristics with A and is entering items in its own cluster cache for its own use.

Comparisons are not symmetric
Since each agent is making its own determination of similarity, and since they may be running different versions of the application, or have different local data available -- nothing specifies that an agent must transmit all of its information about a particular characteristic to any given other agent -- they may reach different conclusions.
In other words, A may decide that B shares some characteristic with A, whereas B may not decide that it shares any characteristics with A. This asymmetry is perfectly acceptable. In the case above, it means that A will enter B in its cluster cache for some characteristic, but B will not enter A in its cluster cache for any characteristic.

Getting referrals
Whether or not any matches were found that were good enough to justify entering them in a cluster cache, the next step is to acquire referrals to agents that might be better matches. In the example here, agent A asks agent B for the entire contents of its rumor cache, and runs the same sort of comparison on those contents that it did on agent B's own local characteristics -- but with a more forgiving threshold for what constitutes a good match. For example, if the comparison metric were to return a value between 0 and 1, ranging from no match to perfect match, then the threshold used to determine whether to add some characteristic from B to A's cluster cache might be 0.9, while the threshold used to determine whether a rumor-cache match is good enough might be 0.7. The purpose of using a more forgiving threshold is to allow A to find someone else who might be reasonable, even if they aren't a great choice. Agent A will then add the agent corresponding to each such match to its pending-contact list, and will contact them in turn.

Agent A, having now acquired some likely candidates, will execute the same algorithm it just used with B: It will see if any of the agents is suitable to be added to A's cluster cache, and will also find other candidates who might be worth contacting. If the pending-contact list is kept sorted by desirability -- presumably, by sorting the pending agents to contact by the result of the comparison metric -- then A is executing a hill-climbing algorithm to find a good match.
In other words, if we model a landscape in which the height of any given hill is its similarity to some characteristic of A's, and A's current set of candidates as some point on the hillside, A should attempt to always travel in the direction of maximum upward gradient, essentially climbing hills in this space until it reaches a maximum. Note that we are climbing a different landscape, composed of different hills, for each characteristic.

Hill-climbing versus local maxima
Hill-climbing algorithms can get stuck at local maxima which are not global maxima. In practice, this appears not to happen in our sample application, either in simulation or in actual use. To get stuck at a local maximum requires that the system act thermodynamically cold, in the sense of simulated annealing. Here the metaphor is one of energy -- a marble rolling around in a potential well cannot escape this well unless it possesses enough energy to roll uphill past an adjacent peak. Similarly, one balanced on a hillside might roll into the valley, but cannot hope to reach an even higher hilltop unless something gives it extra energy. Random additions of extra energy -- which may eventually roll a marble out of a stuck state -- are thus similar to heating a system, hence we can talk about the thermodynamic temperature of a system.

Real data appears to be noisy enough that local maxima which are not global maxima are not a problem -- there is enough inaccuracy in the comparison function, and in the data it is applied to, that agents do not get stuck. Furthermore, in a real system, one might expect that agents are constantly joining (and perhaps leaving) clusters, which will also tend to disrupt many such local maxima -- it only takes one new agent that is a little better matched to knock some agent off its local maximum.
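The referral procedure of Section 2.8.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Yenta's code: each agent holds a single characteristic, the comparison metric is taken to be cosine similarity over keyword weights, the 0.9 and 0.7 thresholds are the example values from the text, and the `directory` argument is a hypothetical stand-in for opening a network connection to a named agent. Only the initiating agent's side of each exchange is modeled.

```python
import heapq
import math
from collections import deque

CLUSTER_THRESHOLD = 0.9   # example value from the text
REFERRAL_THRESHOLD = 0.7  # more forgiving example value from the text


def similarity(u, v):
    """Cosine similarity of two keyword->weight dicts (an assumed metric)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


class Agent:
    def __init__(self, name, characteristic, rumor_size=20):
        self.name = name
        self.characteristic = characteristic          # one interest, for brevity
        self.cluster_cache = set()                    # agents judged similar first-hand
        self.rumor_cache = deque(maxlen=rumor_size)   # bounded list of recent contacts
        self.pending = []                             # heap of (-score, name)
        self.seen = {name}                            # never queue the same agent twice

    def contact(self, other):
        """Compare first-hand with `other`, then harvest referrals from its rumors."""
        if similarity(self.characteristic, other.characteristic) >= CLUSTER_THRESHOLD:
            self.cluster_cache.add(other.name)
        for name, rumored in list(other.rumor_cache):
            if name in self.seen:
                continue
            score = similarity(self.characteristic, rumored)
            if score >= REFERRAL_THRESHOLD:
                self.seen.add(name)
                heapq.heappush(self.pending, (-score, name))
        # Each side now remembers the other as a recent contact.
        for a, b in ((self, other), (other, self)):
            if all(n != b.name for n, _ in a.rumor_cache):
                a.rumor_cache.append((b.name, b.characteristic))

    def climb(self, directory, first_contact):
        """Contact the most promising pending candidate first until none remain."""
        self.seen.add(first_contact.name)
        self.contact(first_contact)
        while self.pending:
            _, name = heapq.heappop(self.pending)
            self.contact(directory[name])
```

Note that a rumored characteristic only ever places an agent on the pending list; an agent enters the cluster cache solely on the basis of a direct comparison, which keeps second-hand data out of the cluster cache.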
It is entirely possible that one can generate disconnected islands of agents which do not know about each other, and there is no feasible way to completely eliminate this possibility if we assume -- as we do explicitly in Section 2.2 -- both that there is no central point in the system that knows about all agents, and that no agent is required to know about all others. However, such islands are likely to be rare, for several reasons:
o The bootstrap server (see Section 2.7) tends to tell brand-new agents about many existing agents, all over the world, which tends to ensure a wide sample of starting agents.
o It only takes one bridge between two formerly-disconnected islands to inform a large number of agents about each other's existence. The referral algorithm tends to encourage this behavior, since many agents will spread the news.

Metrics must allow a partial order
Of course, for this to work at all, the comparison metric must make available a gradient, via a partial order, as specified in Section 2.2 -- this is why the comparison function must not be a simple, binary predicate. Exactly how this metric works is application-specific, but it must return some scalar value that we can compare.

Issues of thermodynamic noise also tend to avoid pathologies, such as partial orders that lead to cycles (A>B>C>A). It may be the case that some applications can suffer from this problem; but we have not observed it here, and determining the exact conditions under which such pathologies might occur is beyond the scope of this work.

If we do not have a comparison metric which allows hill-climbing, then the referral process degenerates to a process more resembling diffusion in a gas -- each agent simply explores the space of other agents at random. Results will still be obtained in this scenario, but very slowly -- the situation goes from something approximately O(n) to O(n^2).
Another way to look at this is to imagine that each agent is walking around in some physical space: a gradient-driven process moves the agent O(n) steps from the origin, where n is the number of iterations, whereas a random process moves the agent only O(sqrt(n)) steps from the origin.

Cluster cache is not for third-party data
Note that agent A never adds some agent, say W, to its cluster cache on the basis of B's say-so. After all, B's idea of W's characteristics could be wrong for any number of reasons. For example:
o W's data might be out-of-date or otherwise stale.
o W might have deliberately omitted some data in its transmission to B, perhaps based on some aspect of B's network address or reputation (see Section 2.11).
o B's idea of W's data might not even truly belong to W at all -- see Chapter 3 for why this might be so.

For all of these reasons, we use B's rumor cache information only to add potential candidates to A's pending-contact list. When A eventually contacts any given candidate, a good match will be added to A's cluster cache in the usual way.

Referrals are like human word-of-mouth
This procedure acts somewhat like human word of mouth. If Sally asks Joe, 'What should I look for in a new stereo?' Joe may respond, 'I have no idea, but Alyson was talking to me recently about stereos and may know better.' In effect, this has put Alyson into Sally's pending-contact list (and, if Joe could quote something Alyson said that Sally found appropriate, perhaps into Sally's cluster cache as well). Sally now repeats the process with Alyson, essentially hill-climbing her way towards someone with the expertise to answer her question.

2.8.3 Privacy of the information exchanged

The description so far suffers from a number of unfortunate security problems.
For instance, when agent A sends its characteristics to agent B, B knows everything that A sees fit to tell it -- and also knows A's IP address, hence making backtracing the information to the actual user possibly very easy. Furthermore, B will propagate information about A to any third parties which may care to ask B for its rumor cache, and this will continue to be true until B decides to flush A's information from its rumor cache -- which could be never, since when to flush this information is entirely at B's discretion.

We have two strategies for avoiding this outcome: hiding the identity corresponding to any given characteristic, and mixing others' clusters into the local user's data. In practice, we do both.

Hiding identities via random reforwarding and digital mixes
We can use several strategies to hide the identity corresponding to a given characteristic. Techniques related to random reforwarding and digital mixes are discussed more extensively in Section 3.4.3. They depend both on anonymity of individual agents and the ability to broadcast into groups of agents, using keys known only to a subset.

Plausible deniability via other agents' data
One way of establishing a user's probable or possible innocence -- in the terminology of Section 3.2.2 -- without having to go to the extremes of Section 3.4.3 is by including other users' data with our own. To enable plausible deniability of characteristics, it suffices for an agent to lie. In addition to offering its own characteristics, the agent can offer some characteristics that are currently stored in its rumor cache. By definition, such characteristics are not only not those of the offering agent, but they do not even reflect any of its own characteristics accurately -- if they did, they would be in the agent's cluster cache, not its rumor cache.
The agent offering the characteristics certainly knows which ones came from its cluster cache -- and thus reflect the characteristics of its user -- and which came from the rumor cache -- and thus do not. However, the agent receiving these characteristics has no way to know.

Depending on the size of its rumor cache, the deceitful agent could easily be able to offer, say, ten times as many characteristics as it really owns. Thus, the probability of any single characteristic offered by the agent actually reflecting some characteristic of its user would be only 10%. Assuming that an agent is willing to store arbitrarily many characteristics in its rumor cache -- and is willing to subject itself and all of its peers to an arbitrary amount of work -- this percentage can be made arbitrarily low.

In order to know which characteristics actually belong to a given agent, an attacker would have to be a party to many exchanges, looking for those characteristics which are always offered -- such characteristics presumably correspond to the real characteristics of the agent's user. This attack could only work if the agent of interest either offers only subsets of its rumor cache, or runs long enough to flush entries from its rumor cache. A local eavesdropper -- one who can listen to all of the given agent's traffic -- could not accomplish this, because we assume, as advanced in Chapter 3, that all communications are routinely encrypted. Instead, the attacker would have to actually compromise many agents on the network, and each of those agents would have to interact with the target agent, for the attack to succeed. While this is possible, it violates our assumption in Section 3.2.1 that an attacker does not control an arbitrarily high proportion of all agents with which the target agent interacts.

2.9 What exactly is a cluster?
In the discussion above, we have used the term cluster as if it denotes a particular, well-defined group of agents, and as if all agents within the cluster agree on its membership. This is not in fact the case. Let us examine the meaning of a cluster more closely.

A cluster is not a simple transitive closure
Consider the point of view of a single agent A, which believes itself to be in a cluster of agents which share characteristic C. This cluster is composed of all other agents in A's cluster cache for C. It is also composed of all of their cluster-cache entries for characteristic C, and so on. In other words, if we treat the existence of some agent B in some agent A's cluster cache as a unidirectional link from A to B, then A's cluster is the transitive closure, starting from A's cluster cache for C, of all agents which are reachable by traversing these links. The links are unidirectional, i.e., forming a digraph and not a graph, because membership in a cluster cache is not guaranteed symmetric -- see Section 2.8 above.

If all agents shared exactly the same value for C, then this definition could be recursively enumerated by A, simply by walking this digraph, keeping track of which agents have been visited, in the manner of a mark-sweep garbage collector [90]. One might argue that A shouldn't walk this digraph -- this would eventually result in A having to remember every agent in its cluster, which violates the architecture criteria in Section 2.2 -- but it would at least be theoretically possible.

Characteristics are likely to be unique
However, all agents presumably do not have exactly the same value for C. We assume that characteristics may be complicated entities, capable of taking on a large number of values. For example, in Yenta -- see Chapter 4 -- characteristics are weighted vectors of keywords. In this application, the exact makeup and weighting of any vector is unlikely to be reproduced by any other agent.
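The digraph walk described above, which is theoretically possible only if all agents share the same value for C, might look like the following Python sketch. It assumes a global view that no real agent has: the hypothetical `cluster_cache` argument maps every agent's name to the names in that agent's cluster cache for one characteristic.

```python
from collections import deque


def reachable_cluster(start, cluster_cache):
    """Agents reachable from `start` by following directed cluster-cache links.

    `cluster_cache` maps an agent name to the names it lists for a single
    characteristic. Links are one-way: B appearing in A's cache does not
    imply that A appears in B's.
    """
    visited = {start}
    frontier = deque([start])
    while frontier:
        agent = frontier.popleft()
        for neighbor in cluster_cache.get(agent, ()):
            if neighbor not in visited:   # the "mark" step stops endless cycling
                visited.add(neighbor)
                frontier.append(neighbor)
    visited.discard(start)                # an agent is not listed in its own cluster
    return visited
```

With links `{"A": ["B"], "B": ["C", "A"], "C": [], "D": ["A"]}`, the walk from A reaches B and C but not D, since D's link to A points the wrong way.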
An example from Yenta

Continuing our Yenta-based example, suppose that we have three agents, each with slightly different interests. Yenta X's user is interested in cats. Y's user is interested in both cats and dogs. Z's user is interested in dogs. A schematic of this situation appears in Figure 2 below, where ellipses represent -- approximately -- the set of agents each Yenta considers to be in its own cluster. Note that the cluster names, C1-3, are for explanatory convenience only -- as we stated immediately above, clusters have no overall name of their own, but are described only by the set of agents which consider themselves to have similar characteristics.

Figure 2: Clusters and overlaps

Assume, for the sake of discussion, that the metric which compares interests looks only at overlaps in words in the keyword vectors exchanged. This means that X and Y consider themselves to be in cluster C1 (they are both interested in cats), and Y and Z consider themselves to be in cluster C2 (they are both interested in dogs). However, should X and Z consider themselves to be in the same cluster? The answer is no. X and Z are not both in C1, C2, or even some third cluster, given the interests expressed here. As far as we can tell from the comparison metric -- which states that a shared interest must involve an overlap in keywords -- X and Z are not interested in the same thing.

What is a gerrymandered cluster?

This means that X should not walk the digraph of all other agents' cluster-cache entries in order to compute which other agents are in its cluster -- to do so would incorrectly cause X to believe that Z is in cluster C1, when it most clearly is not; Z's user has no interest in cats. We refer to such an outcome -- in which X would believe that Z is in cluster C1 -- as a gerrymandered cluster.
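The three-agent example can be checked mechanically. The sketch below assumes the simplified metric stated above (any shared keyword counts as a shared interest); the weights, the dictionary layout, and the function names are invented for illustration and are not Yenta's actual comparison code.

```python
def shares_interest(a, b):
    """Local similarity decision: true when two keyword vectors share a keyword.

    Weights are ignored here, matching the simplified metric assumed in the
    example; a real metric would presumably make use of them.
    """
    return bool(set(a) & set(b))

# Weighted keyword vectors for the three example agents (weights are made up).
interests = {
    "X": {"cats": 1.0},
    "Y": {"cats": 0.6, "dogs": 0.4},
    "Z": {"dogs": 1.0},
}

def local_cluster(agent):
    """Agents that `agent` itself judges to be similar, using only its own comparison."""
    return {other for other in interests
            if other != agent and shares_interest(interests[agent], interests[other])}

for name in interests:
    print(name, "->", sorted(local_cluster(name)))
```

The relation is not transitive: X matches Y and Y matches Z, yet X's own comparison against Z fails, so Z can only enter X's view of the cluster if X accepts Y's judgment in place of its own.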
We use this term by analogy with its political use: a gerrymandered electoral district is one that has been stretched out of its natural shape -- generally one with close to minimal circumference for its area -- into one that unnaturally includes areas that seem better connected to different districts. Similarly, a gerrymandered cluster is one that unnaturally includes too many characteristics which, in reality, have nothing to do with each other. In effect, viewing interests as areas, such a cluster is stretched out in nonsensical ways.

Trusting other agents' judgments leads to gerrymandered clusters

Why would this happen? Because X, in recursively enumerating the members of cluster C1, would be trusting the judgment of Y about what an interest really means. As far as Y is concerned, it is in a single cluster, C2, which happens to specify interests which mention either cats or dogs. But this is not a view shared by either X or Z, whose interests are more restrictive.

No global ontology
No distinguished cluster names

Remember that nowhere have we stated that characteristics (in the general case) or interests (in the case of Yenta) have distinguished names or some other attribute that would make them unambiguously identifiable as being the same, or different, across all agents in the system. We have provided no central authority to impose a consistent ontology on all agents in the system. Furthermore, for all agents to reach a consensus among themselves, we would have to provide some mechanism to permit, in the limit, propagating such a proposal to the entire system and making it consistent. We have provided no such mechanism. Instead, we provide only the assurance that there exists a metric which can compare one agent's characteristics with another's and reach a local, not a global, decision about similarity of characteristics.
Thus, one agent should not trust another about what a characteristic for a third agent really means, because one agent has no assurance that another shares its ontology. All such judgments must necessarily be local -- meaning that, if X is to make a determination about whether Z shares some characteristic with it, it needs to examine Z's data directly. It cannot trust the judgment of some intermediate agent Y. This does not mean that X must communicate directly with Z to make this determination, however. As long as X may be assured that it receives a faithful copy of Z's data, no matter where this copy comes from, X may make the comparison. But it must make the comparison itself.

2.10 Using the resulting clusters

Once we have clustered agents based on characteristics shared by their users, what can we do with the resulting clusters? We shall investigate some uses of these clusters below. Applications which fit the criteria advanced in Section 2.2, but are substantially different from Yenta, may have additional uses for these clusters.

The basic operations we will investigate here concern:

o Communicating from one user to a single other user
o Broadcasting to all other users in a cluster
o Hiding the origin and destination of communications

By the end of this subsection, we shall also have derived the rationale and use for the basic components of any message transmitted -- namely, a tuple consisting of the message itself, a unique-ID, and a cluster characteristic. Many ways of presenting such messages are possible; their real-time or close to real-time nature makes it reasonable to use an email-like user interface, or something akin to Zephyr instances [1][36].

2.10.1 One-to-one communication

In the simplest scenario, one agent simply transmits a message to some other agent, using the same sort of network connections as are used to swap characteristics.
Whether or not the two agents are in the same cluster is irrelevant -- once one of the agents has found the IP address of another, a connection may be opened. However, it is presumed that most such communications are between agents which believe each other to share characteristics -- loosely, they are in the same cluster -- because we presume that users who share characteristics have the most to say to each other.

2.10.2 Broadcasting to all agents in a cluster

A more complicated scenario involves sending a message to all other agents in a cluster. In this case:

o The broadcasting protocol should be efficient, and must terminate.
o We must handle the case of gerrymandered clusters, as described in Section 2.9.

Efficiency

Efficiency in the protocol means that no one agent should be required to do all the work of communicating with all other agents in its cluster. (Indeed, as shown in Section 2.9, it cannot even determine exactly what all the other agents in the cluster are.) Hence, the way we implement broadcasts is to use a flooding algorithm, familiar from the Usenet news system [83]. When an agent wants to send a message to all other agents in its cluster, it sends it to all other known agents in its cluster cache, with instructions that the message should be forwarded to all other agents in their cluster caches, and so on recursively.

Termination

If this were the entire protocol, it would fail to terminate, because the possibility exists that there will be cycles in the digraph describing which agents are in which other agents' cluster caches. A message sent into this graph would circulate endlessly. To avoid this, messages are tagged with a unique identifier (UID), and every agent compares incoming broadcast messages with a cache of recently-seen UID's. If a message has been seen before, it is dropped immediately, and not propagated.
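The flooding-with-UID scheme can be simulated in a few lines. This is only a sketch under stated assumptions: the Agent class, the in-process queue standing in for network connections, and the unbounded seen_uids set are all simplifications introduced here, not details of the actual implementation.

```python
import uuid
from collections import deque

class Agent:
    def __init__(self, name):
        self.name = name
        self.cluster_cache = []   # agents this agent forwards broadcasts to
        self.seen_uids = set()    # recently seen UIDs (never expired in this sketch)
        self.delivered = []       # message bodies accepted by this agent

    def receive(self, uid, body):
        """Accept a broadcast once; return the peers it should be forwarded to."""
        if uid in self.seen_uids:
            return []             # duplicate: drop it and do not propagate
        self.seen_uids.add(uid)
        self.delivered.append(body)
        return list(self.cluster_cache)

def broadcast(sender, body):
    """Flood `body` outward from `sender`; return the number of transmissions made."""
    uid = uuid.uuid4().hex        # random identifier, assumed unique
    sender.seen_uids.add(uid)     # the sender drops its own message if it loops back
    pending = deque((peer, uid, body) for peer in sender.cluster_cache)
    transmissions = 0
    while pending:
        agent, uid, body = pending.popleft()
        transmissions += 1
        for peer in agent.receive(uid, body):
            pending.append((peer, uid, body))
    return transmissions

a, b, c = Agent("A"), Agent("B"), Agent("C")
a.cluster_cache = [b, c]
b.cluster_cache = [c, a]          # these links form cycles back to A
c.cluster_cache = [a, b]
print(broadcast(a, "hello"), b.delivered, c.delivered)
```

Despite the cycles, the loop ends because each agent forwards a given UID at most once, so the number of transmissions is bounded by the number of links in the digraph.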
The UID cache in each agent must preserve incoming UID's long enough that there is a low probability that the message might still be circulating by the time it is timed out of the cache. This probability need not be zero, and cannot be: if we assume bounded storage in any given agent, but also assume that any agent may receive a message, crash, and then stay down an arbitrary length of time before coming back up and attempting to send the message, then we cannot set any particular timeout that is long enough. Instead, we must merely guarantee that the effective gain of the system -- the number of messages emitted by any given agent, on average, for a single message received -- is low enough that messages are eventually damped out. If this is the case, then circulating messages will eventually vanish from the system, even though any given agent may occasionally see a duplicate message from some time far in the past. (Applications which cannot ever tolerate a duplicate message must arrange to maintain UID's forever, or must reject messages older than a certain age as part of their filtering algorithm.)

Avoiding gerrymandering

We now turn to the case of gerrymandered clusters. Consider the case of the three example Yentas described above in Section 2.9. Suppose that Yenta X wishes to broadcast to its cluster. Clearly, Y should receive such a broadcast, because the two Yentas share an interest in cats. However, Z has no interest in such a message, nor would any other Yentas in C3. This means that Z must have some way to know that it should drop the message -- otherwise, messages intended for what X considers C1 (and what Y considers C2) would also propagate into C3, and presumably far into clusters beyond as well.

To avoid this scenario, messages that are transmitted also include the characteristic which describes the cluster, from the point of view of the original sender of the message.
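A minimal sketch of this relevance check follows, reusing the cats-and-dogs example. The Message fields mirror the message, unique-ID, and characteristic tuple introduced at the start of Section 2.10; the class and function names, the use of plain keyword sets instead of weighted vectors, and the keyword-overlap test are assumptions made for illustration. UID duplicate suppression is left out to keep the sketch short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    body: str
    uid: str
    characteristic: frozenset   # the ORIGINAL sender's characteristic, never rewritten

def is_relevant(message, own_keywords):
    """Local decision: compare the original sender's characteristic with our own."""
    return bool(message.characteristic & own_keywords)

def handle(message, own_keywords, cluster_cache):
    """Return the peers to forward to, or an empty list if the message is dropped."""
    if not is_relevant(message, own_keywords):
        return []
    return list(cluster_cache)

# X broadcasts about cats; Y (cats and dogs) forwards it, Z (dogs only) drops it.
msg = Message("cat show on Saturday", "uid-1", frozenset({"cats"}))
print(handle(msg, frozenset({"cats", "dogs"}), ["Z"]))
print(handle(msg, frozenset({"dogs"}), ["W"]))
```

Because the characteristic field is frozen and forwarded unchanged, Z judges the message against X's stated interest and not against anything Y might have substituted.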
It is very important that this is the original sender's characteristic -- if this were not the case, then third-party recipients of the message (Z in our example) would again be heeding some intermediate party's idea of what a given cluster was about. Given that the characteristic is transmitted along with the message, each agent in the chain can evaluate whether the message still seems relevant to its own set of clusters. If the message is relevant to none, then it is dropped. (Note that it is possible that X's original characteristic might be deemed to match more than one cluster in some receiving agent; in that case, the message should be duplicated and broadcast into each cluster.)

In order to aid agents receiving one-to-one (non-broadcast) messages, and to make the protocol simpler by increasing commonality between the two cases, we also transmit the relevant characteristic along with the message even in the one-to-one case. We can only do this if the transmitting agent actually knows which cluster the recipient's agent is in; it may be the case that the user wishes to transmit a message to a particular agent irrespective of its cluster. In this case, no characteristic will be sent.

A complete message tuple

We have thus arrived at the complete set of tags that must accompany any given message between agents. A complete message thus consists of:

o The message itself.
o The message's UID.
o The characteristic associated with the cluster -- required if a broadcast, suggested if one-to-one.

2.10.3 Hiding identities

Let us now consider the case in which it is important to hide the identity of the sending or receiving agent. We shall investigate this case in more detail in Chapter 3, but we should point out here that this capacity is important to make available. Without the ability to hide message originators and recipients, traffic analysis may be employed to guess information about the agents in the system.
For example, given the three-Yenta scenario in Section 2.9, suppose that we are an eavesdropper who can monitor communications between agents, even though we may not be able to decrypt them. If we know, through some mechanism, that Yenta X is interested in cats, and see substantial message traffic between X and Y, we can make a reasonable guess that Y is interested in cats as well.

The easiest way to defuse this threat is to send any message for a given agent in a cluster to all agents in the cluster -- in other words, to broadcast it. Assuming that the connectivity of the cluster, and the characteristics of each agent in it, are suitable, we have an arbitrarily high probability that the target agent will receive at least one copy of the message. Obviously, if the message is also intended to be private, it must be encrypted using a key that only the recipient knows; we will address this more fully in Chapter 3. All agents which receive the broadcast attempt to decrypt it, but only the target agent possesses the correct private key; all other agents fail to decrypt the message and simply drop it. This is the general idea behind Blacknet [118], an idea suggested in the Cypherpunk community as a way to anonymously trade secrets, yet foil traffic analysis, by broadcasting any given message to the entire world via Usenet news, yet encrypting it only for its intended recipient(s).

This means that, in the general case, even one-to-one messages are broadcast. They are propagated, as part of foiling traffic analysis, by all agents which deem the message to be close enough to one of their existing clusters. Because the actual message being propagated is encrypted, it may only be read by a subset -- possibly singular -- of the agents. This is clearly not as conservative of network resources as direct, point-to-point connections, but it is far safer if widespread eavesdropping and traffic analysis is considered to be a threat.
If proper Mixmaster [10][23][66] dithering of the timing and size of transmissions is employed -- by padding all messages to the same size, sending garbage messages when there is nothing to send, and sending messages either at totally random times or totally periodic times -- it is possible that both sender and receiver could be beyond suspicion, as in the definition in Section 3.2.2. We will address further aspects of this mechanism, including its behavior against active attackers and widespread traffic analysis, in Chapter 3.

2.11 Reputations

It is expected that this architecture will be used for applications which handle personal data. Much of the strength of the privacy-protecting features of the architecture (see Chapter 3) derives from the use of pseudonyms in place of real user identities.

Trolling and spoofing

Given this, how does any user know anything at all about another user of the system? For example, in Yenta, how does a user know that the person on the other end of some link is not his or her supervisor, romantic partner, or family member, trolling for interests that the user would rather not admit to? This is an example of the more general problem of spoofing -- some user pretending to be someone else. In general, this is a difficult problem. We shall sketch out our overall approach to it here, but many of the details must wait until Chapter 3 provides essential background and algorithms.

The architecture we present attempts to solve this problem by using reputations. Users may make any number of statements about themselves, called attestations, which are cryptographically signed by other users via their agents. These attestations are associated with the user's pseudonym -- their Yenta-ID in Yenta, for example -- and not their real identity, which may be unknown even to the user's own agent.
It is beyond the scope of this architectural description to specify exactly how these other users acquire the trust to sign someone's attestation -- in many cases, such as inside an organization, the users may be known to each other and therefore may sign each other's attestations on the basis of this shared knowledge. In other cases, such trust may come from long association and interactions through the application.

The web of trust

When two agents communicate, they may trade attestations. A user attempting to verify an attestation, whom we will call the verifier, must examine the signatures associated with the attestation, and must convince himself either that someone known to the verifier is one of the signatories, or that one of the signatories has in turn been endorsed (via their signed attestation) by someone known to the verifier. The verifier is therefore attempting to construct a chain of signatures which terminates at one or more other users already known to the verifier. This tactic is exactly the same as is used to verify the identity corresponding to PGP public keys [187], and is called a web of trust. The details of how identities are handled, and the cryptographic algorithms used to sign attestations, are deferred to Chapter 3.

Verifying attestations is a fundamentally peer-to-peer operation. There is no trusted certifying authority, and no assumed hierarchy to the signatures being presented. How many signatures are required, from whom, and the exact structure required of the signature chain are completely up to the verifier's discretion. The verifier's policy may change depending on the use to which the information will be put -- for example, in Yenta, a conversation with some unknown other user about a noncontroversial topic may not require any verification at all.

Word-of-mouth reputations

Like the referral algorithm described in Section 2.8, this is a word-of-mouth approach.
It resembles the stereotype of small-town gossip and reputations, although this analogy is not exact -- in small towns, the gossip is usually about third parties, whereas here the statements made are about the person who is making the statement.

There is nothing preventing a single distinguished signer -- some signer that is well-known to a large fraction of users -- from becoming established. This requires only that all users know about this signer, and that they trust it. Such a scenario is likely in an organization, which may have designated some individual to hold corporate cryptographic keys or the like, and which can disseminate to all users, through some mechanism not specified here, who the signer is and why the other users should trust it. However, such a distinguished signer is outside the scope of this architectural description; it is a local policy issue.

Any given user's attestations are stored (and offered) by his or her own agent. This must be so, because there is in general no distinguished location in the system to ask about any other user's reputation -- the attestations come from the user himself. Because the user owns his own attestations, it is likely that only positive attestations, i.e., those that cast the user in a favorable light, will be offered. Verifiers thus walk a fine line in their judgments about attestations: while excessively positive attestations are unlikely to be signed by anyone trustworthy, negative attestations are unlikely to exist at all.

Additional details about the cryptographic operation of attestations are provided in Chapter 3. Yenta's use of attestations is described in Chapter 4.

2.12 Running multiple agents on one host

The architecture presented here has a rather unusual problem, namely, how can multiple users run the application simultaneously on the same host?
At first glance, this appears completely straightforward -- isn't it common that users on a timesharing host can both run telnet at the same time, for example? -- but there are wrinkles in this architecture that make the straightforward solution inappropriate.

Typical client/server

Applications which use IP networks to communicate identify the connection via a 4-tuple of the local and remote host IP addresses and port numbers. In general, the host IP address determines which computer is involved, and the port number determines which program is involved, at each end of the link. Typical applications, such as telnet, depend on contacting a known port on the server end -- for example, telnet uses port 23. A daemon process that listens to that port then creates an appropriate server which handles a client's inbound connection.

Privileged daemons

Unfortunately, this process requires that the daemon run as a privileged user under most operating systems, since it must be able to create the server process as the appropriate user -- otherwise, the server process could not access things that the user himself could access. If the server process were an FTP server, for example, the user would be unable to access his files unless everyone could.

Ephemerality of servers

Further, the server process that is created by this mechanism typically interacts only with the host operating system -- its files and so forth -- but does not then open additional network connections. Finally, server processes tend to be ephemeral -- when the client network connection vanishes, so should the server.

We have different requirements

The architecture presented here is somewhat different. It is inconvenient to require that users running Yenta, say, also arrange to have their administrator install a privileged program in order to do so. Furthermore, such a privileged program would be a tempting target for attack.
For example, if all traffic passed through the daemon, it would potentially be tappable at that point. And applications which use SSL to protect their communications -- as Yenta does, for example (see Section 4.8.1) -- cannot tunnel their encrypted data through the server, since the SSL architecture [63] does not permit this.

The portmapper

Instead, we run a port mapper service. The first copy of the application to be started on any given host starts listening on the well-known port -- the WKP -- for the application. (In Yenta, for example, this is port 14990.) We shall call this copy of the application the portmapper. The portmapper's acquisition of the well-known port prevents any other program on the system from listening on that same port. The application then forks; the other half of the fork then starts up as usual and runs the normal user application.

Acquiring the well-known port; registering with the portmapper

Whenever any application starts up on the host, it attempts to acquire the WKP. If it succeeds, it forks as above, and one half becomes the portmapper. If it fails, then it knows that a portmapper is already running. In this case, the application scans the available range of ports until it finds one that is unused, and acquires it; let us call this port P. The application then registers with the portmapper -- it gives the portmapper its identity (in Yenta, its Yenta-ID -- see Section 3.4) and the port it acquired. The portmapper stores this value in an internal table.

Inbound connections

Any inbound application attempts to connect on the well-known port. It specifies the identity of the desired agent that it wishes to communicate with -- as above, in Yenta, this is the YID. The portmapper consults its internal table and tells the inquiring application to reconnect on port P instead.

Handling crashes

Applications try to reacquire the WKP at regular intervals.
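The acquisition attempt that an application makes at start-up, and repeats at intervals, might look like the sketch below. It is an illustration under assumptions, not Yenta's code: the function names, the loopback host, and the scan range are hypothetical, and the fork and the registration of (identity, P) with the portmapper are omitted.

```python
import socket

def try_acquire(port, host="127.0.0.1"):
    """Try to listen on `port`; return the listening socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        return None

def start_up(wkp, scan_range):
    """Return ("portmapper", sock) if the WKP was free, else ("agent", sock) on some port P."""
    sock = try_acquire(wkp)
    if sock is not None:
        return "portmapper", sock      # the real design forks here; omitted in this sketch
    for port in scan_range:
        sock = try_acquire(port)
        if sock is not None:
            return "agent", sock       # next step would be registering (identity, P)
    raise RuntimeError("no free port in the scanned range")
```

Calling start_up twice with the same well-known port on one host would be expected to yield a portmapper the first time and an ordinary agent on a scanned port the second time, since the operating system refuses the second bind.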
A success means that the existing portmapper must have died; the application that reacquired the port forks and becomes the new portmapper. Similarly, applications attempt to reregister with the portmapper at regular intervals; this enables a newly-started portmapper to rebuild its table.

Denial-of-service

A portmapper which acquires the port and then refuses to serve any requests -- or which provides incorrect data for requests -- is engaging in a denial-of-service attack; as we specified in Section 2.3, this is explicitly not a part of our threat model. (Presumably, on a real timesharing host, other users of the application will list the system's processes, discover the true identity of the user running the malicious portmapper, and will complain vigorously to the perpetrator.)

Security preserved

Note carefully how this approach fulfills the goals required of our architecture. The portmapper contains no personal data -- agent ID's are public information. No personal data goes to any third-party process -- the portmapper never sees the encrypted data stream between any two applications. No privileged process is required, and there is no single point at which security may be compromised.

2.13 Evaluation hooks

Our final topic of this chapter concerns monitoring the operation of the system. The sample application described in Chapter 4 is a research prototype, and consequently it is valuable to have the ability to collect information from it while it runs. Other applications might also benefit from the ability to observe their operation; such observation can be invaluable for locating architectural or implementation bugs, for example.

In arranging such a monitoring capability, however, we must be careful not to undo the privacy protections that the architecture tries so hard to put in place. The sketch that follows details some of the steps involved, so as to complete our architectural description.
Details of how Yenta arranges to be monitored are presented in Chapter 4.

We assume that monitoring the running system can be accomplished by collecting statistics, from each agent, which detail what actions that agent has taken recently, whether or not it has detected any internal inconsistencies, and some information about its internal databases. Exactly what this information consists of is, of course, application-dependent.

A central receiver -- a big problem?

In order to allow these statistics to be analyzed, they must be accumulated in a single place -- a central receiver of statistical data. This is an alarming suggestion to anyone who has read Section 1.5: such a suggestion could potentially run afoul of all the problems of trust expressed in that section. The key is to arrange for anonymity of the collected data and confidentiality of its transmission. We shall examine these in turn.

Anonymity

In order for the data to be anonymous, there must not be anything in it that can be related back to a particular user. We already assume that there is more than one user in the system, from Section 2.2, which makes the most obvious attack -- knowing that all the data is from the system's only user -- infeasible. The particular application being run must also take care to sanitize its data, by removing as many personally-identifiable details from the reported data as possible. For example, if the application handles messages between users, and it is important to see some of the contents of these messages, the identities of the correspondents should not be transmitted. Preferably, the messages themselves should not be reported -- if what we care about is, say, the average message length, then only the length of the message should be reported in the first place. This is analogous to the caution expressed in Section 1.5 about not collecting anything which you are not willing to have be the subject of a subpoena.
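As a small illustration of reporting only what is needed, the sketch below reduces a message event to its length before anything leaves the agent. The event layout and the function name are hypothetical; a real application would decide for itself which statistics are safe to report.

```python
def sanitize_message_event(event):
    """Reduce a raw message event to the only statistic we intend to analyze.

    `event` is a hypothetical dict with 'sender', 'recipient', and 'body' keys.
    Only the body length leaves the agent; identities and content do not.
    """
    return {"message_length": len(event["body"])}

raw = {"sender": "alice-pseudonym", "recipient": "bob-pseudonym", "body": "Lunch at noon?"}
report = sanitize_message_event(raw)
print(report)
```

Building the report from scratch, instead of deleting fields from a copy of the event, means that a field added to the event later cannot leak into the report by accident.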
The point of sanitizing the data is to eliminate the issue of having to trust the central server. This means that the central server can leave the accumulated data in the clear, on disks which might be the subject of an intrusion or subpoena, without compromising users' privacy.

Unlinkability must be what we are protecting

It is very important that the sanitization process takes into account that some data is dangerous regardless of whether it can be associated with a particular individual. For example, data on how to build a nuclear bomb in one's backyard, using components from the corner hardware store, should presumably not be allowed to reside on the central server even if it is not possible to connect it with any particular person -- the mere disclosure of the data itself, due to compromise of the server, could have disastrous consequences. Care is required of the application designer if data like this could be present in the system.

Sanitizing the data is part of the solution. In many cases, however, one might wish to analyze the behavior of particular agents over time. It must be possible to determine unambiguously which agent is which, but it is presumably irrelevant exactly whose agent is the one reporting a particular item. In other words, we care about distinguishing agents from each other, but not about mapping them back to user identities.

Random unique-ID's

The solution to this problem is straightforward -- have each agent assign itself a unique identifier, not related in any way to anything else about the user (neither the user's identity, nor his characteristics), and report that unique identifier when sending data to the central receiver. This identifier should not be the same as the identifier which is a pseudonym for the user -- or any other identifier at all -- since the whole point is to make statistical data collection unlinkable to actual users or their online identities.
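One way such an identifier might be generated is sketched below, together with a rough birthday-bound estimate of how likely two agents are to pick the same value. The 128-bit length, the function names, and the use of Python's secrets module are choices made for this sketch, not a description of how Yenta generates its reporting identifier.

```python
import secrets

ID_BITS = 128  # identifier length chosen for this sketch

def new_reporting_id():
    """Return a fresh random identifier, unrelated to any user data or pseudonym."""
    return secrets.token_hex(ID_BITS // 8)

def collision_probability(num_agents, bits=ID_BITS):
    """Birthday-bound estimate: n(n-1)/2 pairs, each colliding with chance 2**-bits."""
    return num_agents * (num_agents - 1) / 2 / 2**bits

reporting_id = new_reporting_id()
print(reporting_id, collision_probability(10**9))
```

The identifier is drawn from the operating system's random source and derived from nothing else, so knowing it reveals nothing about the user's pseudonym or characteristics.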
For example, in Yenta, the ID we are discussing here is not the Yenta-ID.

This unique identifier can be simply any sufficiently-random collection of bits which is long enough that accidental collisions (birthday paradoxes) are unlikely. For example, in any reasonable application, 128 bits is perfectly sufficient.

If the data is sufficiently sanitized before transmission, and any identification information is restricted to disambiguating multiple agents from each other, then the data as collected at the central server is relatively safe. None of the threats mentioned in Section 1.5 present an insurmountable problem, because the data cannot be related back to anyone who could be harmed by its disclosure, and we are assuming that the data collected is inherently safe if its source is unknown.

Confidentiality

The remaining issue is confidentiality. It is insufficient to protect the data only once it arrives at the server, since an eavesdropper may be present between any given agent and the server. (Indeed, one of the best places such an eavesdropper could possibly be is right at the server, since all application traffic destined for the server will pass that point.) Such an eavesdropper could identify both the contents of the traffic and, for instance, the IP address of its origin; this could lead to disclosure of the mapping between any particular piece of data and the user who originated it. To protect users against this threat, the data in transit to the central server must be encrypted.

Unless the application logs at different intervals or at different lengths depending on some confidential data, or unless the mere fact that a given user is running the application at all is considered confidential, this is sufficient to defeat eavesdropping of the contents of the transmission, and traffic analysis of the communication.
Note that if the mere fact of whether or not someone is running the application is considered confidential, we may use a modification of the broadcasting solution of Section 2.10 to help. Rather than having every agent log directly to the central server, it could ask that its logging information be routed through n random other members of some cluster(s) before final transmission. The intermediate hops need not (indeed, cannot) decrypt the communication, and the central server (and any eavesdropper positioned there) has no idea where the logging information truly originated. If we are using this tactic, then the actual data should be encrypted with a public key whose corresponding private key is known only to the central server, and not to any agent in the system. Intermediate agents cannot then decrypt the data, and even an eavesdropper at the server who possesses the server's private key cannot, by the time the data is received, know where it came from.

Central server is not a fundamental part of the architecture

It should again be emphasized that the rest of the architecture present