Cornell Systems Lunch

CS 7490 Fall 2015
Friday 12PM, Gates 114

Emin Gun Sirer and Robbert van Renesse


Sponsored by VMware

The Systems Lunch is a seminar for discussing recent, interesting papers in the systems area, broadly defined to span operating systems, distributed systems, networking, architecture, databases, and programming languages. The goal is to foster technical discussions among the Cornell systems research community. We meet once a week on Fridays at noon in Gates 114.

The systems lunch is open to all Cornell Ph.D. students interested in systems. First-year graduate students are especially welcome. Non-Ph.D. students have to obtain permission from the instructor. Student participants are expected to sign up for CS 7490, Systems Research Seminar, for one credit.

To join the systems lunch mailing list please send an empty message to cs-systems-lunch-l-request@cornell.edu with the subject line "join". More detailed instructions can be found here.

Links to papers and abstracts below are unlikely to work outside the Cornell CS firewall. If you have trouble viewing them, this is the likely cause.

Date Paper Presenter
August 28 A Fast Compiler for NetKAT
Steffen Smolka (Cornell University), Spiridon Eliopoulos (Inhabited Type), Nate Foster (Cornell University), and Arjun Guha (University of Massachusetts Amherst)
ICFP 2015
Steffen Smolka
September 4 Building efficient network dataplanes in software

In this talk we survey the solutions -- and especially the underlying design principles -- that we developed in recent years to achieve extremely high packet processing rates in commodity operating systems, for both bare metal and virtual machines.

By using simple abstractions, and resisting the temptation to design the system around performant but constraining assumptions, we have been able to build a very flexible framework that addresses the speed and latency requirements of both bare metal and virtual machines up to the 100 Gbit/s range. Our goal is not to provide the fastest framework in the universe, but one that is easy to use and rich in features (and still amazingly fast), taking concerns about the dataplane's speed away from (reasonable) network applications.

Our NETMAP framework, open source and BSD licensed, runs on three OSes (FreeBSD, Linux, Windows); provides access to NICs, the host stack, virtual switches, and point-to-point channels (netmap pipes), running between 20 and over 100 Mpps on software ports, and saturating NICs with just a single core (reported up to 45 Mpps on recent 40G NICs); achieves bare-metal speed on VMs thanks to a virtual passthrough mode (available for QEMU and bhyve); and can be used with no modifications by libpcap clients.
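
As a minimal sketch of what using netmap looks like, the loop below receives packets through the nm_open()/nm_nextpkt() helpers from net/netmap_user.h; the interface name "netmap:em0" and the packet count are placeholders:

```c
#define NETMAP_WITH_LIBS        /* pulls in the nm_open()/nm_nextpkt() helpers */
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>
#include <sys/types.h>

int main(void) {
    /* attach to the NIC in netmap mode ("netmap:em0" is a placeholder) */
    struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
    if (d == NULL) {
        perror("nm_open");
        return 1;
    }
    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
    for (int got = 0; got < 10; ) {     /* grab a few packets, then exit */
        poll(&pfd, 1, -1);              /* block until the RX ring has data */
        struct nm_pkthdr h;
        const u_char *buf;
        /* packets are read zero-copy out of the shared rings */
        while ((buf = nm_nextpkt(d, &h)) != NULL) {
            (void)buf;                  /* frame bytes live in the ring */
            printf("got a %u-byte packet\n", h.len);
            got++;
        }
    }
    nm_close(d);
    return 0;
}
```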

Luigi Rizzo (Università di Pisa)
September 11 Verifying the correctness of remote executions: from theoretical possibility to near practicality

How can a machine specify a computation to another one and then, without executing the computation, check that the remote machine carried it out correctly? And how can this be done without assumptions about the performer (replication, etc.) or restrictions on the computation? This is a classic problem in systems security, and is relevant in multiple contexts: cloud computing, an untrusted hardware supply chain, etc. For decades, it has been known that this problem can be solved in theory, using probabilistically checkable proofs (PCPs) coupled with cryptographic tools. The rub was practicality: if implemented naively, the theory would be astronomically more expensive than simply executing the computation locally.

Over the last several years, a number of projects have brought this theory to near practicality. In this emerging area of _proof-based verifiable computation_, the pace of progress has been rapid, and there have been many encouraging developments: factors-of-a-trillion speedups, compilers that transform from high-level programs to executables that implement the protocol entities, and sophisticated implementations.

I will discuss the general area and my group's efforts in refining theory and building systems. A notable recent result is our Buffet system. Buffet has very good expressivity (it handles almost the entire C programming language as well as setups such as remote MapReduce jobs, database queries, etc.) and achieves the best performance in the area by multiple orders of magnitude. I will also cover the remaining obstacles to practicality in this research area; my hope is to communicate cautious optimism, with the emphasis on cautious.
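
The systems in the talk rest on PCPs and cryptographic machinery, but the core principle -- checking a result far more cheaply than recomputing it, without trusting the prover -- shows up already in a classic toy example, Freivalds' probabilistic check of a claimed matrix product. Below is a minimal sketch of that check, not of the talk's protocols:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 3
#define TRIALS 30   /* a wrong product survives with probability <= 2^-TRIALS */

/* Freivalds' check: verify the claim A*B == C in O(N^2) per trial,
 * instead of recomputing the product in O(N^3). */
static int freivalds(const int A[N][N], const int B[N][N], const int C[N][N]) {
    for (int t = 0; t < TRIALS; t++) {
        int r[N], Br[N], ABr[N], Cr[N];
        for (int i = 0; i < N; i++) r[i] = rand() & 1;   /* r in {0,1}^N */
        for (int i = 0; i < N; i++) {                    /* Br = B*r, Cr = C*r */
            Br[i] = Cr[i] = 0;
            for (int j = 0; j < N; j++) {
                Br[i] += B[i][j] * r[j];
                Cr[i] += C[i][j] * r[j];
            }
        }
        for (int i = 0; i < N; i++) {                    /* ABr = A*(B*r) */
            ABr[i] = 0;
            for (int j = 0; j < N; j++) ABr[i] += A[i][j] * Br[j];
        }
        for (int i = 0; i < N; i++)
            if (ABr[i] != Cr[i]) return 0;               /* caught a bad claim */
    }
    return 1;
}

int main(void) {
    int A[N][N] = {{1,2,0},{0,1,1},{1,0,1}};
    int B[N][N] = {{1,0,1},{1,1,0},{0,1,1}};
    int C[N][N] = {{3,2,1},{1,2,1},{1,1,2}};   /* claimed product A*B */
    printf("verifier %s the claim\n", freivalds(A, B, C) ? "accepts" : "rejects");
    return 0;
}
```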

Mike Walfish (NYU)
September 18 Microsoft Research India Overview
Established in 2005, Microsoft Research India recently celebrated its 10th anniversary. We give an overview of research at MSR India. We cover our systems work (specifically, cloud security, privacy, and language and tool support for machine learning) in some detail, and give a brief tour of our work in theory, machine learning, and the role of technology in socio-economic development. We also present career opportunities at MSR India (we are hiring!), and encourage students to apply.
Chandu Thekkath & Sriram Rajamani (MSR India)
September 25 Computing Systems, from 300,000 BC to 2029
Computing machines have transformed the human condition and psyche more than any other technology over the past half century, empowering billions of people beyond what even the most despotic emperor could imagine just a few decades ago. But the computing systems we use today are built with 1960s-era technologies and suffer from engineering decisions made under constraints very different from those we face today. In this talk, I'll take a whirlwind tour through some episodes in computing's past that explain how we got where we are today, argue that only a fool would attempt to make predictions about the future, and make my own reckless pronouncements about what is (or should be) to come.
David Evans (UVA)
October 2 Industry-Academic Partnership Workshop
Please contact Jose Martinez for the poster session
Location: Duffield Atrium
IAP Workshop
October 9 Space Bounds for Reliable Storage: Fundamental Limits of Coding
We study the inherent space requirements of shared storage algorithms in asynchronous fault-prone systems. Previous works use codes to achieve a better storage cost than the well-known replication approach. However, a closer look reveals that they incur extra costs somewhere else: some use unbounded storage in communication links, while others assume bounded concurrency or synchronous periods. We prove here that this is inherent: indeed, if there is no bound on the concurrency level, then the storage cost of any reliable storage algorithm is at least f + 1 times the data size, where f is the number of tolerated failures. The lower bound is information-theoretic: it applies to any storage algorithm, whether coded or uncoded. We further present a technique for combining erasure codes with full replication so as to obtain the best of both. We present a storage algorithm whose storage cost is close to the lower bound in the worst case, and that adapts to the concurrency level when it is low.
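
As a rough numeric illustration (a sketch with hypothetical parameters, not code from the paper): with f tolerated failures and an item of size D, full replication across 2f+1 servers costs (2f+1)D, while an (f+1)-of-(2f+1) erasure code appears to cost only (2f+1)D/(f+1). Note how the apparent coded cost dips below (f+1)D; the theorem says that saving cannot actually be sustained by any reliable algorithm once the concurrency level is unbounded.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical parameters: f tolerated failures, item of size D = 1. */
    double D = 1.0;
    for (int f = 1; f <= 4; f++) {
        int n = 2 * f + 1;                 /* servers */
        int k = f + 1;                     /* any k surviving chunks rebuild */
        double replication = n * D;        /* full copy at every server */
        double coded = n * (D / k);        /* apparent cost of pure coding */
        double bound = (f + 1) * D;        /* information-theoretic floor
                                              under unbounded concurrency */
        printf("f=%d: replication=%.2fD, coded=%.2fD, lower bound=%.2fD\n",
               f, replication, coded, bound);
    }
    return 0;
}
```
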
Idit Keidar (Technion - FuDiCo Workshop)
Towards Consistent and Private Large Scale Distributed Data Management Systems that are Functional and Efficient
Data is being generated, stored, managed, and analyzed all around us at unprecedented rates. Large-scale, globally dispersed data manipulation mechanisms are the common ecosystem for managing such big data sets. However, managing large data sets in such settings raises significant consistency and privacy concerns that have resulted in sacrifices in functionality or efficiency. In this talk, we will explore approaches that mitigate these shortcomings while ensuring consistent data management and protecting data from an increasing number of non-trivial adversarial threats. The talk will relate some of the foundational works of the 80s to the future of extensible distributed data management systems.
Amr El Abbadi (UCSB - FuDiCo Workshop)
October 16 Internet At The Speed of Light
For the past two decades Internet service providers have focused on increasing the amount of bandwidth available to home users. But for many Internet services, such as electronic commerce and search, reducing latency is the key to improving the user experience and increasing service provider revenues. While in principle the speed of the Internet could nearly match the speed of light, in practice inefficiencies in the physical infrastructure and in network protocols result in latencies that are often one to two orders of magnitude larger than the lower bound implied by the speed of light. Hence, we propose a challenge to the networking research community: build a speed-of-light Internet. This talk explores the various causes of delay on the Internet, sketches out two approaches for improving the physical infrastructure, and explores which applications will benefit most from reduced latency.
Bruce Maggs (Duke)
October 23 VC3: Trustworthy Data Analytics in the Cloud
F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado, G. Mainar-Ruiz, and M. Russinovich
Microsoft Technical Report MSR-TR-2014-39. 2014
Kyle Croman
October 30 Embassies: Radically Refactoring the Web
Jon Howell, Bryan Parno, John R. Douceur
NSDI 2013
Matthew Milano
November 6 Exploding the Linux Container Host

The container vs. VM debate has been a pertinent question for years, but with the recent popularity of Docker, many have been keenly anticipating VMware's response. Their CEO's early claim was that containers and VMs are "better together", but what exactly does that mean? Should we run containers in VMs, instead of VMs, or even as VMs? In answer to this, Ben Corrie has spent the last year researching the most efficient way to run containers directly on VMware's hypervisor and recently announced Project Bonneville. In this talk, Ben will take you under the covers of Bonneville to look at questions such as: how you can have a shared Linux kernel where every container is privileged; how you can have containers without any Linux at all; and how VMware has brought dynamic resource constraints to the notion of a container host.
Ben Corrie (VMware)
November 13 Hybrid Clocks for High Auditability
Work on the theory of distributed systems abstracts away physical-clock time and uses the notion of logical clocks for ordering events in asynchronous distributed systems. The practice of distributed systems, on the other hand, employs clocks loosely synchronized using NTP in a best-effort manner, without any guarantees. Recently, we introduced a third option: hybrid clocks. Hybrid clocks combine the best of logical and physical clocks; they are immune to the disadvantages of either while providing the benefits of both. Hybrid clocks are loosely synchronized using NTP, yet they also provide guaranteed comparison conditions as in logical clocks (LC) or vector clocks (VC), even within the time-synchronization uncertainty intervals of events.

In this talk, I will present two flavors of our hybrid clocks: hybrid logical clocks (HLC) and hybrid vector clocks (HVC). HLC finds applications in multiversion distributed database systems and enables efficient querying of consistent snapshots. HVC finds applications in debugging concurrency race conditions and in lightweight causal delivery of messages. For all practical applications and deployment environments, the size of HVC remains only a couple of entries, substantially less than N, the number of nodes. I will also talk about higher-level auditability and availability primitives, detectors and correctors, that hybrid clocks enable.
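
For reference, here is a minimal C sketch of the HLC update rules from the HLC work: a timestamp is a pair (l, c), where l tracks the largest physical time seen and c breaks ties within it. The nanosecond wall-clock source is an assumption of the sketch.

```c
#include <stdint.h>
#include <time.h>

typedef struct { int64_t l; int64_t c; } hlc_t;   /* HLC timestamp (l, c) */

static int64_t physical_now(void) {
    struct timespec ts;                  /* assumption: POSIX CLOCK_REALTIME */
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int64_t max2(int64_t a, int64_t b) { return a > b ? a : b; }

/* Local or send event: advance l to the physical clock if it has moved,
 * otherwise bump the logical component c. */
void hlc_local(hlc_t *h) {
    int64_t l_old = h->l;
    h->l = max2(l_old, physical_now());
    h->c = (h->l == l_old) ? h->c + 1 : 0;
}

/* Receive event: merge the sender's timestamp m so the new timestamp
 * dominates both clocks yet stays close to physical time. */
void hlc_recv(hlc_t *h, hlc_t m) {
    int64_t l_old = h->l;
    h->l = max2(max2(l_old, m.l), physical_now());
    if (h->l == l_old && h->l == m.l)
        h->c = max2(h->c, m.c) + 1;
    else if (h->l == l_old)
        h->c = h->c + 1;
    else if (h->l == m.l)
        h->c = m.c + 1;
    else
        h->c = 0;
}
```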

Murat Demirbas (Buffalo)
November 20 ACSU Luncheon, no meeting.
November 27 Thanksgiving Break, no meeting.
December 4 When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries

The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. Previous work has examined attribution of authors from both source code and compiled binaries, and found that while source code can be attributed with very high accuracy, the attribution of executable binaries appears to be much more difficult. Many potentially distinguishing features present in source code, e.g., variable names, are removed in the compilation process, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship.

We examine executable binary authorship attribution from the standpoint of machine learning, using a novel set of features that include ones obtained by decompiling the executable binary to source code. We show that many syntactical features present in source code do in fact survive compilation and can be recovered from the decompiled executable binary. This allows us to add a powerful set of techniques from the domain of source code authorship attribution to the existing ones used for binaries, resulting in significant improvements to accuracy and scalability. We demonstrate this improvement on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 20 candidate programmers. We also demonstrate that our approach is robust to a range of compiler optimization settings and to binaries that have been stripped of their symbol tables. Finally, for the first time that we are aware of, we demonstrate that authorship attribution can be performed on real-world code found "in the wild" by performing attribution on single-author GitHub repositories.
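
The paper's actual pipeline (decompile, extract rich syntactic features, classify with machine learning) is far beyond what fits here; purely as a toy sketch of the stylometric idea, the code below compares keyword-frequency vectors of two decompiled samples by cosine similarity. All snippets and feature choices are hypothetical.

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

/* Toy stylometric features: frequencies of a few C keywords in a
 * decompiled sample (the paper uses far richer syntactic features). */
#define NKW 4
static const char *keywords[NKW] = { "for", "while", "goto", "if" };

static void features(const char *src, double v[NKW]) {
    for (int i = 0; i < NKW; i++) {
        size_t klen = strlen(keywords[i]);
        int count = 0;
        for (const char *p = strstr(src, keywords[i]); p != NULL;
             p = strstr(p + klen, keywords[i]))
            count++;
        v[i] = count;
    }
}

static double cosine(const double a[NKW], const double b[NKW]) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < NKW; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-12);
}

int main(void) {
    /* hypothetical decompiled snippets: known author vs. unknown author */
    const char *known   = "for(;;){ if(x) goto out; } for(;;){}";
    const char *unknown = "for(;;){ for(;;){ if(y) goto out; } }";
    double a[NKW], b[NKW];
    features(known, a);
    features(unknown, b);
    printf("style similarity: %.3f\n", cosine(a, b));
    return 0;
}
```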

Aylin Caliskan-Islam (Princeton)