Keun Soo YIM

K. S. Yim is an experimental computer scientist working as a tech lead manager (software engineer) at Google. His current research interests are in on-device machine learning techniques for privacy- and safety-critical applications. KS holds 30+ United States patents on mobile/embedded system management methods and apparatuses, and has published 18+ technical papers in top-tier journals and conferences. He obtained his Ph.D. in computer science from the University of Illinois at Urbana-Champaign. Dr. Yim was a co-chair of the industry track of IEEE ISSRE (Software Reliability Engineering) 2017, and has served on multiple program committees of international conferences and workshops (in the fields of fault-tolerant computing, software engineering, computer systems, and parallel and distributed computing). His personal page has all of his paper and slide files, and a full patent listing.
Authored Publications
    This chapter explores the possibility of building a unified assessment methodology for software reliability and security. The fault injection methodology originally designed for reliability assessment is extended to quantify and characterize the security defense aspects of native applications, i.e., system software written in the C/C++ programming languages. Specifically, software fault injection is used to measure the portion of injected software faults caught by the built-in error detection mechanisms of a target program (e.g., the detection coverage of assertions). To automatically activate as many injected faults as possible, a gray-box fuzzing technique is used. Using dynamic analyzers during fuzzing further helps us catch the critical error propagation paths of injected (but undetected) faults and identify code fragments as targets for security hardening. Because conducting software fault injection experiments with fuzzing is an expensive process, a novel, locality-based fault selection algorithm is presented. The presented algorithm increases the fuzzing failure ratios by 3–19 times, accelerating the experiments. The case studies use all of the above experimental techniques to compare the effectiveness of fuzzing and testing, and consequently assess the security defense of native benchmark programs.
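
The abstract does not show the selection step itself; as a rough illustration only, the C++ sketch below (with hypothetical names such as FaultSite and SelectByLocality) shows how a locality-based selection might prioritize candidate fault sites that sit near previously activated ones, so that a fuzzer is more likely to activate them. The real algorithm in the chapter uses richer information than this toy distance score.

```cpp
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical description of a candidate software-fault injection site.
struct FaultSite {
  std::string file;  // source file containing the candidate fault
  int line;          // line number of the candidate fault
};

// Locality score: sites close (same file, nearby lines) to previously
// activated faults are assumed to be easier to activate again by the fuzzer.
int LocalityScore(const FaultSite& cand, const std::vector<FaultSite>& activated) {
  int best = 0;
  for (const auto& a : activated) {
    if (a.file != cand.file) continue;
    int dist = std::abs(a.line - cand.line);
    best = std::max(best, 1000 - std::min(dist, 1000));
  }
  return best;
}

// Sort candidates so that high-locality sites are injected first.
void SelectByLocality(std::vector<FaultSite>& candidates,
                      const std::vector<FaultSite>& activated) {
  std::stable_sort(candidates.begin(), candidates.end(),
                   [&](const FaultSite& x, const FaultSite& y) {
                     return LocalityScore(x, activated) > LocalityScore(y, activated);
                   });
}

int main() {
  std::vector<FaultSite> activated = {{"parser.cc", 120}};
  std::vector<FaultSite> candidates = {{"util.cc", 40}, {"parser.cc", 130}, {"parser.cc", 900}};
  SelectByLocality(candidates, activated);
  for (const auto& c : candidates) std::cout << c.file << ":" << c.line << "\n";
}
```
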
    TREBLE: Fast Software Updates by Creating an Equilibrium in an Active Software Ecosystem of Globally Distributed Stakeholders
    Iliyan Batanov Malchev
    Andrew Hsieh
    Dave Burke
    ACM Transactions on Embedded Computing Systems, vol. 18(5s) (2019), 23 pages
    This paper presents our experience with Treble, a two-year initiative to build the modular base in Android, a Java-based mobile platform running on the Linux kernel. Our TREBLE architecture splits the hardware-independent core framework written in Java from the hardware-dependent vendor implementations (e.g., user-space device drivers, vendor native libraries, and the kernel, written in C/C++). Cross-layer communications between them are done via versioned, stable inter-process communication interfaces whose backward compatibility is tested using two API compliance suites. Based on this architecture, we repackage the key Android software components that suffered from crucial post-launch security bugs as separate images. This not only enables separate ownership but also independent updates of each image by interested ecosystem entities. We discuss our experience of delivering TREBLE architectural changes to silicon vendors and device makers using a yearly release model. Our experiments and industry rollouts support our hypothesis that giving more freedom to all ecosystem entities and creating an equilibrium are the transformation necessary to further scale the world's largest open ecosystem, with over two billion active devices.
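
As a sketch of the idea of a versioned, stable interface between the framework and the vendor implementation, the following C++ fragment uses hypothetical names (ILightHal, VendorLightHal, IsCompatible); it is not the actual HIDL/AIDL interface definition used by TREBLE, only an illustration of why the two images can be updated independently.

```cpp
#include <iostream>

// Hypothetical hardware-independent interface that the framework depends on.
// In TREBLE, such interfaces are versioned and their backward compatibility is
// verified by compliance suites; this sketch only mimics that idea.
class ILightHal {
 public:
  virtual ~ILightHal() = default;
  virtual int InterfaceVersion() const = 0;   // version implemented by the vendor
  virtual void SetBrightness(int level) = 0;  // stable API since v1
};

// Hypothetical vendor implementation shipped in the vendor image.
class VendorLightHal : public ILightHal {
 public:
  int InterfaceVersion() const override { return 2; }
  void SetBrightness(int level) override {
    std::cout << "vendor driver: brightness=" << level << "\n";
  }
};

// Toy "compliance check": the framework image can be updated independently as
// long as the vendor implementation still satisfies the minimum version.
bool IsCompatible(const ILightHal& hal, int min_version) {
  return hal.InterfaceVersion() >= min_version;
}

int main() {
  VendorLightHal hal;
  if (IsCompatible(hal, /*min_version=*/1)) hal.SetBrightness(128);
}
```
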
    A Taste of Android Oreo (v8.0) Device Manufacturer
    Iliyan Batanov Malchev
    Dave Burke
    ACM Symposium on Operating Systems Principles (SOSP) - Tutorial (2017)
    In 2017, over two billion Android devices developed by more than a thousand device manufacturers (DMs) around the world are actively in use. Historically, silicon vendors (SVs), DMs, and telecom carriers extended the Android Open Source Project (AOSP) platform source code and used the customized code in final production devices. Such forking, however, makes it hard to accept upstream patches (e.g., security fixes). In order to reduce such software update costs, starting from Android v8.0 the platform splits the hardware-independent framework from the hardware-dependent vendor implementation by using versioned, stable APIs (namely, the vendor interface), and the new Vendor Test Suite (VTS) verifies vendor implementations against these APIs. Android v8.0 thus opens the possibility of a fast upgrade of the Android framework as long as the underlying vendor implementation passes VTS. This tutorial teaches how to develop, test, and certify a compatible Android vendor interface implementation running below the framework. We use an Android Virtual Device (AVD) emulating an Android smartphone to implement a user-space device driver that uses formalized interfaces and RPCs, develop VTS tests for that component, execute the extended tests, and certify the extended vendor implementation.
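
For illustration, the following minimal C++ sketch mimics the shape of that exercise: a hypothetical user-space driver behind a formalized interface, plus a small host-side check standing in for a VTS test case. The names (SensorDriver, TestActivateThenRead) are assumptions for this sketch, not the actual vendor interface or VTS APIs taught in the tutorial.

```cpp
#include <cassert>
#include <iostream>

// Hypothetical formalized interface for a user-space driver, mirroring the
// kind of component implemented and tested in the tutorial.
struct SensorDriver {
  bool Activate(bool enable) { activated = enable; return true; }
  int Read() const { return activated ? 42 : -1; }  // fake sensor reading
  bool activated = false;
};

// A minimal host-side test, standing in for a VTS test case: drive the
// interface end to end and check the observable behavior.
void TestActivateThenRead() {
  SensorDriver driver;
  assert(driver.Read() == -1);    // reading before activation must fail
  assert(driver.Activate(true));  // activation must succeed
  assert(driver.Read() >= 0);     // activated driver must return data
  std::cout << "TestActivateThenRead passed\n";
}

int main() { TestActivateThenRead(); }
```
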
    Evaluation Metrics of Service-Level Reliability Monitoring Rules of a Big Data Service
    In Proceedings of the IEEE International Symposium on Software Reliability Engineering (ISSRE) (2016), pp. 376-387
    This paper presents new metrics to evaluate the reliability monitoring rules of a large-scale big data service. Our target service uses manually tuned, service-level reliability monitoring rules. Using the measurement data, we identify two key technical challenges in operating our target monitoring system. In order to improve the operational efficiency, we characterize how those rules were manually tuned by the domain experts. The characterization results provide useful information to operators who are expected to tune such rules regularly. Using the actual production failure data, we evaluate the same monitoring rules by using both standard metrics and the presented metrics. Our evaluation results show the strengths and weaknesses of each metric, and show that the presented metrics can further help operators recognize when and which rules need to be re-tuned.
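
To make the evaluation concrete, here is a small C++ sketch that scores a monitoring rule's alerts against ground-truth failure windows using standard precision and recall; the paper's own metrics are not reproduced here, and the Interval/Overlaps helpers are illustrative assumptions.

```cpp
#include <iostream>
#include <vector>

// Toy evaluation of a reliability monitoring rule against ground-truth
// failures, using standard metrics (precision and recall).
struct Interval { double begin, end; };  // time window, in arbitrary units

bool Overlaps(const Interval& a, const Interval& b) {
  return a.begin < b.end && b.begin < a.end;
}

int main() {
  std::vector<Interval> alerts   = {{1, 2}, {5, 6}, {9, 10}};  // rule firings
  std::vector<Interval> failures = {{1.5, 3}, {7, 8}};         // real outages

  int true_positives = 0;  // alerts that overlap at least one real outage
  for (const auto& a : alerts)
    for (const auto& f : failures)
      if (Overlaps(a, f)) { ++true_positives; break; }

  int detected_failures = 0;  // outages covered by at least one alert
  for (const auto& f : failures)
    for (const auto& a : alerts)
      if (Overlaps(f, a)) { ++detected_failures; break; }

  double precision = static_cast<double>(true_positives) / alerts.size();
  double recall = static_cast<double>(detected_failures) / failures.size();
  std::cout << "precision=" << precision << " recall=" << recall << "\n";
}
```
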
    The Rowhammer Attack Injection Methodology
    In Proceedings of the IEEE Symposium on Reliable Distributed Systems (SRDS) (2016), pp. 1-10
    This paper presents a systematic methodology to identify and validate security attacks that exploit user-influenceable hardware faults (i.e., rowhammer errors). We break down rowhammer attack procedures into nine generalized steps, some of which are designed to increase the attack success probabilities. Our framework can perform those nine operations (e.g., pressuring system memory and spraying landing pages) as well as inject rowhammer errors, which are modeled as ≥3-bit errors. When one of the injected errors is activated, it can cause control- or data-flow divergences that can then be caught by a prepared landing page and thus lead to a successful attack. Our experiments conducted against a guest operating system of a typical cloud hypervisor identified multiple reproducible targets for privilege escalation, shell injection, memory and disk corruption, and advanced denial-of-service attacks. Because the presented rowhammer attack injection (RAI) methodology uses error injection and thus statistical sampling, RAI can quantitatively evaluate the modeled rowhammer attack success probabilities for any given target software state.
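
A minimal sketch of the error model only (not of an actual attack): the C++ fragment below flips three or more distinct bits in one word of a victim buffer, which is how the paper models injected rowhammer errors. InjectMultiBitError and the buffer layout are assumptions for illustration; no DRAM disturbance is actually triggered.

```cpp
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

// Model a rowhammer-induced error as flipping >=3 distinct bits inside one
// 64-bit word of a target buffer. This is a pure software simulation.
void InjectMultiBitError(std::vector<uint64_t>& memory, int num_bits) {
  size_t word = std::rand() % memory.size();
  int base = std::rand() % 64;
  for (int i = 0; i < num_bits; ++i) {
    int bit = (base + i * 5) % 64;         // distinct bit positions
    memory[word] ^= (uint64_t{1} << bit);  // flip one bit of the chosen word
  }
  std::cout << "injected " << num_bits << " bit flips into word " << word << "\n";
}

int main() {
  std::srand(static_cast<unsigned>(std::time(nullptr)));
  std::vector<uint64_t> victim(1024, 0);   // stand-in for victim memory rows
  InjectMultiBitError(victim, /*num_bits=*/3);
}
```
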
    Norming to Performing: Failure Analysis and Deployment Automation of Big Data Software Developed by Highly Iterative Models
    IEEE International Symposium on Software Reliability Engineering (ISSRE) (2014), pp. 144-155
    We observe many interesting failure characteristics in Big Data software developed and released using highly iterative development models (e.g., agile). ~16% of failures occur due to faults in software deployments (e.g., packaging and pushing to production). Our analysis shows that many such production outages are at least partially due to human errors rooted in the high frequency and complexity of software deployments. ~51% of the observed human errors (e.g., transcription, education, and communication error types) are avoidable through automation. We thus develop a fault-tolerant automation framework that makes it efficient to automate end-to-end software deployment procedures. We apply the framework to two Big Data products. Our case studies show the complexity of the deployment procedures of multi-homed Big Data applications and help us study the effectiveness of the validation and verification techniques for user-provided automation programs. We analyze the production failures of the two products again after the automation. Our experimental data shows how the automation and the associated procedure improvements reduce the deployment faults and the overall failure rate, and improve the feature launch velocity. Automation facilitates more formal, procedure-driven software engineering practices, which not only reduce the manual work and the human-caused, avoidable production outages but also help engineers better understand the overall software engineering procedures, making them more auditable, predictable, reliable, and efficient. We discuss two novel metrics to evaluate progress in mitigating human errors, and the conditions that indicate when to start such a transition away from owner-driven deployment practices.
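
As an illustration of the validate/run/verify discipline such an automation framework can impose on each deployment step, here is a minimal C++ sketch; the Step structure and RunPipeline function are hypothetical names, not the framework's actual API.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Each deployment step is validated before execution and verified after, so
// that human error types such as transcription mistakes are caught mechanically.
struct Step {
  std::string name;
  std::function<bool()> validate;  // pre-check, e.g., the package exists
  std::function<bool()> run;       // the deployment action itself
  std::function<bool()> verify;    // post-check, e.g., the service is healthy
};

bool RunPipeline(const std::vector<Step>& steps) {
  for (const auto& s : steps) {
    if (!s.validate()) { std::cout << s.name << ": validation failed\n"; return false; }
    if (!s.run())      { std::cout << s.name << ": execution failed\n";  return false; }
    if (!s.verify())   { std::cout << s.name << ": verification failed\n"; return false; }
    std::cout << s.name << ": ok\n";
  }
  return true;
}

int main() {
  std::vector<Step> steps = {
      {"package", [] { return true; }, [] { return true; }, [] { return true; }},
      {"push",    [] { return true; }, [] { return true; }, [] { return true; }},
  };
  return RunPipeline(steps) ? 0 : 1;
}
```
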
    Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units
    IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2014), pp. 458-467
    In N-body programs, trajectories of simulated particles show chaotic patterns if errors are present in the initial conditions or occur during some computation steps. It was believed that the global properties (e.g., total energy) of simulated particles are unlikely to be affected by a small number of such errors. In this paper, we present a quantitative analysis of the impact of transient faults in GPU devices on a global property of simulated particles. We experimentally show that a single-bit error in non-control data can change the final total energy of a large-scale N-body program with ~2.1% probability. We also find that the corrupted total energy values have certain biases (e.g., the values do not follow a normal distribution), which can be used to reduce the expected number of re-executions. We also present a data error detection technique for N-body programs that utilizes two types of properties that hold in the simulated physical models. The presented technique and an existing redundancy-based technique together cover many data errors (e.g., >97.5%) with a small performance overhead (e.g., 2.3%).
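
The following C++ sketch illustrates the flavor of a physics-property-based check: recompute a global invariant (total energy in a 1-D toy model) and flag a step whose drift exceeds a tolerance as possible silent data corruption. The Body model, EnergyCheck threshold, and 1-D simplification are assumptions; the paper's detector uses more than this single invariant.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

struct Body { double mass, x, v; };  // 1-D toy particle

// Total energy = kinetic energy + pairwise gravitational potential energy.
double TotalEnergy(const std::vector<Body>& bodies) {
  double kinetic = 0.0, potential = 0.0;
  for (const auto& b : bodies) kinetic += 0.5 * b.mass * b.v * b.v;
  for (size_t i = 0; i < bodies.size(); ++i)
    for (size_t j = i + 1; j < bodies.size(); ++j)
      potential += -bodies[i].mass * bodies[j].mass /
                   (std::abs(bodies[i].x - bodies[j].x) + 1e-9);
  return kinetic + potential;
}

// Flag a step whose energy drift exceeds a relative tolerance.
bool EnergyCheck(double before, double after, double tolerance) {
  return std::abs(after - before) <= tolerance * std::abs(before);
}

int main() {
  std::vector<Body> bodies = {{1.0, 0.0, 0.1}, {2.0, 1.0, -0.05}};
  double e0 = TotalEnergy(bodies);
  bodies[1].v = 3.0;  // simulate a silent data corruption of one velocity
  double e1 = TotalEnergy(bodies);
  std::cout << (EnergyCheck(e0, e1, /*tolerance=*/0.05)
                    ? "energy within tolerance\n"
                    : "possible silent data corruption detected\n");
}
```
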
    HTAF: Hybrid Testing Automation Framework to Leverage Local and Global Computing Resources
    David Hreczany
    Ravishankar K. Iyer
    Lecture Notes in Computer Science, vol. 6784 (2011), pp. 479-494
    In web application development, testing forms an increasingly large portion of software engineering costs due to the growing complexity and short time-to-market of these applications. This paper presents a hybrid testing automation framework (HTAF) that can automate routine work in testing and releasing web software. Using this framework, an individual software engineer can describe routine software engineering tasks and schedule them efficiently across both a local machine and global cloud computers. The framework is applied to commercial web software development processes. Our industry practice shows four example cases where the hybrid, decentralized architecture of HTAF helps effectively manage both the hardware resources and the manpower required for testing and releasing web applications.
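
As a sketch of the local-versus-cloud scheduling decision described above, the C++ fragment below routes short single-machine jobs to the local machine and long or sharded jobs to cloud workers; the TestJob fields and thresholds are illustrative assumptions, not HTAF's actual interface.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Rough description of a test or release task that the framework schedules.
struct TestJob {
  std::string name;
  int estimated_minutes;  // rough runtime estimate
  int shards;             // how many machines it can use in parallel
};

// Toy policy: keep short, single-machine jobs local; send the rest to cloud.
std::string PickBackend(const TestJob& job) {
  if (job.estimated_minutes <= 10 && job.shards == 1) return "local";
  return "cloud";
}

int main() {
  std::vector<TestJob> jobs = {{"unit_tests", 5, 1},
                               {"browser_regression_suite", 120, 32}};
  for (const auto& j : jobs)
    std::cout << j.name << " -> " << PickBackend(j) << "\n";
}
```
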
    From Experiment to Design - Fault Characterization and Detection in Parallel Computer Systems Using Computational Accelerators
    Ph.D. Thesis, University of Illinois at Urbana-Champaign (2013)
    Pluggable Watchdog: Transparent Failure Detection for MPI Programs
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IEEE (2013), pp. 489-500
    A Fault-Tolerant, Programmable Voter for N-Modular Redundancy
    V. Sidea
    Z. Kalbarczyk
    Deming Chen
    Ravishankar K. Iyer
    In Proceedings of the IEEE Aerospace Conference, IEEE (2012)
    Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU
    Cuong Pham
    Mushfiq Saleheen
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    IPDPS (2011), pp. 287-300
    A Codesigned Fault Tolerance System for Heterogeneous Many-Core Processors
    Ravishankar K. Iyer
    IPDPS Workshops (2011), pp. 2053-2056
    Measurement-based analysis of fault and error sensitivities of dynamic memory
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    DSN (2010), pp. 431-436
    Quantitative Analysis of Long-Latency Failures in System Software
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    PRDC (2009), pp. 23-30
    A Software Reproduction of Virtual Memory for Deeply Embedded Systems
    Jae Don Lee
    Jungkeun Park
    Jeong-Joon Yoo
    Chaeseok Im
    Yeonseung Ryu
    Lecture Notes in Computer Science (2006), pp. 1000-1009
    CATA: A Garbage Collection Scheme for Flash Memory File Systems
    Long-zhe Han
    Yeonseung Ryu
    Lecture Notes in Computer Science, vol. 4159 (2006), pp. 103-112
    Operating System Support for Procedural Abstraction in Embedded Systems
    Jeong-Joon Yoo
    Jae Don Lee
    Jihong Kim
    RTCSA (2006), pp. 378-384
    A fast start-up technique for flash memory based computing systems
    Jihong Kim
    Kern Koh
    SAC (2005), pp. 843-849
    A Novel Memory Hierarchy for Flash Memory Based Storage Systems
    Journal of Semiconductor Technology and Science, vol. 5 (2005), pp. 69-76
    An Energy-Efficient Reliable Transport for Wireless Sensor Networks
    Jihong Kim
    Kern Koh
    Lecture Notes in Computer Science, vol. 3090 (2004), pp. 54-64
    Performance Analysis of On-Chip Cache and Main Memory Compression Systems for High-End Parallel Computers
    Jihong Kim
    Kern Koh
    PDPTA (2004), pp. 469-475
    A Space-Efficient On-Chip Compressed Cache Organization for High Performance Computing
    Jang-Soo Lee
    Jihong Kim
    Shin-Dug Kim
    Kern Koh
    Lecture Notes in Computer Science, vol. 3358 (2004), pp. 952-964
    An Energy-Efficient Routing and Reporting Scheme to Exploit Data Similarities in Wireless Sensor Networks
    Jihong Kim
    Kern Koh
    Lecture Notes in Computer Science, vol. 3207 (2004), pp. 515-527
    NIC-NET: A Host-Independent Network Solution for High-End Network Servers
    Hojung Cha
    Kern Koh
    Lecture Notes in Computer Science, vol. 3320 (2004), pp. 401-405
    A flash compression layer for SmartMedia card systems
    Hyokyung Bahn
    Kern Koh
    IEEE Trans. Consumer Electronics, vol. 50 (2004), pp. 192-197
    A Compressed Page Management Scheme for NAND-Type Flash Memory
    Kern Koh
    Hyokyung Bahn
    VLSI (2003), pp. 266-271