EVALUATION
The primary contribution is the recognition of the importance of error recovery in practice, together with the fact that the authors actually created a working system, demonstrating that such a thing is possible. I wouldn't have believed that one would be able to isolate the interface between the kernel and the extensions so well. The approach is practical and targets the right domain. The scheme also provides a great deal of flexibility in that, once a well-defined interface is established, nearly any access policy desired can be used. The entire foundation rests upon the beneficence of the imported modules and the assumption that there are no unpredictable accesses to the kernel. The authors readily admit this, but there must be some times when accesses do not follow the interfaces.
----------------------------------------------------------------------
Overall, I think that the work is very clear and well presented. The related work section was very helpful for understanding Nooks and comparing it with other research. The experimental environments were reasonably described, together with their limitations. The system seems to be a great alternative to current systems, without incurring a high cost in development and/or implementation. The price of obtaining this compatibility and transparency is performance and the lack of robust fault tolerance. On the negative side, the way errors are introduced in the experiments seems too artificial; some results from a real environment would be helpful. Also, the high rate of recovery (99%) is a little overstated, since it represents only part of the experiments, which are, on top of that, synthetic. The performance is good for the kind of device drivers that the authors used, but poor for other extensions. For better results in performance and fault tolerance, I think that a more aggressive approach is needed; this in turn would probably imply a deeper change in the OS architecture and its extensions.
----------------------------------------------------------------------
Any work that can transparently retrofit a commodity system with new features deserves high marks, IMHO, even if there are some warts. OK, so their approach doesn't address deliberate malicious behavior, but that wasn't their goal. Also, they had to make some modifications to the kernel, so perhaps it isn't something you can plug right into your existing GNU/Linux system, but it's close. I reached the limit of my OS and Intel knowledge trying to figure out how they implemented their lightweight protection domains. Is everything in the same virtual address space, just divided into multiple segments, or does each piece of the kernel now share only part of its VM address space? (There is one VM space, but the write protection of pages changes as calls are made into and out of managed extensions.)
----------------------------------------------------------------------
The approach is technically sound, requires relatively few changes to the OS itself, allows mostly unmodified extensions to be loaded, and does not require adaptation by the extension developer. The emphasis on certain types of recovery is also a win, since the whole kernel may not need to be rebooted. There are also weaknesses in the approach, not the least of which is the *60%* performance hit on a real-world application like the kernel httpd server!! XPC calls (that is, changes of control between the kernel proper and the extension) are a real problem for Nooks. Other difficulties are the inability to control malicious faults and the fundamental nature of Linux's handling of interrupt exceptions.
----------------------------------------------------------------------
It is great that it is backward compatible with existing operating systems and supports C-language extensions. Such strengths may make it popular.
Nevertheless, the two core principles "Design for fault resistance, not fault tolerance" and "Design for mistakes, not abuse" may not be enough. If it cannot protect the system well, it may introduce security holes into the system.
----------------------------------------------------------------------
This system improves reliability in an extremely practical way. The authors have already identified several real bugs with Nooks. I find their use of fault insertion to support their claims to be dubious. It's not clear from the paper whether faults generated by fault insertion reflect real-world errors. They claim that their system catches 99% of errors, but it seems to me that most errors generated by fault insertion wouldn't actually show up in real systems with the same frequency. Furthermore, there may be certain classes of errors that exist in the real world that fault insertion may be unable to create, or might create only very rarely. It is my opinion that fault insertion can reflect the ability of a system to catch errors, but it does not reflect the kinds of errors it will find in practice. Also, it doesn't catch deadlocks. In my experience, kernel faults happen for two reasons: hardware faults, which Nooks can't do much about, or deadlocks in third-party systems (such as IBM DFS on Sun Solaris).
----------------------------------------------------------------------
Their concentration is on driver faults, and they motivate their arguments well with data illustrating that drivers account for a large fraction of the OS code and that it is difficult to test driver variations. Overall the scheme would seem to work well for the targeted situation. But there are a lot of caveats. For instance, in my opinion, most network driver faults are caused by the inherently unspecified nature of the input to the driver. The drivers are not usually written to handle every kind of malformed packet. Memory issues in such drivers are rarely the problem.
And so their claim that it is reasonable to ignore such faults is unjustified. Moreover, I do not fully understand how they can claim that restarting the driver as and when required will put the system in a sane state. The driver would have interacted with other components and would have caused system state changes all over the place. It is unclear to me what recovery would be achieved. Another point of concern is the performance overhead. In some places it seems that Nooks will result in overhead that is completely unacceptable. Given that I do not fully agree with their test fault cases, it would seem that this is too high a price to pay for cooked-up faults.
----------------------------------------------------------------------
RESEARCH
I think the crux of their idea is the interface between the two modules. Thinking about ways to construct nice interfaces that are theoretically sound and usable seems like the most interesting thing. Usable interfaces between untrusted components that have stronger security properties would be quite desirable.
----------------------------------------------------------------------
The whole project is attractive for its size and clear definition and specifications. One area where more research might be done is reducing the performance penalty paid when using XPC. So some research can be done either on improving XPC itself or on the way the kernel and extensions make use of it.
----------------------------------------------------------------------
I think this paper suggests two interesting avenues for future work: First, their isolation functionality is neat - how strong is it, really? How close can you get to implementing a reference monitor with it? Try making an LSM-based execution monitor into an RM using their method, if you can. Second, their repair functionality is interesting. How good is it, exactly?
It'd be interesting to combine it with the "crash-only software"/microreboot approach to repair - see if you can architect your extensions to be easily repairable. I've also read about a technique for having a generic dummy driver temporarily stand in for a real driver that's being rebooted/repaired. It queues up requests and then replays them for the replacement real driver once it's ready. It might be interesting to throw this technique into the mix, too.
----------------------------------------------------------------------
There are a number of obvious extensions. First, trying to re-architect or optimize the system to reduce XPC overhead would be a major advance. Second, extending the approach to enforce more policies than just protection against accidental errors would be a huge win. Also, one could investigate how to combine the unsafe, but (as the authors claim) well-written OS core with a type-safe set of extensions that could enforce stronger properties on kernel modules. This would represent a hybrid of Nooks and SPIN. Finally, continuing the study of how to recover from certain failures more automatically (and with less impact) would be a contribution.
----------------------------------------------------------------------
We may do research on designing a "fault tolerance" system, rather than a "fault resistance" one. (Mike: how exactly?)
----------------------------------------------------------------------
I believe that Nooks is capable of being a practical tool. However, I would like to see it proven against older kernels and kernel extensions with known problems. I also wonder how hard it would be to implement Nooks for other operating systems in common use, such as MacOS X, Solaris, and Windows XP.
----------------------------------------------------------------------
The authors do state that their scheme is not intended to provide full reliability. And they make an attempt at partial recovery.
But given the fact that they have successfully been able to inject a layer between the extension and the kernel, it seems like an ideal place to implement tracing services. Somewhat orthogonal to the current goals, but it would be nice to explore the case of implementing IDS systems using their mechanism.
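
To make the tracing suggestion above concrete, here is a minimal user-space sketch in Python of what an interposition layer might look like. It is not Nooks code and says nothing about how Nooks implements its wrappers. The names `TracingLayer`, `alloc_buffer`, and `send_packet` are hypothetical stand-ins invented for illustration. The only idea it shows is that once every call from an extension passes through one layer, that layer can record each call and its outcome.

```python
import functools


class TracingLayer:
    """Hypothetical interposition layer between an 'extension' and a 'kernel' API."""

    def __init__(self):
        # Each entry is (function name, positional args, result or error marker).
        self.log = []

    def wrap(self, name, fn):
        """Return a version of fn that records every call made through it."""
        @functools.wraps(fn)
        def traced(*args, **kwargs):
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                self.log.append((name, args, "raised " + type(exc).__name__))
                raise
            self.log.append((name, args, result))
            return result
        return traced

    def export(self, kernel_api):
        """Wrap a whole table of functions; the extension only sees the wrappers."""
        return {name: self.wrap(name, fn) for name, fn in kernel_api.items()}


# Toy stand-ins for kernel services (assumptions, not real kernel functions).
def alloc_buffer(size):
    if size <= 0:
        raise ValueError("size must be positive")
    return size


def send_packet(data):
    return len(data)
```

An extension would be handed `TracingLayer().export({...})` instead of the raw function table. An IDS-style monitor could then inspect `log` for suspicious call patterns, though whether that is practical at kernel speed is exactly the open question raised above.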