CMSC 412: operating systems Instructor: Neil Spring TA: Aaron Schulman style: - whiteboard using emacs terminal - loosely set schedule - notes are intended to supplement. - cell phone quizzes - but not today - programming assignments are mandatory - fall behind, - late policy: one weekend, once. Note to neil = update syllabus for 7; check homework. = copy out the questions. your responsibilities: - slow me down. it is not your fault if I lose you. - read the textbook. edition 7 it shall be. - start assignments early. - ask Aaron or me for help. - there exists a forum. (link on web page) - there exists a web page. project startup. - grades.cs.umd.edu - csic linuxlab account. cs412078 and password. - ssh linuxlab.csic.umd.edu or linuxlab.cs.umd.edu - submit.cs.umd.edu - upload our finished code. - further instructions in project0 handout. (hey, that's online too). unauthorized collaboration - don't share code - don't print it. - don't mail it. - don't show it. - use xlock. (locking screensaver) - assignments: wikipedia is legal, sometimes flat wrong. sometimes wrong in the context of the class, often doesn't answer the question I asked. textbook things. 7th edition. ignore value judgements. "rich features" ignore windows sections. you're welcome to read them. summary sections at the end of each chapter. - reread these rather than highlight in the text. skim the first chapters. What is an Operating System? what does it do? piece of software between your code, application and the hardware. "arbitrary" hardware -- lots of devices, lots of implementations of those devices -- often just one or two processor architectures. devices: their own interfaces: - serial port likely to give, take one character at a time, - keyboard likely to give one character at a time. - printer's parallel port (/neil wonders about the difference) - video card (os treats this as a resource) - network adapter: probably gives us one packet at a time. - SATA... likely to provide a block from disk at a time. "serial ata" is the common hard disk interface as of 2005-ish. SCSI, IDE, other disk devices fit in here. - tape drive. three ways of getting at the info from these devices. - programmed i/o : write or read from an I/O port special CPU instructions for talking to these ports. - interrupt-driven i/o : devices can send a little signal to trigger the cpu (operating system) to run a routine that will read the data (or know to write more data) - direct memory access (DMA): attempt to avoid bugging the CPU with all the movement of data. resource allocation and sharing: resources include memory, cpu time, can enforce quotas on disk space. provides an illusion of an abstract machine to applications files, folders from file systems it manages. dedicated cpu by isolating processes from each other illusion that memory is infinite and linear through virtual memory that some devices that take characters at a time can take buffers or lines (write(serial_device_file_descriptor, "foo", 3);) provides mechanisms for processes to talk to each other. shared memory, pipes, locks (semaphores) priorities (some processes are more important than others) printf -> write explanation.. Application <- your code. Libraries <- make your code easy to write.implement a bunch of functions. ------- <- user space to kernel space boundary Kernel <- can munge devices, processes. Hardware <- CE people. that's you. :) "system call" is a function that traverses this user/kernel boundary. we'd like to minimize the number of these. rather implement all the important stuff in user space. printf is not a system call. - standard C library, pulls in the format string, interprets all those bytes, puts the stuff to write in a buffer. - then calls write() write is. - write takes a file descriptor (integer index into the list of open files) - takes a buffer (pointer to a region of memory) - and the length of the buffer. - copy the data into kernel space and send it where it belongs. How to tell the difference between a system call and a standard C library call. man pages are your friends. why is studying operating systems interesting. - for every application you write, you're going to rely on some operating system. and if you understand what it's doing, and how to instrument it, you can likely fix your application - if you mess with hardware, you might have to extend them. - with network protocols, - if you mess with hardware, you might have to do without an OS. - I hope you will appreciate the OS, despite the BSOD. - OS is the most important piece of software. - if it's buggy, you notice. - if it's not running, you notice. - It has power to do anything. - - Lots of choices: - "separation of policy and mechanism" - decisions to be made, - which disk blocks to read first? - which processes to schedule first? - Lots of tricky problems: - deadlock (processes waiting on each other) - how to use more than one processor. ------------ Lecture 2: 9/4 Homework 1 (revised) posted. on the web page. a) clean up general introduction stuff (virtual machine, microkernel) b) processes (chapter 3) What is a microkernel? - veryvery small kernel. - incorporate into the kernel only what is absolutely necessary - move functions into user-space, processes. - managing filesystem (flushing buffers) - network communication protocols - opposite: monolithic kernel - one big piece of software that has all the features Why would a microkernel be a good idea? - limited memory -- kernel memory tends to be wired down, immovable, can't swap it. - security -- small bits of code are easier to write, debug, and secure. less likely to have a vulnerability, scope of a vulnerability might be diminished. - reliability -- could restart the bits that fail, if they do. Why is it tricky, - can be slower, higher overhead, - cross that user / kernel boundary. -> could involve lots of copies. -> copying data is pretty wasteful - move data from one adress space to another. -> "message passing" Kernel modules - (?) can still write small pieces of code, - can still embed it into a larger kernel - can still expect some interfaces. What is an interrupt? - message, notification from a device or software that asks the operating system to do some work. (to handle the event) - device has stuff to be read, sends an interrupt, os runs some service routine in a device driver. - timer - while the processor is doing other stuff, there's a timer sitting on the side, and when the timer expires, interrupt! - useful for preemption - software interrupt - each interrupt (hardware, software) looks the same to the OS. - system calls. dos used int 13h, linux at one time used 0x80. - trap Batch processing. - pick up a job, run it to completion, then pick the next one. - programs encoded as stacks of punched cards. - programs are running on clusters of machines. Timesharing system: - pick up a job, if it doesn't finish in a quantum, give the next one a try - if the process yields the cpu by blocking on a device before its quantum is over, the OS would hand control to the next ready process Real-Time systems - every task has a deadline. - earliest deadline first is an ideal scheduling scheme - prioritize tasks that have the longest duty cycle. Embedded systems - we're supposed to not really notice that they're there. - run opengles :) - task specific (not general purpose) - controlled hardware (not a lot of drivers) - no (very few) hard disks - tend not to have an mmu - tend to run on very little power, in less space, smaller chips, fewer chips. - OS would have to be tiny, but needs to do less. What is a file? - named array of bytes, with metadata. likely stored somewhere - could include things that look like files: - device files - named pipes - /dev/kmem (and /dev/mem ?) where the array of bytes is the address space of the kernel (or of the machine) - /proc where the directories represent processes and the files represent a means of getting information about those processes. ls -al /proc What is a shell? - (just another) process that happens to help users run other processes. - connecting stdin, stdout, stderr to the terminal, - job control to suspend, background processes Process time. What is a process? - program - the passive bits. - has an address space - has at least one context (instruction pointer, registers, stack pointer) - resources (might be files, devices, shared memory segments) Process state machine: - in what states can a process be - Running (we're on a processor! hooray.) - Ready (wish we were running.) - Waiting (aka blocked) doesn't want to go to state running. - for some device, user, network, application to provide data. - Zombie - dead, not yet reaped. - preserved is the exit status of the process. - expecting the parent to call wait() (waitpid()). - new state -> ready - terminated <- zombie Process description is bundled up in a PCB - bad PCBs polychlorinated biphenyls. - good PCBs are process control blocks. - ip (instruction pointer) (program counter) - registers - scheduling history, priority - memory management info - process accounting info (how much cpu time each process has used) - on which device we're blocked (though that might be by what queue we're on) - includes pid, parent pid, all the other stuff that's listed by ps ------------ Lecture 3: 9/9 - Gimme homework 1 - Homework two due 9/16 - chapter 3. how does the processor wake the operating system up periodically to maybe context switch threads/processes or handle other periodic tasks. 8253 programmable timer chip. 1.19 mhz clock, and it's told by the processor how high to count. once that overflows it raises the interrupt request line "periodic" so it resets itself. 8259 programmable interrupt chip. 8 input wires "interrupt request" lines that would be raised whenever a device (in this case the the 8253) has something to say. 1 of the outputs is the interrupt line to the processor data bus to tell the processor which interrupt it tried to raise. - apparently, these chips can be cascaded. "master" 8259 and some slaves. 8086...386...etc. gets the interrupt, asks which one. 8253 wired up to interrupt request zero. Still in the land of processes. "context switch" context: set of register values. reference to the program data. switch: to achieve timeslicing or to get another process on the processors. step -1: you're in the operating system somewhere. you should've saved at least some application registers already. "Save_Registers in lowlevel.asm" else you don't get to use them. step 0: pick another process/thread to run ('dispatch'). better be in the ready queue. (otherwise it's blocked, done, or not "another") by some scheduling "policy" - the next one. (round robin) - the best one. (most important) put the running guy on the * and the new guy in state running. "in state running -> referenced by "current" pointer" * if we got here by the timer interrupt, back to the ready queue. * if we got here by a blocking system call, onto a device queue. take note of some stats: - accounting "stuff" - how much cpu time the process has used. - if the application used up the time slice, - might be frozen... - it probably is a background job and not that interactive, or important. don't have to schedule it so quickly. step 1: save the state. registers (incl program counter) [[ kill off other instructions in the pipeline ]] step 2: restore the new thread/process state. "processor affinity" - if you have a process running on a processor, might perform better if it stays there and isn't moved. - you might have a per-processor ready queue. if you're running the OS on a bunch of processors, there are locks. Scheduling basics: I/O-bound jobs: will run for a very little bit, then block on a system call (read, write) they use very little cpu and so you can schedule them quickly, and the sooner they start blocking, the sooner they might finish. CPU-bound jobs these guys use lots of cpu. they're probably not waiting for user input. they can probably wait. finish later. goofy science app. video transcoding, image processing. a goal is to mix these: i/o bound jobs don't contend with cpu-bound jobs. another goal is to make the machine seem fast. - prioritize things that the user cares about, that are interactive, that don't use a lot of cpu. a process that uses up its entire quantum might take a penalty to its priority. quantum: 10 ms (or so) timeslice allocated to a process when it gets scheduled. if you use it up: bump down your priority, leaving space for the I/O-bound jobs. What's thrashing? - a way of not getting any actual work done. - spend all of the machine's time paging. - two processes running, each require 2/3rds of memory to make real progress. - by "require 2/3rds" likely to touch pages in this memory footprint frequently enough To fix: (a) might increase the quantum #define HZ in linux (I think) give each process a longer time to make progress before getting kicked out. - might not fix it. - would likely make the machine less responsive. (a1) could imagine dynamically altering the quantum. (b) don't run java. or use java -Xmx option. (c) could kill stuff. - negative progress. - also don't get any work done. (d) might try to alter priority to give one guy an edge. (seems like (a)) (e) "medium-term scheduling" or "swapping" - push the second app to the side, and let the first one through. (specific case of d, in which priority of the second process is... "forbidden") probably seconds. short-term scheduling: the stuff we talked about before: which of those processes on the ready queue do we pick. long-term scheduling: which of the jobs we have to run will we start? (context: batch processing) might not run stuff until something else has finished. - might not start a process until load is below a threshold. - scripts on linux in /etc/cron.daily each of these jobs is run one after the other. Distinction: dispatcher vs. scheduler - dispatcher - the piece of code that effects the switch, puts another process onto the processor - scheduler - intelligence that decides which processes will get dispatched. Operations on processes: 1. creation kernel somehow has been given control of the processor. kernel decides to start the first user process: init what is the job of init. - reap the zombies with dead processes. - no user command. - init starts everything else. - basic processes associated with the console. - runs a script that will start the daemons. daemon: process with no interactive console that performs some service in response to some requests. examples: ntpd - network time protocol daemon httpd - blah blah inetd - tries to start network daemons on demand echod - hello. hello. sshd - now you can login. memcached - job is to cache information across processes how does init start another process? fork(); 2. deletion Once upon a time, people valued ease of writing software. system("/bin/ls"); system("make"); system("./a.out"); system is not a system call. how is it implemented? int system(const char *command) { pid_t pid = fork(); if(pid == 0) { // the child char *arguments[4]; arguments[0] = "sh"; arguments[1] = "-c"; arguments[2] = command; arguments[3] = NULL; execv("/bin/sh", arguments); // could fail. exit(EXIT_FAILURE); } else if(pid > 0) { // the parent int exit_status; waitpid(child, &exit_status, 0); return exit_status; } else { // damn. return -1; } } Lecture 3: 9/11 --- Homework issues. 1) don't make incorrect generalizations. we don't like giving credit to answers that have falsehoods. 2) just answer the qustion. if you want to show off that you have clue, being confident that you can answer the question concisely goes a long way. 3) please type up your responses. for the love of god, please type up your responses. (poor Aaron. it's not his fault I assigned all this.) C issues. 1) declarations and definitions. 2) nm (i386-elf-nm) Little process cleanup How do processes die? old age - they finish. whatever they wanted to do gets done: call exit(). they behave badly and are terminated. - unhandled exception thing: division by zero. illegal memory access. non-existent/illegal instruction. catastrophic failure: av williams power fails... shutdown: signal might get delivered. - signal not handled: SIGHUP SIGKILL SIGINT (depending on what the app decided to handle) - handle the signal, exit gracefully as if it's done. open question: kernel or shutdown command that gives the warning. Interprocess communication. isolate processes from each other through having different address spaces. isolate processes from the kernel through the user/kernel boundary. yet, we'd like to be able to do interesting things. why we want to be able to share data across processes: 1) example: might want to display images/stuff from one application/process in another (windowserver) process. 2) example: man shmat | more -- output of one process wired to the input of another. 3) example: notification (growl on OSX), logging (syslogd) - to provide a centralized and general error logging facility, many process can write their log messages to a daemon, that then can log those messages to disk or to a central server. 4) example: pool of processes coordiating with each other - qmail(?); apache maybe keeps cached documents in shared memory (?) ; why would we use inter-process communication rather than threads? (which share an address space, have inter-thread communication for (almost) free) 1) isolation: if something breaks, one process dies, (perhaps) no biggie. 2) reuse: shell bits (more/les), syslogd.; combine code/function written by different people. 3) might want some kernel per-process state to be different in the different pieces. - locks - different set of file descriptors: = select() - any time you call it, the kernel has to run through the whole damn list. num_of_fds, bitmask, bitmask, bitmask, time. = kernel might be happier if your many processes are calling this on fewer file descriptors per. - uid, effective uid, etc. <<<<<<<<<< important. - some task running on behalf of the user, and another task running for the system. - xserver, xterm. - syslogd, apache (www-data / nobody). 4) you might be able to page out a process more easily? if, for example.... find . -name "neils_hidden_file.*" | more it might be that "more" has nothing to do for a long time, and doesn't need memory. Baby interprocess communication. You could write to a file, let the other process read from file. - could be tricky to get the other guy to read the file... in - could be really slow - esp. if you have to wait for the disk or for a free buffer. Grown-up interprocess communication. Shared memory. placement of the same segment (set of pages) into more than one address space. - not necessarily at the same address in both address spaces. shmget <- construct / allocate / find the segment, implicitly creating if not existing. shmat <- "attach" / place the segment in this process's memory shmdt <- "detach" / un-place it. shmctl <- "control" / other. how do we pass data from process one (P1) to P2 with shared memory? both processes invoke shmget, shmat on the same segment. P1 writes some data in P2 can read it. how do we know in P1 if P2 has read the stuff (and thus is ready to take more data?) - separate the pieces of shared memory that will be written by P1 from the stuff to be written by P2. - P2 to write a sequence number, or a "pointer" to indicate that it has read the stuff. Message Passing send, receive messages. send to a group / wildcard. receive from an individual or from a group. synchronous or asynchronous send call waits until someone has received. microkernel architectures require fast message passing. "remote" procedure call within message passing. Pipes. two types of pipes: 1) "ordinary" not named. 2) named. Implement popen(" ") - library call that sets up one or two ordinary pipes between the calling process and newly forked process int popen_w(const char *command) { int fds[2]; pid_t pid; pipe(fds); pid = fork(); if(pid == 0) { // the child // has a list of file descriptors. // 0: stdin // 1: stdout // 2: stderr // want 0 to be the read end of the pipe. close(0); // optional dup2(fds[0], 0); // place the read end of the pipe onto stdin. char *arguments[4]; arguments[0] = "sh"; arguments[1] = "-c"; arguments[2] = command; arguments[3] = NULL; execv("/bin/sh", arguments); // could fail. exit(EXIT_FAILURE); } else if(pid > 0) { // the parent // int exit_status; // waitpid(child, &exit_status, 0); return fds[1]; // exit_status; } else { // damn. return -1; } } int fd = popen_w("less"); write(fd, "hello", 5); close(fd) wait(&exit_status); Exercise for the reader: figure out how popen works for bidirectional communication with the new process. how do the two pipes constructed from two calls to pipe(2) get associated with a single FILE *. What happens if the child doesn't read data? - if we write a lot of data to the child, we will block! bad. - there's some buffer, maybe 64k that the OS will hold on to, after which it assumes we should wait. - child process has to compress/encrypt data, it might be slower than the producer. What happens if the child dies before we're done writing data - sigpipe when we write more data. -- Named pipes allow the process to define a place in the file system where another process can find the pipe. - you don't have this parent/child relationship to the other side of the pipe. - you do have a one-to-one binding: at a time, there is one "client" and one "server" - you do not have many processes that want to talk to one. - not for syslog. - not for notification. -- Lecture 5 -- homework 2 due homework 3 out back to inter-process communication Review: (motivation for communicating between processes) Shared memory Message Passing Pipes: named and erm, unnamed? ordinary pipes. Sockets. Two types of sockets worth talking about - UNIX domain sockets. endpoint of communication is identified as a path in the file system a) the other endpoint of communication is on the same machine. b) can protect that endpoint with file system permissions c) each endpoint can ask the OS what user is associated with the other endpoint - use Internet protocol sockets. <- endpoint of the communication identified by IP address and port. this scheme can talk across machines -- to processes running on other boxes. bind() connect() read write etc. For both of these types of sockets, there are two models of communication (not being precise) datagram stream socket(AF_INET, SOCK_STREAM, 0); // -> TCP - read and write ordered streams of bytes. socket(AF_INET, SOCK_DGRAM, 0); // -> UDP - read and write (unordered) datagrams UNIX domain sockets allow similar interfaces: write a stream or write messages/datagrams. Can even pass a file handle/describe to the other process. specially formatted call to sendmsg (datagram-style call) - can give a process access to a file it wouldn't otherwise be able to see. (most useful?) - some initial setup in one process, followed by further work in another. Why choose UNIX sockets over IP sockets? - access control, ability to log users, etc. - may be more efficient - (could imagine using IP with the other side as localhost 127.0.0.1) MTU (when the data doesn't have to go over a wire) can be very large -- maximum transmission unit -- limit on the size of a datagram. IP limits you to 64k; - interesting question: whether unix datagram sockets will block rather than drop messages. Why IP over UNIX sockets? - want the ability to talk to different machines. - might run for now your database on the same machine as the application server as the web server, but later (after wild success and IPO) you might want to run these other processes on other machines. unanswered distinction between message passing (as defined in text, used in microkernels) and datagram sockets. - are unix datagram sockets connected? ("man 4 unix" might answer this...) - if not "connected" then, like a UDP endpoint, can receive from many senders. - you can connect a UDP socket. - (if connected, can't receive from anyone else, at least for udp) socket bind listen accept connect read write close shutdown socket - returns a file descriptor associated with some protocol endpoint. socket(AF_INET, SOCK_STREAM, 0); // -> TCP - read and write ordered streams of bytes. socket(AF_INET, SOCK_DGRAM, 0); // -> UDP - read and write (unordered) datagrams socket(AF_UNIX, SOCK_DGRAM, 0); kinda like "open" bind - takes that socket and associates it with some "name" or "address" or "port" some identifier for others to be able to contact this socket. might be that you'd call bind() without providing it any information, in order to get the kernel to assign it. in IP sockets: struct sockaddr_in { sin_len = ? sin_family = AF_INET; struct s_addr ipaddress; unsigned short port; // filler. } in unix sockets: struct sockaddr_un { u_char sun_len; // 106-ish? u_char sun_family; // AF_UNIX char sun_path[104]; // some path in the filesystem. not reused. }; "path" exists in the virtual file system, not actually persistently stored. (I could be wrong.) listen - indicates that we'd like to accept incoming connections, with some backlog queue. accept - given a file descriptor that is listening (and likely bind-ed), return a new file descriptor that references the new incoming connection. connect - given a socket and a remote endpoint, start up the connection on that socket. for datagrams, creates an implicit destination for all messages (don't have to say where the datagram is going on every call) for streams, well, now you have a stream. trick: one way of figuring out your own IP address: create a udp socket and "connect" it to some destination, then ask the kernel using "getsockname()" what the local endpoint was bound to. read, write - typically for streams send, recv, sendmsg, recvmsg - typically for datagrams close - ends the connection, no more socket. shutdown - allows you to end halves of the connection. takes flags: SHUT_WR - I will do no more writing. SHUT_RD - I will do no more reading. SHUT_RDWR - I will do no more reading or writing. (like close, except file handle not reclaimed?) nearly ends discussion of processes. what's the difference between a process and a virtual machine? - what's the differences in communicating between processes running on an OS and between virtual machines (vmware for now) running entire operating systems and their applications? - what's the interface between the process and the layer below and that sort of VM and the layer below? what's the difference between a process and a thread? - how does communication between threads differ from communication between processes? - threads share their address space (except their stacks) "free" shared memory. - processes have to explicitly set up shared communication. - processes can be run as (on behalf of) different users, with different permissions Why use threads? (i.e., pthread_create() instead of fork() ?) - switching between user-level threads may be faster than switching between kernel-level tasks (like processes and like kernel threads) sharing the address space might mean less paging when context switching. -- Lecture 6. homework notes on #2. threads: why, how done, user-level, scheduler activations, clone. I graded that. Posted on Grades.cs; Aaron will distribute them. Q1: 0x90. Q2: F -> flags, -100 -> high, N -> nice, [] -> kernel thread, no command line arguments, process swapped out. Q2e: one per processor. pinned on a processor. Q3: C Q4: Yes; timestamp on retransmissions is not changed. If there's a yes or no question in the homework, write yes or no, in the first sentence, or "it does". Or if you feel as if complete sentences are required, answer the question in the first sentence. Structure. Topic sentences. If you find yourself writing "the point is" in a sentence, it means that you didn't organize your thoughts enough to put "the point" in its rightful place in a paragraph. Q5: frequency -> amount of time you can spend doing each. domain of responsibility -> long term starts jobs. medium term evicts jobs for memory reasons short term multiplexes processes on a cpu. If there's a question, explain the differences between A, B, and C, I don't expect you to describe A and C, then in a final paragraph describe B. It looks a lot like an attempt to find the right paragraph in the text, paraphrase it, and then call that an answer to the question from the book. Don't waste our time. Question 5 largely graded on whether it appeared that you brought something to the question. If in ACB order and factual statements without focusing on a distinction, 1/2 credit. If short and clear, even if not insightful, full credit. Q6: save state; restore state. Q7: 8 Q8: a) parent/child relationship, get the output of a program, provide input to a known program, "ls | more" b) named pipe: no parent/child relationship, want to have a server of sorts collect input from other processes. (likely: can fake out a process that thinks it's writing to a file.) Q9: 5 --- Threads: Why threads: Instead of processes: - switching between user-level threads may be faster than switching between processes. - might be faster to switch between kernel-level threads. - cheaper for the kernel to manage a single process full of user-level threads than to manage many independent processes - can kill em all at once... (because you can't kill them individually) - as a programmer, easier to share memory (don't need that shared memory segment, just use your globals.), easier to share file descriptors, - could, if at user level, decide priority of threads. - potentially easier to spawn a thread than to fork a process. -- in windows api, can't just fork, have to create a process with an executable and permissions. -- hooray cygwin for kludging a workaround. Instead of no threads: - deadlock debate. - without threads, you have to be careful on an potentially blocking call. - with threads, you have to manage shared data, potentially with locks, might run into the four conditions of deadlock. - multiprocessors. - application that mixes io-bound and cpu-bound operations and could interleave the two. (objection: one could encode cpu operations in an event-loop-structured application) - can provide threads with independent responsibilities: - deadlock detections - error handling - user interaction stuff How to divide your application into threads: divide the tasks: one thread per role divide the data: image to process, split the image across the different threads. - keep the threads in balance (don't want one thread doing all the work) - avoid dependencies between threads (it does us little good to have a thread active but blocked) limits parallelism How do you make a user-level thread? What does user-level thread #1 have that user-level thread #2 does not? (restated, that thread #2 doesn't share, has a different one...) each user-level thread must have a - stack. - registers. context. We get no help from the kernel. how do we create a "new" stack? We have two functions at our disposal context can be saved using setjmp(). - yes, potentially more than once. context can be restored using longjmp(). - no. instead resumes from where the corresponding setjmp is located. To make a new stack, and thereby a new thread. setjmp call a function that uses 16k of automatic (local) variable space. recurses our "thread spawing" function again. longjmp back to the setjmp call, start our thread there. 2 tricks: 1) thread just started (higher numerically stack space) can't spawn a new thread in the same way, it would clobber later threads. instead only the thread at the bottom of the stack can spawn new threads. 2) threads can't "return", lest they mess up the stack of another thread. instead, have to longjmp to a different thread. 3) if the thread "exits", you have to put it on a free list so that it can take the next "spawned" thread. unanswered: why not malloc? (maybe you can?) Also have to make sure that otherwise blocking calls aren't. i/o calls have to be wrapped so that it is: { set file descriptor to non-blocking mode do the read or write. if it worked, golden (set the fd back to block; then return) if EWOULDBLOCK, manage sticking the thread on a waiting queue, yield to the next thread. (could, if nothing runnable, call select on all the waiting descriptors) } could also: { select( only this file descriptor, in the mode that matters ) if ready, do and return, if not ready, stick on the waiting queue. } Many convolved issues: 1) avoiding spinning. 2) avoiding starvation. 3) avoiding blocking. (as previously described.) POSIX threads: there exists an example in the text pthread_attr_init pthread_create <- given a function pointer pthread_join --- Lecture 7 -- hw 3 in. hw 4 out! :) hw3, question 1: next time, please don't use while(1) in the body of your 300+ threads. please use sleep(10) or so, instead. One more cleanup thing. strace -> mmap, mprotect, clone; mmap makes clear that each thread has it's own happy little stack in the single shared address space. two underlying system calls for malloc: - brk / sbrk() - which expands the heap, useful for small allocations. - mmap() - which gets some new pages somewhere near where shared libraries are loaded, or off the bottom of where the stack is allowed to go. Book describes the following models for user-threads matched to kernel-threads. - one-to-one: not really a user-level thread idea. each thread is a kernel thread. - many-to-one: not such a kernel-thread idea, all the threads are user-level threads. - many-to-many: some group of kernel-threads exchange the user-level threads, almost as if the user-level threads were run on many virutal processors. Thread pool - "free list" for threads. - x=malloc(48), free(x) - thread creation may be similarly expensive. - don't fork a new thread on every request, only to have it exit when the request is done. - carry around a pool of ready to go threads, each blocked on new work. - can be somewhat more advantageous if you're dealing with whole processes. (not called a thread pool) - can control the number of concurrent threads explicitly. Scheduler activations - worldview: user-level threads are cheap. kernel-level threads are necessary for performance out of MP systems. - kernel will "cooperate" with the user-level thread scheduler. provides a signal that another user-level thread can be given control. * processor allocation is the job of the kernel. * thread scheduling is done by the application. (advantage of user-level threads) * kernel notifies the application whenever there's something that affects the address space. * application tells the kernel when there's something important -- when there are more or fewer threads demanding processors. CPU scheduling. Goals in CPU scheduling: 1) utilization should be high. (not while(1) inside your thread high utilization.. want useful work) 2) illusion of responsiveness - short jobs should finish quickly (anything that's already going to take a long time doesn't need to be prioritized.) - partial results should arrive quickly. * try to make this analytical, provide some formula about minimizing waiting time or wall time over excution time, 3) fairness, no starvation Schemes: FCFS: first come, first served. if you have no preemption, and don't know how long any task will take, FCFS rules. SJF: shortest job first. suddenly we know how long things will take: can choose the shortest job: short jobs don't wait for long jobs, long jobs wait for short jobs. if you have preemption: shortest remaining time first - choose the one with the least time remaining in the job. Mode shift: - leave the massive scientific job on a batch queue space. - enter the multiprogramming world of alternating CPU- and I/O-bursts. - trick will be to estimate the CPU-burst length of each process individually. - essentially, schedule those with short bursts in preference to those with longer bursts. Scheme for prediction - in accounting at a context switch, update the estimator in the PCB (or the thread context) with how long that cpu burst was. - estimator tends to be an adaptive gradient / exponential average - ability to track an "average" without holding many samples. - Prediction_t = alpha * Measurement_t + (1-alpha) * Prediction_{t-1} Multilevel feedback queue processes will start in the queue that gives them the processor quickly, whenever they want it, but only for a short time. if they should take the entirety of their quantum, they should be relegated to the next level of queues. if not, maybe they get to move back up. queues further down may be given longer quanta than the higher queues - why? these are the cpu-intensive processes. - not going to run unless all the other queues are empty. - to give that process as much of forward progress as possible. - you'd rather give him 0.5 seconds once every 5 seconds than 0.1 every 1. - he'd rather have that allocation. - if we get down to switching between 3 processes in the lowest queue, there's nothing to be gained by switching between them often, and a little performance to lose, might as well let them run thrashing is the inability to get much forward progress done because you're switching too much. --- Lecture 8 -- thread, signal safe code; interactions between threads and signals. what's an example of a non-reentrant function? to which thread should a signal be delivered? * strtok - (a) it mangles its input (doesn't make it non-reentrant) (b) holds state -- you pass "NULL" to it a second time. strtok(const char *, const char *sep); << you could imagine this. ; would duplicate the string behind your back.... but then it'd ; have to release it. strtok(char *, const char *sep); << this is reality. "the" <- strtok("the quick brown fox", " .;:") "quick" <- strtok(NULL, ".;: ") * gethostbyname - "standard" hostname to ip address resolver function. - stores the result as a global, returns a pointer to that piece. * system calls that set errno also count. - system call fails, returns -1; if another system call were to succeed, errno would be cleared. (I could be wrong. Hooray, it's unspecified. all answers are correct!) if another system call were to fail for a different reason, then you'd be confused. If you were to call one of these functions inside a thread or signal handler, screwed you are. Conventional wisdom for writing a signal handler. Signal - means for the OS to notify a process of some event, possibly generated by another process. - implemented by assigning function pointers to signal numbers. - or, that a signal should be ignored. - or, that a signal should result in the process terminating. For example, if the process has written to a pipe where the other end has died. The OS might send SIGPIPE. - which then kills your process. - probably a good idea in general: cat /dev/zero | more - probably a bad idea for, say, firefox, that might not want to die just because some child process misbehaved. SIGINT - ^C SIGKILL - really go away. SIGHUP - the terminal hung up (good old modem days) SIGUSR1 SIGUSR2 - for whatever you want. SIGWINCH - signal that your pty's size changed (default action is to ignore) SIGALRM - (my least favorite) function pointer handed to the os is the signal handler. When handling a signal: set a flag. just set a flag. let the main thread of execution take care of whatever action was important. Don't try to do anything clever. Do no I/O. multiprocessor scheduling issues. scheduling for more than one processor at least two main issues. 1) you'd like to have them all working - want to keep the utilization up. (load balancing) 2) you'd like to keep processes running on the processor where they get the best performance. (processor affinity) these are or can be at odds. processor affinity. possible to have a mix of processes that will work best if a pair of them are pinned to the same processor. pair of processes with a bidirectional pipe, client and server. if you're able to recognize this relationship, can take advantage of it. might limit your load balance. large multiprocessor machines. (16-64 processors, SGI) - individual memories associated with each processor - ability to migrate processes and memory pages to other processors - but this consumes the ridiculously expensive and constrained backplane. - NUMA (non-uniform memory access) now a problem for everyone (not so much with NUMA) - Simultaneous Multithreading (SMT) aka Intel Hyperthreading - superscalar processor can maintain more than one thread context at a time, issue instructions from these thread contexts *simultaneously* - trying to address the problem that while running just one thread, you might waste a lot of time in memory stalls. - memory stall: ld a word from memory that is not in cache, it takes 60ns to pull from DRAM. in that time, what will the processor do? "nothing" or "speculate" - covering up memory latency - if you have instructions from another thread loaded into the processor, - that thread can use the functional units of the proc. - kinda looks like a multiprocessor (in that there are extra sets of registers, one for each simultaneously active context); but it doesn't have a copy of each functional unit for each thread. "real" schedulers. we described multi-level feedback queue. If you were writing a scheduler for a production operating system, what do you need? - could benefit from knowing whether the machine will be a desktop or a server system: - why? * desktop machine should be more interactive: goal is response time. * server machine's goal: throughput. might mean longer quanta. (maybe?: possibly penalizing cpu-bound jobs less.) - scalability matters: for a production operating system, you'd like to be able to handle thousands of processes. this means that any complicated O(n) scheduling algorithm would suck. * example: Linux had an O(1) scheduler. (so-called, maybe even true.). - each time the scheduler ran, it had some constant work to do. - applications: might have very important applications: window server for which "fairness" might be bad. might want to prioritize based on appliation importance. might have applications with specific requirements: a little cpu very often. media players. - users: may want to express their idea of priority of different processes. running using "nice" Spend some time understanding what might be fig 5.11 (solaris current priority -> ( priority if you expire, priority if you sleep, quantum length) Open question #1: In this discussion "processes", "threads", "tasks", "jobs", but not "user". If you were to run 100 processes on heaving, you would likely get 100x the CPU of any other user. Fair? * RLIMIT_NOPROC <- kernel enforced limit on the number of processes you can run. -- every should run up to that limit to fight fire with fire. -- (please don't) * I think there's a scheme for traversing the set of priorities that involves dropping your (and child's) priority on fork to sort of fix this... a bit. Open question #2: What if a non-interactive process has heavy I/O requirements? (many concurrent reads and writes from disk) how can you read more than one block from the disk at the same time? - non-blocking I/O. - reading a large buffer at a time . interactive process with meager i/o requirements (read my conf file) would get stuck behind the non-interactive guy. unless somehow the cpu priority worked it's way into the i/o scheduler. -- Lecture 9 -- Because of a small snafu with hw4, you can take until thursday. (but if you have it now, turn it in.) Snafu == incomplete (four-question) draft version of hw4 posted online. (there are, and will always be six). input file: something.script (show examples in a sec) main.c: the simulator, expects to be linked with a file providing Get_Next_Runnable <- finds the thread that you want executed next, pulls it off the runnable/ready queue, and provides it to the caller. Make_Runnable <- take either a new thread or one that is awakening, and expect it to go into the ready queue somewhere. output: 1) if verbose, print the tid that executes over each clock tick. 2) summary of each thread 3) overall summary statistics (wait time, how many cycles the idle thread got, how much cpu time was consumed.) draft score: (#of context switches + number of ticks to execute the script) times cpu time required. There are the Thread_Queue manipulating functions from geekos/list.h called-in by geekos/kthread.h. No mind-reading schedulers. Use only history to predict future cpu/io burst behavior. May widen the interface to report thread about to go to sleep on I/O. -- Chapter 6: Synchronization. with threads, and processes, and many processors on modern hardware lots of stuff can be processed at the same time: concurrency. the downside is that we have potential for race conditions. race condition -- any code, any action, where the durable outcome depends on timing, who gets there first, (or last). if the thread body is: { counter += 1; } 1. load the value at the counter 2. increment the value in the register 3. store the value back out to memory. if our thread is interrupted in the middle of this procedure, something won't work. either we'll increment a stale value, or maybe we'll store a stale value. simplest example ever. nearly impossible to debug. The reason we talk about this stuff (again) in this class is that operating system has these issues in spades: many kernel threads, many interrupts, both of these things create concurrency. pressure to be very efficient, fast, and exploit all the available concurrency. An option is to protect all kernel structures from any race condition with one, big lock. Pretty close to disabling interrupts any time you want to access stuff without being interrupted. (which is the approach in geekos). disadvantages: - limits concurrency: you could probably do other things at the same time, maybe one kernel thread wants to manipulate an I/O queue while another wants to manipulate memory pages. (or just different devices... or sockets... or something). - no separation between read locks and write locks - could extract more parallelism (perhaps) if there were non-exclusive locks. - there's more, we'll come back here.... maybe. The Critical Section Problem how to protect data elements by accomplishing the following four goals: (A) allowing only one process to be in its critical section at a time. (B) "progress": those procs not in their critical sections can't block those that want to be in. (C) bounded waiting: can't block indefinitely to get in. (D) not assuming anything about the speed or number of CPUs. Simple answer #1: disabling of interupts great, except: (1) applications can't (be allowed to) do it. (2) multi-processor systems. (other processors will keep at it) Simple answer #2: have some sort of "lock" variable. if it's zero, set it to one and claim the lock, when done, reset to zero. if we want the lock and it's already one, "while(lock);" great idea: except: (1) the lock itself needs to be protected. (more than one thread could see it be zero and set it to one at the same time) Simple answer #3: alternate. when I'm done with my critical section, I'll tell the next process, it can be his. let "turn" be the variable describing whose turn it is to enter the critical section. 1. while(turn != 0); 2. do the critical work. 3. turn = 1; // let the other guy go. 4. do our non-critical section work. could extend it to be turn = (turn + 1) % number_of_threads great, except violates "B" - if thread 1 doesn't want the critical section, it won't take its turn, preventing thread 0 from entering again. Peterson's solution Two processes only: my_id, other_id are 0 and 1 (unless you're in the other thread) Two shared fields: int turn; bool flag[2]; To acquire the lock / enter the critical section: flag[my_id] = true turn=other_id while(turn==other_id && flag[other_id]); // loop. To exit the critical section: flag[my_id] = false * If both call "enter" at the same time, one of those "turn=other_id" will stick. * the thread whose setting of turn sticks will cause the other thread to fall through and it will stay blocked. * the blocked thread will wait until the blocking thread leaves. If thread 0 acquires the lock, and is in its critical section, and thread 1 is elsewhere. flag[0] = true flag[1] = false turn=1 If thread 1 comes along: flag[1] = true turn=0 will spin until flag[0] is cleared. If both start at roughly the same time: flag[0] = true flag[1] = true turn=0 turn=1 while(turn==1 && flag[1]); while(turn==0 && flag[0]); hooray! ... flag[1]=false; get the lock again flag[1] = true turn=0 while(turn==0 && flag[0]); block hooray. for next time, why can't you extend this to three threads? -- Lecture 10 -- HW4 actually due. Amazon thing in 1115 after class (5p). Seattle is a superior place to be. All chapter 6 stuff, with some extra Plan: (dunno how much we'll get to...) bakery algorithm hardware instructions for synch spinlock example. Semaphores Producer/Consumer with semaphores Reader/writer with semaphores. Condition variables Monitors Bakery algorithm (not in the text... but I'll borrow from wikipedia) providing roughly the same as Peterson's solution, only for more processes. scheme "take a number" 1. choose your number. choose a number higher than anyone else if nobody's around (all numbers are zero) you'd pick 1. 2. wait until all the processes with lower numbers finish. 3. use your tid/index to break ties. // declaration and initial values of global variables Entering: array [1..N] of bool = {false}; Number: array [1..N] of integer = {0}; 1 lock(integer i) { 2 Entering[i] = true; 3 Number[i] = 1 + max(Number[1], ..., Number[N]); 4 Entering[i] = false; 5 for (j = 1; j <= N; j++) { 6 // Wait until thread j receives its number: 7 while (Entering[j]) { /* nothing */ } 8 // Wait until all threads with smaller numbers or with the same 9 // number, but with higher priority, finish their work: 10 while ((Number[j] != 0) && ((Number[j], j) < (Number[i], i))) { 11 /* nothing */ 12 } 13 } 14 } 15 16 unlock(integer i) { 17 Number[i] = 0; 18 } 19 20 Thread(integer i) { 21 while (true) { 22 lock(i); 23 // The critical section goes here... 24 unlock(i); 25 // non-critical section... 26 } 27 } ( this implementation assumes numbers don't wrap ) ( the only time that this would matter is if you always had someone waiting, because unlike the roll of numbers at the dmv, the numbers reset to zero ) Hardware instructions that matter. If we had only a single processor machine, these primitives might suffice. On the multiprocessors, we need some help: - owning the memory bus, so that only one instruction can perform an atomic read/write to acquire a lock. Two types of instructions: test-and-set swap (or exchange) TSL (test and set lock) testandset(int *lock_address) register <= *lock_address *lock_address <= 1 return register TSL eax, [memory_location] acquire: we want the event that our process *set* the lock. when the lock value goes from 0 -> 1, we got the lock. other possible case, at the end of test and set: 1 -> 1. while(test_and_set(lock) == 1); // spin. release: lock = 0; Primitive #2: swap. Swap(boolean *a, boolean *b) // all atomic, again. tmp = *a *a = *b *b = tmp acquire: k=true while(k) Swap(&lock, &k) // looking for lock 0 -> 1, which would put 0 in k. release: lock=false Has some downsides: (1) busy waiting. not only is it busy waiting, these implementations monopolize the memory bus. (2) could starve. more precise implementation of a spinlock in 386 assembly (also from the wikipedia) lock: # The lock variable. 1 = locked, 0 = unlocked. dd 0 spin_lock: mov eax, 1 # Set the EAX register to 1. loop: xchg eax, [lock] # Atomically swap the EAX register with # the lock variable. # This will always store 1 to the lock, leaving # previous value in the EAX register. test eax, eax # Test EAX with itself. Among other things, this will # set the processor's Zero Flag if EAX is 0. # If EAX is 0, then the lock was unlocked and # we just locked it. # Otherwise, EAX is 1 and we didn't acquire the lock. jnz loop # Jump back to the XCHG instruction if the Zero Flag is # not set, the lock was locked, and we need to spin. ret # The lock has been acquired, return to the calling # function. spin_unlock: mov eax, 0 # Set the EAX register to 0. xchg eax, [lock] # Atomically swap the EAX register with # the lock variable. ret # The lock has been released. Other bits about spinlocks: (1) busy waiting. (2) can be unfair. (can starve others) (3) various production operating systems will spinlock if they believe the lock is about to be released. (else they'll find a way to block) about to be released: (a) the lock is not intended to be held for a long time (b) the process holding the lock is running (maybe runnable)... not sleeping. - might be more efficient to spin on the cpu/bus for a little bit, rather than context switch. Semaphores: synchronization primitive with a counter and two operations. P (proberen, test, try, down) V (verhogen, increment, raise, make higher) counter (which is decremented by P, incremented by V) will not become negative. (only becomes as negative as the number of blocked prcesses) wait: (done atomically) while(semaphore_value <= 0); semaphore_value--; signal: semaphore_value++; what does it mean for a semaphore value to be larger than 1? -> counting semaphore. intended to represent some number of (roughly identical) resources. perhaps a number of buffers. alternative: binary semaphore in which a >1 value means a bug. note that signal can be called arbitraily many times. to have a critical section with semaphores wait(mutex) do the critical stuff signal(mutex) do the other stuff. to provide blocking semaphores construct a queue:. each semaphore: struct semaphore { int value; Thread_Queue wait_queue; } wait: (atomic operation, better disable interrupts in geekos) S->value--; // may become negative. if(S->value < 0) { Enqueue_Thread(S->list, g_currentProcess); Schedule(); } signal: (atomic operation, better disable interrupts in geekos) S->value++; if(S->value <= 0) { Make_Runnable(Remove_From_Front_Of_Thread_Queue(S->list)) } Classic problems = producer/consumer with a bounded buffer (first); reader-writer problem. bounded buffer: limited size, could imagine as a circular buffer. if full, producer should stop. if empty, consumer should stop. goal: block the processes that are not ready to run. can use counting semaphores to block the appropriate thread. producer: new_item = construct_our_item(); // no synchronization required // buffer must have some empty space for us to continue wait(&Semaphore_EmptySpace); // that empty space is ours! wait(mutex); insert(new_item); signal(mutex); signal(&Semaphore_SomethingThere); // now, if there was a blocking consumer, it would wake up. consumer: wait(&Semaphore_SomethingThere); wait(mutex); consumable = pop_item(); signal(mutex); signal(&Semaphore_EmptySpace); consume_our_item(consumable); brief introduction to deadlock: swap wait(mutex) with ... anything. Next time: reader/writer. === Lecture 11 === a) maryland voter registration deadline is october 14. b) cpu scheduler homework - a) due oct 16 b) updated distribution on-line 1) has a scoring component for handling priorities ( sigma(priority * cycles_waiting) ) 2) lacks sanity checking in geekos/list.h (speed!) #define KASSERT(x) assert(x) => #define KASSERT(meh) 3) your code has three minutes. c) sketchy turnin / leaderboard: https://kohoutek.cs.umd.edu/cpu/ c) midterm oct 21. review in class oct 16, review in discussion 20. may change to oct 23 unless someone sends me email objecting. reader/writer... "reader/writer locks" -> reads are not exclusive, but writes are. data that threads can all read simultaneously, but if it was to be modified you'd have to keep all the readers away. two classes of solution: 1) readers can't be stopped by a waiting writer. 2) readers can't start if there's a waiting writer. who's gonna starve? implementation using semaphores S_mutex->value = 1; active_readers = 0; S_writing->value = 1; writer: wait(S_writing) perform the write signal(S_writing) reader: wait(S_mutex) active_readers++; if(active_readers == 1) // if we are the first reader wait(S_writing) // if there's a writer present, wait for him to get out of the way. signal(S_mutex) perform the read wait(S_mutex) active_readers--; if(active_readers == 0) signal(S_writing) signal(S_mutex) == peterson's algorithm, bakery algorithm <- assume single processor machine. anything else (TSL, XCHG instructions) <- don't make that assumption, can support multiprocessors. if more, or more detailed arch questions, ask on forum or by email, I'll try to get them answered. == Condition Variables and Monitors. Condition variable: x.wait (will always wait) x.signal (will wake up just one waiting process) if nothing is waiting, nothing happens. Can be implemented using semaphores. Often combined with monitors Monitors: "wah. semaphores are too hard. I want something easier". Define a set of routines and data items s.t. only one routine can be active at a time. Java's "synchronized" stuff. Example: Bounded buffer producer / consumer using monitors. monitor OurBoundedBuffer condition not_full, not_empty; integer count; // number of things in the buffer. int N = BUFSIZE; insert(item) { if(count == N) wait(not_full); really_insert(item); // put the item in the buffer. count++; if(count == 1) signal(not_empty); // or, just signal all the time. } remove() { if(count == 0) wait(not_empty); retval = really_remove(item); // remove the item from the buffer. count--; if(count == N-1) signal(not_full); } end // if signal-and-wait -- I think this still works. remove remove remove ; insert insert insert // if signal-and-continue -- question is would more than one remove be awakened? // if on leaving the monitor, you unblock and immediately start any processes waiting on entry .... argh. still screwed if there's more than one thread calling insert. producer: while(not_done) item=really_produce_an_item insert(item) end consumer: while(not_done) item = remove() really_consume_the_item(item) end There are two schemes for what to do when the program calls signal: signal-and-continue -f you woke up another thread with your signal, keep running until you're done with the routine (or longer). signal-and-wait - if you woke up another thread with your signal, you let him run. Monitor implementation with semaphores Each procedure is replaced with P() wait(mutex) // give the mutex (semaphore as usual) actual body of P // if we have the mutex, we can execute the routine. if(next_count > 0) // if someone was just awakened? signal(next) // wake him up. he'll be in the monitor now. else // otherwise signal(mutex) // give up the mutex. no one is in the monitor. condition variable procedures: x.wait(): // condition variable wait x_count++ // increment the number waiting on this cond variable. if(next_count > 0) // if there's another guy to pass to (perhaps because he recently signaled us) signal(next); // signal him else // otherwise signal(mutex); // leave the monitor. wait(x_sem) // and wait on this condition variable mutex thing. x_count--; // we've been awakened and are no longer waiting. x.signal() // condition variable signal if(x_count > 0) // if there's someone waiting on this condition variable. next_count++ // a process is being awakened. signal(x_sem); // increment the semaphore to unblock the waiting process. wait(next); // signal-and-wait next_count--; // we've been signaled (signal(next) has been called).. now there's one fewer guy waiting on next Dining Philosophers n chairs, circular table, paired utensils s.t. each philosopher would need both left and right to eat. can't grab more than one at a time. hard to setup so that there is no deadlock: if everyone picks up the left side first, philosophers may starve. classic synchronization problem. A scheme for "solving" dining philosophers with monitors 1) give each philosopher a state: Thinking -> Hungry -> Eating 2) initial condition, everyong thinks. monitor the_dining_philosophers test(i): if i's neighbors aren't eating and i is hungry i <= Eating signal(i) pickup(i) i <= Hungry test(i) if i != Eating wait(i) putdown(i) i <= thinking test((i+1) % N) i==0 ? test(N-1) : test((i-1) % N) end == Lecture 12 == Midterm still for october 21. study early. :) cpu schedulers. hey, somebody please turn one in. my turnin is lonely. Today: transactions then deadlock. transactions. ACID. Two-phase locking. performance and concurrency. ACID - stands for the four properties of a transaction system Atomicity, Consistency, Isolation, Durability Atomic: either the transaction (all of its little operations) happens or none of it does. if all of them occur: commit. if none: abort. - might imply rollback, undoing any changes to the database that were made as part of the partially completed transaction. if anything goes wrong - say a server is busted, you run into deadlock, someone's account balance would become negative - abort. it's possible that you can't really abort, if the transaction had side-effects. may require a "compensating transaction" to undo a committed act. Consistency: integrity of data is preserved. invariants like a savings account never has a negative balance. as long as your operations preserve these invariants, the transaction processing system wont' break them. this is almost entirely a property of the individual transactions and business logic. Isolation: to every transaction, it should appear as if it is alone. as if it is the only transaction running in the system. serializability: there must be some order in which you can place the transactions that would have the same result as if they were all run independently. (even though they are running concurrently). obviously, if there is no concurrency, you have the schedule/order. the order is not necessarily the order in which transactions start or commit. Transaction1 (T1): a = read(x), a++, write(x, a), commit Transaction2 (T2): b = read(x), b++, write(y, b), commit could execute operations in the following order: t1read t2read t1write t1commit t2write t2commit will look like: T2, then T1 ran. y = original x + 1 x = original x + 1 if instead we said that T1 ran before T2, x = original x + 1, y = original x + 2. but T1 started first, and finished first. but in the virtual ordering of all transactions, it is as if T2 ran before T1. Two-Phase Locking 1. it avoids race conditions. 2. it provides serializability 3. if you don't release the locks until you're committing, avoid cascading aborts. growing phase - acquiring locks. shrinking phase - giving them up. Transaction1 (T1): a = read(x), a++, write(x, a), commit Transaction2 (T2): b = read(x), b++, write(y, b), commit (try to ignore the indentation, it's not well considered) t1_lock(x, reading) t1read t2_lock(x,reading) t2read t2_lock(y, writing) t2write t2commit t2_unlock(y) t2_unlock(x) t1_lock(x,writing) t1write t1commit t1_unlock(x) Aside: there's also something called two-phase commit, which has two phases, but shares nothing else in common. Durability if the transaction commits, it should be durable, it should last. even if the machine crashes. define the "commit" to be when the "commit" record hits the disk. write-ahead logging: log the transaction start log before and after values for every operation - after it's logged, we can update the data values in place. abort: log that we actually aborted, ensure that all the operations are undone commit: log that we committed, ensure that all the operations are done. checkpoint: for some point in the history of the log, all operations have been written to disk where they really belong. on recovering from a crash, walk forward to the last checkpoint, walk back to where the checkpoint pointed (the last update / record that might not have been flushed to disk.) replay any operation from a committed transaction re-undo any operation from an aborted transaction (anything that was not commited) when can the database / transaction processing system tell the client that a transaction has committed? - when the commit record made it to disk. no, I don't know how you can be sure it made it. probably that the database manages the disk and doesn't waste time with the OS file system and buffer cache stuff. when can you release the locks associated with a transaction? - simple answer: after the commit - ensures no cascading aborts. - slightly more nuanced: read locks can be released just before the commit. doesn't matter whether your transaction commits or aborts, the next transaction will still see the same data, so wouldn't have a cascading abort. which transaction do you kill if there appears to be a deadlock. imagine detecting deadlock simply by some transactions taking seconds kill the guy with the least stuff in the log. Deadlock. Intuitive definition: two or more processes that prevent each other from making progress. Book-like: processes that need resources and could ordinarily on the machine get all they need are prevented by process that holds resources (and vice versa) Also: processes in a set are all waiting for an event that can only be caused by another process in the set. Four conditions of deadlock: 1. mutual exclusion - can't share. 2. hold and wait - get some resource, hold onto it, while trying to get another one. 3. no preemption - can't take. can't steal. can't reclaim. 4. circular wait - there is a cycle in the waits-for graph. Three schemes for dealing: 1. deadlock prevention - break one of those conditions 2. deadlock avoidance - know enough about the processes that you don't enter an unsafe state. 3. deadlock detection - if you know it happened, kill off one of the processes involved Deadlock prevention: how do we break "hold and wait" - get all the resources you could ever want at once. - pretend there's only one resource. how do we break "no preemption" - preemption: a process waiting on a resource can have its resources stolen by another. - doesn't make sense for a printer - might make sense for memory pages or disk blocks. how to break circular wait: - construct a total order on the resources, and acquire them in this order. "protocol" that the applications agree to. anybody violates this protocol, you get circular wait again, you get deadlock. ---- Lecture 13 ---- CPU scheduler grading: 5 pts for across the board (on stock tests) better than round robin 5 pts "" quicker 5 pts for 2x better than round robin 5 pts for winning on any stock test. (I figure 5 pts for winning on your own test. 5 pts for a well-reasoned description of your scheduler 5 pts for a well-reasoned (and accurate) description of your script -- 35 pts! (but nearly equivalent to all other homeworks.) you can change g_Quantum. you can change it in Init...(). you can change it in Make_Runnable(). you can change it in Get_Next_Runnable(). (and of course, set g_needReschedule when needed) and there's one fewer off-by-one error in the distribution as of last night sometime. (described in the forum) due thursday, beginning of class. Review: Four conditions of deadlock. - circular wait - hold and wait - no preemption - mutual exclusion Deadlock prevention, avoidance, detection. Deadlock Avoidance: Ensure that at all times, every process has a way out (might mean waiting for some other process to finish) Downside: won't necessarily be able to use all the resources. - not living on the edge. Other downside: each process / app needs to tell the kernel what it might possibly want. r1 and r2 are distinct, unique resources. (not like a memory page, not something you have many of) P1: get(r1) .... get(r2) release(r2) release(r1) P2: get(r2) .... get(r1) release(r1) release(r2) P1: Maximum required resources: r1 r2 P2: Maximum required resources: r1 r2 almost like eliminating hold and wait except: you still hold stuff. you still wait. potentially even for something that the system has "available" but which would put the system into an "unsafe" state. Banker's Algorithm. A set of vectors. Available[resource_id] <= number of resources of each type that are available in the system. Max[process_id][resource_id] <= number of resources potentially required by the process. Allocation[process_id][resource_id] <= how many resources we've given out. Need = Max - Allocation <= how many more we could potentally want. Is-Safe-State subroutine: Work = Available Finish[process_id] = false loop: find i such that !Finish[i] && Need[i] <= Work if i exists then Work += Allocation[i] Finish[i] = true; // hooray goto loop; else if Finish[i] for all i then safe! // success! return true else not safe. return false end end Resource-Request: a) ensure that Request[process] + Allocation[process] <= Max[process] the process lied to us. it's asking for more than it said it needed. not clear what you do: could just adjust max (if that remains safe). b) ensure the Request[process] <= Available otherwise clearly can't satisfy the request: wait until that resource is released. c) pretend we allocated the resource to the process. if still a safe state, really allocate. if not, wait. (until resources are reclaimed or a process ends) If a process was allowed to run forever +) no deadlock -) starvation for the process waiting for the highlander to die. Deadlock detection and recovery. How can we tell the difference between - deadlock - waiting for a long time Compute the dependencies between processes s.t. any cycle exposes the set of processes that are deadlocked. - waits-for graph (nodes represent processes, edges represent a process waiting for another) If all our processes are waiting (deadlocked), we can do all kinds of stuff to figure out what the issue is. For users, we would like to prevent all deadlocks - when your mail program and address book program and .mac synchronization get together to do nothing useful. For operating system textbook writers starting from the history of multiprogramming jobs, the goal is keeping utilization high. Which process(es) do you kill? (expect to still run that process again) - indiscriminately - maybe youngest. may cost the least to restart that guy. - maybe oldest. still hasn't finished. what's he waiting for. What do you do if there's more than one cycle Memory Management: memory architecture is separated from virtual memory (using disk as backing store) stall: when your process tries to read memory, it might have to wait many cycles. goal: avoid this waiting. protection: keep user programs from accessing kernel memory, other process' memory, or even their own memory in bad ways. address space: the set of memory addresses a process can play with bottom-up growth of memory management: physical addresses -> logical addresses contiguously allocated -> logical addresses through paging --- Lecture 14 I think --- Midterm Tuesday. Review today. Format: ~24 multiple choice, ~6 short answer, ~2 longer answer (groups of questions) mc: "which of the following is not a process state" -- only harder than that, usually. short: define something; what are the characteristics. "what's the motivation behind monitors?" long: more likely fix this code, what's going on here. Midterm questions may include geekos stuff. mutex as equivalent to a binary semaphore (not a counting one) wouldn't necessarily have to be implemented with a semaphore Monitor implementation with semaphores: only one process can be running in the monitor at a time. - if a function inside the monitor is waiting on a condition variable, that process isn't "running" in the monitor... so it doesn't count. usually with condition variables (otherwise it's insufficient for concurrent programming) associate monitors with some shared data and the functions that access that data. - wrap functions in the monitor with some synchronization stuff. copied from line 1728. Each procedure is replaced with P() wait(mutex) // give the mutex (semaphore as usual) actual body of P // if we have the mutex, we can execute the routine. if(next_count > 0) // if someone was just awakened? signal(next) // wake him up. he'll be in the monitor now. else // otherwise signal(mutex) // give up the mutex. no one is in the monitor. expect: get mutex, run P, release mutex. if actual body of P was signaled (was awakened by another process in this monitor) we want wake him up instead. the mutex remains claimed. x.wait(): // condition variable wait x_count++ // increment the number waiting on this cond variable. if(next_count > 0) // if there's another guy to pass to (perhaps because he recently signaled us) signal(next); // signal him else // otherwise signal(mutex); // leave the monitor. wait(x_sem) // and wait on this condition variable mutex thing. x_count--; // we've been awakened and are no longer waiting. you already have the mutex before running either of these. x.signal() // condition variable signal if(x_count > 0) { // if there's someone waiting on this condition variable. next_count++ // a process is being awakened. signal(x_sem); // increment the semaphore to unblock the waiting process. wait(next); // signal-and-wait next_count--; // we've been signaled (signal(next) has been called).. now there's one } fewer guy waiting on next What are the parts of a process? - program - address space - thread contexts - resources (files, etc.) What are the things stored in a pcb? - pid. - much, much more. How is popen implemented? (not review, just an aside) bidirectional pipe on bsd (and mac, hooray!) only. :( property of the pipe() system call normal system v pipe is unidirectional: there's a write end and a read end. What must any solution to the critical section problem provide? - only one in the critical section at a time. - "progress" - those not in the critical section can't block the others. - bounded waiting - can't assume processor speed (relative progress of each thread) What's the first user process the (not geekos) kernel starts on boot? - init. runs in user space with user permissions. What's a thread pool? - don't create a thread per task, keep a bunch of threads around to service tasks. What's ACID? What are the four conditions of deadlock? What does test-and-set do? TSL instruction. What's it used for? What's a scheduler activation? - support many-to-many user-level threads to kernel-level threads, scheduler activation tells the user-level thread scheduler that a "new" "processor" is available when a user-level thread blocks. What are the system calls for a socket? What does signal() the system call, not the semaphore or condition variable or other operation do? Is system() a system call? What makes a function reentrant? How does deadlock avoidance work? How might you create a user level thread? - switching is done by saving register state (and stack pointer, instruction pointer) into some data structure, and restoring that state later. Which user level thread handles signals? (or how might a user-level thread package tell the kernel) Does someone call clone() to make a user level thread? Does someone call clone() to make a kernel thread? What makes it a microkernel? (besides smallness) Lecture 15 because the midterm doesn't count as a lecture. -- Midterm Review. I do not expect to hand back the midterm. Wednesday Oct 29 1:30-4:30 stop by AVW 4133 if you're really interested in seeing your exam. Stats: High 104 (tie) Low 36 (not a tie) Avg ~74 median around there too. Q6 which of acid isn't provided by the transaction processing system? atomicity is from the commits isolation is from the locking durability is from the log consistency is from the app not breaking invariants in the database. Q21 scheduler activations Q22 exec will replace the address space of a process by loading a new executable. Q28 just embarassing. shame. Q31 SMP is symmetric in what the processors can do. Q32 user level threads created by setjmp. Q33 reasons to use processes instead of threads. the answer is not: "because you don't need to" "share memory" isolation: a process can restart, seg fault, crash, whatever. permissions/protection: a process can have other permissions (drop them) distribution: if you think the processes will run on different machines. Q34 sorry about the wording; tried to correct on the board. null.exe is running in the background, which is the point of assignment 1. null takes its quantum -> golden. round robin scheduler. shell has to wait for null to finish its quantum Q37 can't tell at the beginning what resources you'd need. O(n^2) lots of processes don't terminate anymore. (no one wrote that) Q38 did it use the quantum last time? whether waking from I/O redundant. which queue was it in last? Q41-43 semaphores from monitors. - bit of misinterpretation: some answers involved geekos thread queues and waking up. - some also missed the "block" part. block means don't spin. while(S<=0); // never leave monitor you are screwed. 41: S=1 cvar = new condition_variable 42: S-- if(S<0) cvar.wait() 43: S++ cvar.signal() Q45-46 T1 :read(X ) T2 :read(Y ) T3 :read(Z ) T1 :write(Z ) T1 :commit T2 :write(Z ) T2 :commit T3 :write(Q) T3 :commit conflicts over Z give the order: T3->T1->T2. no; T3 gets read lock on Z. Q39-40 misc. PA3: not committed to running precisely the same scheduler as turned in in hw3. possible-or-likely that priority should matter. Chapter 8 things: Memory (not yet virtual memory backed by disk) what's a stall? (memory stall) processor is fine and happy, and it has to wait for slow memory to provide a word, you have a stall. memory protection: keep user programs from referencing (reading) or changing memory owned by the kernel or by other processes. the illusion we seek: run several processes at a time, each with the illusion of its own physical memory. address space: each process's view of memory. all the locations it can touch. Basic strawman scheme (but used) to run more than one program: could swap them in and out in their entirety. Similar scheme: "overlays" (from those happy DOS and CPM days) some program too large itself to fit in memory, you could actually swap out pieces of its code. one overlay for initialization routines, another for shutdown, another for ordinary use. different overlays for different "drivers" Instead, (as memories become larger than programs) we'd like to have more processes running at a time. *contiguous allocation load up an app at some address and tell the application that it has some size of memory that starts at address 0. behind the scenes, we'll have a "base" register to add to the application's memory address. and we'll have a "limit" register to ensure that the addresses don't exceed the space for that app. Picture: partition the physical memory into: [ kernel | app 1 | app 2 | app 3 | | app4 | ] ^ app1's base : ^ app3's base : ^ app4's base ^ app1's limit+app1's base : ^ app3's base+limit what would the base and limit registers look like to the kernel? 0, inf note: "contiguous allocation" totally applies. the region of memory used by app1 is contiguous. where does a new process go? first fit - lowest numbered space where the app would fit. best fit - smallest space where the app would fit worst fit - largest space (where the app better fit) can you share memory? you could share the top of an app with the bottom of another, but way sketchy you could have two threads share memory. if you add segmentation, (more than one base+limit register pair to play with) (or if it went through the kernel) ... but really, no. fragmentation: (remember fragmentation, it comes up in disk allocation too) internal fragmentation: your segment is made too large, unused inside. external fragmentation: spaces between allocations, wasted because they're too small for processes. compaction: taking and moving application address spaces without them knowing. (change the base register) Lecture 16 -- Section tomorrow magically converted to office hours. Aaron has nothing left to say. Midterm review office hours tomorrow afternoon (1:30-4:30) avw 4133. Homework 6 due Nov 6. Paging and Virtual Memory. Hardware support. Schemes for organizing the page table. Schemes for deciding which pages to evict. How to share pages and libraries. Difference between paging and segmentation: paging - fixed size "pages" that fit into page frames and we expect these pages to not be consecutive. shuffle the little ~4K pages in memory. segmentation - variable size, contiguous regions of memory in which the entire text, or data, or stack or heap, or library, will be stored. Why use paging? - limit external fragmentation - unused space between segments because all pages could be used. - our application can be larger than phy memory on the machine - don't need stuff deep in the stack all the time - initialization code - exception or error handling code - extra features in libraries - very large data sets that you might iterate through. - could create a large, mostly unused buffer. #define BUFSIZE 128000000 // if up against RLIMIT_DATA, might fail to load. char buffer[BUFSIZE]; // in the data segment.. or on the stack. read(s, buffer, BUFSIZE); // - our group of applications can then be larger (we can run many apps at the very same time) - we don't have to swap out whole segments, - the entire "working set" (actively used code and data) of all the apps is present at once. Process address space: Code || Data || Heap -> || emptiness || <- Stack emptiness likely replaced with shared libraries and other shared segments We need a Page Table per-process logical page number -> physical page (frame) number - likely one page table per process. *If* we imagine a 32-bit machine (processor with 32-bit addressing), 4K pages page number (which page) 20 bits in the page number. page offset (how far into the page) 12 bits in the offset. How many entries in the page table? 2^20. 2^10 * 2^10 =~ 1000000 entries. we store more bits in the page table: valid or invalid - if valid, logical page is in memory, the processor is happy, can proceed. if not valid, logical page is perhaps not allocated, is perhaps on disk. referenced bit - used by the "clock" algorithm to determine which pages to evict (could imagine more general bits for any page replacement algorithm) dirty bit - the page has been written (modified) and would need to be written to disk before being discarded. copy-on-write bit - (might be called "private" in linux parlance) if you fork() - flag the pages copy on write so that any (potentially unlikely) writes would allocate another page and anything shared would remain shared: no footprint to the new process. protection bits - e.g., execution but not writing, absolutely nothing unless all-powerful kernel (supervisor) mode. pin / lock - if you gave a device a reference to user memory, you don't to page it out until that device has read/written the memory. (kernel might not page itself, might set aside a region that wouldn't be paged...) but that user memory might need the bit set to be kept in memory by the pager. On every memory reference (ld, jmp, st), the processor generates some logical address that the MMU has to translate into a physical address. Rather than lookup in the entire table every time, we employ a fully-associative cache called the TLB "translation lookaside buffer" that includes very few entries - 64-ish? The intent is to avoid the page table lookup for frequently accessed pages. ideally, the page that has the code you're running would have an entry in the TLB. The TLB can be managed by the hardware or by the kernel. - (RISC-ish) a tlb miss can generate a trap that then requires the kernel to perform the page table lookup, decide which TLB entry to evict, insert the new TLB entry. - if the processor knows what the page table looks like (has a page table base register, at least), it can perform the lookup itself and insert an entry into the TLB. How do we store page tables? = memory gets bigger. applications/processes get bigger. - can use the straightforward, linear page table scheme with a million entries. - have a "tree" "hierarchical" can potentially have page frames of varying size. - hashed page table - inverted page table : if you want to access a page that's in memory, the search over all the memory pages might not be so bad. if you need a page that's in disk, you're screwed (10ms)... so what's the point of avoiding linear search? Lecture 17: projector failure; reconstructed approximate notes. -- Nov 4 is a cell-phone-quiz amnesty day. ( let the phones ring out... don't not vote or not volunteer because of class) hierarchical paging - might not want to hold the entire page table in contiguous paged memory. - logical address is a pointer to the outer page table, then the inner page table, then the bit in the page. - setup so that the page number is split again; a (potentially) 2^10 entry "outer" page table (aka page directory) points to the ordinary page table. - PAE - i386 scheme for allowing 4K and 4M pages by skipping the "inner" page table. 10 bits of address index the outer. the other 22 bits of address point to the big region. - can provide physical addresses that are longer (in bits) than the virtual addresses an app can get. hashed page tables - pretty straightforward, take your address, put hash it into an open(?) (with a linked list) table. inverted page table. - when the addressible space (64-bit) is much larger than the usable memory, too large for realizing the page table. - only store in the usable page table the virtual address corresponding to the physical address. - and the associated pid for the virtual. - linear search might be required. - sharing is not obviously supported, because the physical address stores a single binding to the logical in a process. Relocating libraries. Position-independent code: absolute object code without addresses. for the code to be really sharable between processes, it has to be loaded simultaneously at two different addresses. or, you have to decide globally at which address any library will be loaded, so that their allocation doesn't overlap. 1) relative addressing is used as often as possible. there are a few types of jump instruction: jump relative jump literal address jump register (indirect address) the goal is to eliminate jumping to literal addresses (that would have to be filled in by the linker) and instead use as many relative jumps as possiblew. 2) indirect addressing through a linkage table is used for globals or long branches globals aren't shared across processes, you know. 3) procedure linkage table and data linkage tables are kept in the process's data segment. === Swapping vs. Paging distinction. Know it. Page Fault: - the MMU traps to the OS when the app accesses a page marked invalid in the page table. -- might be invalid because it's totally invalid (NULL or garbage pointer) -- might be invalid because it hasn't been paged in. The Pager: - task: bring pages in from disk, mark them valid. (tell the kernel to restart the instruction that trapped) * Page Replacement - how do we find the "free" frames for when we want to page in something after a fault? - Mechanism (getting it done), Policy (which one to pick) a) write dirty frame out; when written (or the disk buffer has the data copied), mark it clean. b) find a clean, unused page, mark it invalid c) find an invalid page frame, claim it. Policy 1: FIFO That which has been in the vm longest, can be removed. - if you believe the stuff that was just brought in will be used for a while and then isn't useful. - expect important stuff to come back in again. - works less poorly if you don't actually evict the page (pretend to page something out, but keep it if referenced before replaced) Sequence: 1 2 3 4 1 2 3 4 1 2 3 4 if we imagine having only two pages, fifo would yield: 1 1 1 4 4 4 3 3 3 2 2 2 2 2 2 1 1 1 4 4 4 3 3 3 3 3 2 2 2 1 1 1 4 okay, let's have a better sequence. 1 2 1 3 4 1 3 1 2 .1 1 1 1.4 4 4 4 4 .2 2 2 2.1 1 1 1 .3 3 3 3 3.2 ^ so this could have gone better by evicting 2, the least recently used. 1) lru seems more likely to limit issues 2) see the loop? easy to implement fifo with a little pointer incremented each time. Lecture 18: -- Homework due Nov 6. don't panic. Cellphone amnesty day. Plan: Page replacement algorithms Page buffering Page allocation Kernel memory allocation. Suggestion for discussion section: c review, or that stupid thread queue list.h thing. FIFO: - first page in is the first page out, regardless of whether accessed. - pathological case for FIFO page replacement: sometimes a smaller phy memory can see *fewer* faults. - not the case that a larger phy memory would hold everything a smaller physical memory would -> Belady's anomaly. 1 2 3 4 1 2 5 1 2 3 4 5 <- sequence of accesses .1 1 1 1 1 1.5 5 5 5.4 4pg: .2 2 2 2 2 2.1 1 1 1.5 10 faults. .3 3 3 3 3 3.2 .4 4 4 4 4 4.3 1 2 3 4 1 2 5 1 2 3 4 5 <- sequence of accesses .1 1 1.4 .5 5 5 3pg .2 2 2.1 1 1 .3 9 faults. .3 3 3.2 2 2 2 .4 Policy #2 The fictional OPT: - make the correct decision based on looking into the future. - obviously, you don't actually know the future. - evict what won't be used for the longest time into the future. - apparently this local strategy works out to minimize the number of faults. another 3 page example. 1 2 1 3 4 1 2 1 .1 1 1 1 .2 2 .3.4 Policy 3: LRU - replace what has not recently been used - possible crappy implementation: stamp a counter on each page when accessed. - whenever you want to find the least recently used page, find the oldest stamp. - could instead maintain a sorted list (doubly-linked queue) - on an access, pull from the middle of the queue, move to the top - to reclaim a page, pull from the bottom. - ~6 pointer fixups to do... - is a type of "stack" algorithm -> no Belady's anomaly - the pages in a memory of size S are a subset of those in size S+1. Policy 3.9: FIFO + Second Chance - reference bit for every page, set on every access - if, on the FIFO replacement, R is set, put the page on the back of the queue as if it were new ("reset the time") - else (R is not set: the page is both unused and old), evict! Policy 4: Clock - like FIFO & second chance, but: - have the potential to have more than one "hand" - canonical implementation - still use the reference bit, - imagine a "hand" of the clock advancing through each of the pages, - clearing reference bits and evicting unused pages. - enhancement: figure out how to deal with dirty pages. - maybe another hand that writes out not-recently referenced (r=0) dirty (dirty=1) pages. Policy 5 & 6: - Least Frequently Used, Most Frequently Used. - least frequently used - maybe the stuff used a lot isn't useful any more, new stuff is. - most frequently used - more obvious: it's important. - would need to "age" these counters. - scheme: periodically shift off the reference bit - for each page, we have the reference bit. - at time 0+eps, scan all the pages, and copy the reference bit into a per-page value. and reset all the reference bits. - at 2 epsilon, add (?) the reference bit? -> 2, 1, 0 shift the reference bit on -> 3, 2, 1, 0 (not immediately clear how to compare "2" and "1" might be the count of bits. --- this is not likely to be the actual scheme used... - use this function of the history of reference bits to estimate most and least frequently used. Policy 7: Working Set - establish some \tao (t) interval over which any accesses are likely to be in the "working set" for the process. - for each page with referenced bit set, - note the current "time" - anything without the bit, we don't update the time. - we then have a list, for all the pages, roughly when they were last accessed. - to replace: any page with timestamp < now - \tao is not in the working set, can be evicted. can choose at random. - what if no such page? doesn't matter, evict at random. (would be cool to beg the kernel for a bigger allocation or swap yourself out...) Policy 8: Working Set + Clock - instead of choosing at random, you advance a clock hand. - if the clock loops around, pick a clean page to evict - can schedule the write of a dirty page Page Buffering: all of the page reclamation stuff was about taking valid pages and marking them invalid. other steps: 1) taking dirty pages, and making them clean. 2) taking pages that have data, and zeroing them. if we want to give a new page to a process, it better be full of zeroes. - if the paging device is idle, the paging daemon can just start cleaning pages (cleaning pages means writing them to disk). - can also keep track of evicted (but not yet zeroed or claimed) pages so that, if accessed again, it can be brought back, no disk access required. - we've already done all the work to choose this page to evict, but we haven't quite finished it off. can still be rescued. goal of this is to make the page fault handler speedy. Quick mincore() example went here. Lecture 19: mincore() runs of zeroes - - (a) we'll only fault on hitting the first page of the run of pages not in core -> gives us time to load up the other pages. - (b) we choose a contiguous region because it will be easier to write out to disk (hey, disk, here's just one 50K region, instead of here's 12 4K regions) -- it's possible / likely that some of these contiguous-in-logical-address-space regions are also contiguous in phy space. - (c) it could be easier to load them back in. Page allocation. Problem: which process gets how many pages? Minimum allocation per (running) process: depends on the architecture (instruction set) canonical example of extra-complicated is VAX you need the page(s) that the instruction is in. you need any pages referenced by the instruction with indirect addressing, you could have to load a pointer from memory and dereference that pointer. you could have memory-to-memory instructions. Strawman: - completely global allocation, no view of a process being entitled to any pages. - steal pages from wherever. - advantage: don't have to be too smart - clock operates over all pages. - disadvantage: active process gets to touch its pages, runnable/ready process isn't touching, so even its recenty used pages. - running process greedy with memory can effectively "evict" a runnable process from memory if it wants to. linuxlab likely unhappy. - (to me) unfair - the little process with a little memory footprint that doesn't use the cpu much (but maybe often) will have to wait to get his pages brought back in... - what happens while a process waits for its page to come back in? -> another process gets run. Fixed: - everybody gets the same number of physical pages. everybody gets k. - advantage: "fair" - disadvantage: you've "allocated" pages to processes that don't need them (assumes that different processes need different amounts of PHY memory) - could look at it as uniform, equal, since I think it does adjust to varying number of processes. Proportional: - allocate pages based on total memory allocation (how much logical address space it has mapped) - advantage of giving lots of physical pages to processes that want lots of memory - coarse: that address space isn't necessarily used. Priority: - if my process has a higher priority than your process: - your pages become mine. Global vs. Local - whether, when looking for a page, the kernel can choose a page that doesn't belong to the currently running process. - review strawman, a bit. Allocation to avoid thrashing (apply the working set model) - associated with each page is a "time" of last access, - any page accessed less than \tao ago is in the working set - any other page is not. - a question is how \tao is maintained - virtual timer is resilient to the process getting starved by the cpu scheduler. - real clock might not. - if our process ran through all the pages looking for one outside the working set and found none, - take an invalid page from another process. - what if none? - steal? - eat it (page out one of our own) - swap something out (maybe yourself) yield plus give up your memory. --- Lecture 20: Special Veteran's Day edition; observe veteran's day December 30! - note: december 30 is not the anniversary of the end of world war I. - the anniversary of the sinking of the uss monitor (1862) - formation of the ussr (1922) - execution of Saddam (2006) Grad school info session after class in 1122. why? what do comittees look for? what to expect? how to choose? how to apply? 5:00 Carl Kingsford - Introduction & Non-technical skills you need to succeed 5:15 Jennifer Story - The application and admissions process, and departmental support for grad students once enrolled 5:30 Bill Gasarch - What admissions committees look for, undergrad honors program, and picking a grad school 5:45 Vibha Sazawal - What a professor does / job prospects / research in software engineering 6:00 Neil Spring - Why to go to grad school, and a little on doing research in systems and networks 6:15 Amol Deshpande - The importance of doing research as an undergrad, and research in databases 6:30 Michelle Hugue - How to pick a grad school and "how to have a life as a grad student" 6:45 Graduate Student Panel Memory-mapped files. The backing store for a region of memory can be: 1. the source of all zeroes 2. swap space (a block device whose sole purpose is to hold pages) 3. a file on the file system (also mmap a device?) Not the same as memory-mapped I/O. whole different thing. (explanation here) Memory allocation in the kernel. Constraint: we'd like to be able to manage large contiguous regions to hand off to devices. we may not be able to page stuff out. (strong may -- very likely that we can't page stuff out) Buddy system: start with a large contiguous block. interface: as malloc, free. implementation: store a table of lists of free blocks of sizes equal to powers of two. 1M 512K < > 256K < > ... 4K < > < > 2K 1K 512 allocate: a) look for the size in the table, if found done. b) else break a larger-sized allocation. free: a) put your block back. b) if your block has its buddy, can reconstruct the larger allocation. Slab allocation in kernel, we have many structures - open files, mutexes, pcbs, network connections, skbuff. could imagine a malloc and free instead: allocates space for an array of these small, homogeneous objects. limit the waste due to internal fragmentation by packing objects. advantage is that we can allocate quickly because there's a region of memory sitting there waiting. relatively little of the benefit of the slab allocator comes from being able to re-use allocation. most of the benefit of the slab allocator comes from re-using initialized objects. interface: allocate ( kmem_cache * ) ; the cache knows the size and type of the object free ( kmem_cache *, the object ) kmem_cache *cache_create( char *name, size_t size_of_object, int alignment, void (*constructor)(void *, size_t), void (*destructor)(void *, size_t)); back end: caches can be full, partial, - should try to fill or to empty the partial ones empty - could release this cache to the system if there's mem pressure other kernel memory allocation issues: can you allocate memory inside an interrupt handler? (sure if it was a trap/system call) it can be the case that your request for memory inside the kernel *could* be satisfied if only you could clear some buffers out to disk or rearrange memory in some other way. but to do that would require waiting or giving up locks, which you might not be able to do. in linux kernel: there are two flags to kmalloc - one if you'd rather fail than wait. (you're holding locks) (GFP_ATOMIC) - if you'd rather wait than fail. (you're working for the application) (GFP_KERNEL) in kernel land, stacks are very small: should avoid allocation on the stack. end of kernel memory / high performance memory allocation prepaging when swapping a process in, try to bring in all the pages it is likely to need and not wait for it to demand page everything in. page sizing: why have small pages? less internal fragmentation, more compact use of memory might be able to find an unused smaller page more easily than an unused large page (fine grained) smaller than ridiculously large pages would transfer off disk more quickly (time to seek the disk head vs transfer time) why have large pages? fewer page faults -> you bring other good stuff in, especially for sequential access. fewer levels in the page table -> smaller, and fewer access on a tlb miss tlb reach -> you can have more of memory accessed without a tlb miss. How does all of this alter writing applications? (a) locality - when your application touches memory, we don't want to jump around. - i'm curious if you had a very large two dimensional array, what's the performance difference between iterating the right way and the wrong way. - i'm curious if you could write a generational garbage collector that would reduce paging. - i'm curious how, when writing a web proxy cache, the cache makes a decision about what stays in memory and what if anything goes on disk. Lecture 21: How does all of this alter writing applications? (continued) If there's lots of space, might your pages be written to disk anyway? Lisee: all the time on windows. When might this be bad? your application happens to have a plaintext password in memory, unencrypted private keys. anything you don't want the FBI to know about.... if the box crashes and somebody takes it. How much swap could you possibly need? 4GB times max processes - minus the code, the pages are straight off disk anyway. The OOM killer. WindowsXP expands page file. Linux uses a static sized swap partition CW if the swap partition is large enough, the machine will fall down paging before it runs out of memory. But likely not that large (or processes don't have such a large overall working set) Attempt to pick a victim and kill it. Individual processes have resource limits; if a process exceeds its limit, it's killed. (or malloc fails) How does it choose the victim? unlikely to be a good idea: ask the user. Kill them: - large memory footprint - low priority (niced) - new processes (not wasting a lot of work... they might be to blame) - may be a good idea: those with large VSZ to RSS ratio (lots of stale pages) -> leaky program - lots of children (fork bomb.) Don't kill them: - small memory footprint - important supervisor / root processes on the assumption that they're essential. - exceptional case: if it's pages were provisioned for access to hardware (you can potentially give user processes direct access to hardware -- pages shared) (copies cost time) - the process that locks your screen (i.e., xlock) (parameterize the oom killer with user feedback.) FILE SYSTEMS. what's a file? named array of bytes persistently stored across reboot typical. some extra metadata what's a directory? collection of files and directories what can you do with a file: delete/unlink exec() it stat() - query for the metadata create() open and close() read and write() seek() - advance or return to a specific position for the next read/write seek + write provides append() clone/copy (usually implemented as open of the new thing, read the old one, write the new one). link - create another name for the same data. what can you do with a directory: opendir - given the name of a directory, provide a file handle for reading each dirent (dirent = directory entry combining name and some metadata) readdir - closedir - seekdir no writedir (you must use open or link to add entries.) this is the code in opendir(2) for determining if a file with name "name" is in the current directory ".". struct dirent *dp; size_t len = strlen(name); DIR *dirp = opendir("."); while ((dp = readdir(dirp)) != NULL) if (dp->d_namlen == len && !strcmp(dp->d_name, name)) { (void)closedir(dirp); return FOUND; } (void)closedir(dirp); return NOT_FOUND; How does this stuff look on disk? What's a disk? magnetic disk many platter. mechanical head head is fixed while the disk spins. cylinder, cylinders carved into sectors Ultimately we want the interface of blocks. Once upon a time it was tremendously important to know the geometry of the disk. when disk read/write scheduling could yield performance. Now we have the block interface we want - locality still matters. Things we have to put on the disk boot block. <- whatever information is required to tell the bios/cpu what to load partition table <- take a large disk and carve it up into separate file systems. number of blocks, the size of those blocks, etc. directories file control structures (inodes) the blocks of the files. Things the kernel must have in memory to operate on files. partitions on the disk file systems that are mounted and where they go. open file table list of all the files that are open, including position (offset) (you might have the same "file" in the open file table twice) per-process file table. file descriptor -> entry in the open file table. Metadata stored in an inode: owner, group permissions (read ctime (creation time) mtime (modification) atime (access) size direct, indirect, doubly-indirect blocks not actually the file name.... you couldn't have hard links if so. not actually in an inode, but you could store type information (typically filename extension, first bytes of the file, resource fork) Lecture 22: The evolution of file system implementations. review: file systems. operations over file systems unix file systems separate data (blocks) from metadata (inodes). what's the difference between a hard disk circa 1980 and a hard disk 2008? capacity now huge. physically much smaller. spin faster. less time waiting for the bits to arrive under the head less time waiting for the requested bits to pass under the head. move the head faster. fazter seek. access time is faster, but still limited by physical stuff. cache on a disk drive. ask for block i. once you get block i. ask for block i+1. -> ideal would be to have a cache of the cylinder being read so that you don't have to wait for the platter to make a complete revolution before getting block i+1 -> alternate is to use knowledge about the disk to lay out the blocks so that the next one hasn't passed under the head when you're likely to ask for it. file system usage conventional wisdom - most files are small. - most files are short-lived. clearly files in /tmp, .o file that gets overwritten or replaced as you edit (esp. if you don't save) - most bytes in large files. TODO: find counterexample metaphor just to make this seem not entirely intuitive. how to allocate space on a block device for a file. goal: have blocks that are in the same file in the same place. strawman: contiguous allocation - store in the file header (inode) where the file starts, and how it is. - good: simple random access. - good: sequential access should be fast. - bad: might not even fit (not have available contiguous blocks) external fragmentation - bad: might not be able to append; if the file grows there might be no place for it. slightly more real: linked allocation - store in the file header a pointer to the first block. - block is conceptually "linked" to the next block in the file. - good: we can grow files. - bad: random access is terrible because you have to traverse the list. (and the list is on disk with a pointer in every block) - bad: terribly unreliable... if you lose a block, the rest of the file is gone too. :( - good: the blocks that are "free" can be in a file too. except you don't want to do this... why? there's no easy way to identify contiguous free blocks. there's no easy way to layout a file in contiguous blocks. - maybe: could insert an integer number of blocks in the middle without wrecking everything. FAT: file alloation table. take linked but collapse all the pointers into the FAT. good: linked list free block thing (if we want to, which we don't) good: sequential access can be fast. good: random access can be fast IF the FAT is cached. bad: ? doesn't resist failure well? bad: ? FAT doesn't always fit? Indexed allocation: filename maps to an array of pointers to blocks. good: all of the block numbers associated with a file are in one place. bad: you have to decide the max size of a file, and it will be small. Multilevel indexed allocation: the last few pointers to blocks don't point to data blocks, instead point to blocks full of pointers to blocks. instead point to blocks full of blocks full of pointers to blocks. good: small files are well handled, in the inode, good: large files are still handled mostly sort of well. bad: random access over large files might not be so good. don't have the indirect blocks in cache. Early unix file system allocate the outer cylinder to inodes. one big array of inodes. inode number is an index into this array. directory is a list of (name, inode number) pairs that is otherwise a file. good: directory scheme is simple; straightforward version of multilevel allocation. bad: unhappiness in large directories. (linear search) bad: separating the data from the metadata. if we want to look through a directory, many long seek. bad: I've decided how many files I can store. (array of inodes) Fast File System creates "cylinder groups" which are inodes and blocks located near each other. try to keep inodes referring to blocks and inodes in the cylinder group. (why can't you just put inodes wherever you want) allocates data blocks first within the same cylinder group then adjacent, then anywhere... Lecture 23: P4. amend course late policy to permit 2 uses of the one weekend late. if you haven't used any, late for p4b, p5; if you have used one, this one's free. P5. work in progress. http://www.cs.umd.edu/class/fall2008/cmsc412/project5x/ if materials aren't out by monday, I will be apologetic. More Fast File System. Key piece: cylinder groups. Also: ability to store fragments. You've setup a large block size. -> internal fragmentation. Take a block, break it up into, say, four fragments, store the last fragment(s) of a file in this block. EXT2 structures on disk: inode: mode, owner, size, timestamps, 12 direct blocks, indirect, double indirect, triple indirect. superblock: magic: EF53, version, mount count, block size, number of free blocks, free inodes, first inode (as opposed to our GFS2 convention of 1) group descriptor: block and inode bitmaps. for each of the block groups (renamed version of cylinder groups) feature: preallocation: imagine appending to two files into the same directory at the same time. (syslog, and messages both in /var/log) how would you expect to see blocks allocatd? the naive scheme for file A and file B would be: AABB________ AABBA_______ AABBAB______ AABBABA_____ AABBABAB____ AABBABABABAB instead: AABB________ AABBA...____ AABBA...B... AABBAA..BB.. AABBAAA.BBB. AABBAAAABBBB NTFS Master File Table: 1-4KB, may include the entire file. Directory is a B+-trees. (B+-tree has all the data in the leaves.) Can load a block at a time (the expensive part) Treat searching through the block as free. To refer to blocks of data on disk, NTFS uses "extents" extent: -- no extra disk seeks for indirect blocks when walking sequentially through a file. could construct sparse files. (can still do this in other filesystems, but seems nicer here) Can still have indirect-block-like lists of extents. How we track free space on the disk? In GFS2 and some actual file systems, we'll store this information in a file. That info stored as a bitmap. (not as a list) Why prefer a bitmap over a list of free blocks? can find the contiguous ranges of free-ness. The Buffer Cache. Between the kernel supporting read() and write() functions and the block device that is the disk is the buffer cache. Like all caches, we have fast memory, slow disk, and an expectation of locality. will take requests of the form: "give me block #n", "mark this block as dirty", "release this block" behind the scenes, the buffer cache request will sleep your process waiting for the block to be retrieved. behind the scenes, the buffer cache will have its dirty blocks written back to disk. exactly where they came from. (role of updated -- for some operating systems there's a daemon process in charge of making sure dirty blocks are written back.) Short-lived files might not leave the buffer cache. (not a will not leave... not a will never write to disk for any of the writes required to create a file.) Yes, inodes are in the buffer cache. Remember to mark things dirty. GeekOS implementation stores the buffers in a linked list, moves recently-used buffers to the front. But how to support mmap(). traditional scheme: is to have a page cache that lives on top of the buffer cache. allows you to map a buffer into user space (requires a copy) modern scheme: a unified buffer cache which combines the page cache and the buffer cache. for the interested: http://www.usenix.org/publications/library/proceedings/usenix2000/freenix/full_papers/silvers/silvers.pdf Log structured file system insights: * access time for a disk is going to remain high. * throughput increases (assuming you don't have to seek) bit density, rpm. disk will start to look like tape. * memory on a machine kept increasing -> lots of read cache. don't have to read anymore. write for durability. - let's write our file system as if it is a log. bundle our writes (the blocks, the inodes) into contiguous segments (~1MB) at full disk throughput rate. - We can update the inode so that the new location of an old block is current (in the new segment). - First problem: where is my inode? - wrong approach: if we let the number change, we would have to write the directories that reference that inode (might be able to make this work, but not the scheme that two really smart guys came up with. mendel rosenblum, john ousterhout. (fun flamewar if you search for ousterhout and lfs implementation)) - ifile. mapping from inode number to block on disk that stores the inode. - the ifile has an inode. - store the location of the last ifile inode in the superblock. close to a checkpoint. - Second problem: we will fill up disk this way. - There is a seguse file which is the file that tracks which segments are free. - (a) we have to ensure that there's some place to start replaying the log from making it possible to GC segments "before" that point in the log. - (b) take partially-filled segments, and rewrite their data into new segments. we take this old data and make a segment full of old data. ideal we will not have to clean this segment, because the data blocks seem useful and are not short-lived. - "the cleaner" - is one place where LFS's get screwed up. - Third problem: what is the most frequently changing part of an inode. atime. - store the atime in the ifile. or you could noatime. (which saves you from some power wastage on ordinary file systems.) Lecture 24: project 5: uploaded an empty disk. have inconsistent states: ufs/ffs fat ext2 delete -> could crash in between writing something and something else (list of free blocks, list of free inodes, directory) move -> hopefully not lost, but then again, hopefully not linked twice incorrectly. if we were to crash and reboot afterward, figure out what happened -> recover a consistent state. "don't" have inconsistent states: log structured - the operation did or didn't go through : if the checksum on the segment is good, it worked. journaling - write metadata (inodes and stuff) into the journal, log of our operations, often not the actual data. Journaling define: metadata to be inodes, directories, free block maps, free inode maps, any of the structures associated with maintaining the file system. anything stored that is not data. concern about consistency: inode reference count could be made to be wrong. allocated block that is in the free list. unallocated block that marked in use. could have written the inode before writing the data (u r screwed) Three flavors: writeback: write our metadata into the journal, data elsewhere, whenever. -> could lead to inconsistency == 1) it's not so bad (the inconsistencies aren't problematic, esp. if directories count as metadata; 2) we live on the edge for speed. ordered: same deal, except write the data first, metadata second, attempting to guarantee consistency. data: add data to the journal: not only do we store the operations, we also store the data. not quite so bad, in the sense that your disk head doesn't have to move so much can have a log that permits undo not entirely sure why can have a log that permits redo might not have to read (but probably have to for other reasons) can have a log that permits both ext3 (I think) I/O scheduling: Possible to have more than one read or write request outstanding for the disk: how? - many processes - each process wants to read and write, cp example. - with 8 concurrent cp's: expect 8 read block and 8 write block requests at the same time. (write likely not blocking.) - crazy array of not blocking writes. - nonblocking reads. (prefetch) Our interface to the disk is: give me this block, I will wait for you to provide it. FIFO : get a request, send it to the disk. first in first out. why does FIFO suck? SSTF SATF SPTF: shortest seek/access/positioning time first greedy starvation SCAN / Elevator / LOOK favors the blocks in the middle (could use it.) C-SCAN SCAN, but only one direction, doesn't favor the middle (but I guess the extra long seek) Deadline scheduler every request goes into two queues: 1) to ordinary queue hoping that C-SCAN is enough 2) to a read fifo or a write fifo. if a read has been waiting of 0.5 seconds, just issue it. if a write has been waiting for 5s, just issue it. (not 5s from when the app decided to write, 5s from when the buffer cache decided to write back) Anticipatory scheduler look into the future, maybe don't move the arm. predict whether you can reduce the amount of time to seek by waiting for the app to make a request for a nearby block. Lecture 25 Plan for the future: Finish disk stuff (raid, mostly) S12.7. NFS. Basic security stuff: unix-y, then BAN logic of authentication (not in text) Distributed systems: Lamport clocks 18.1, Two-phase commit 18.3, Byzantine Generals 18.7 P5: revised test suite posted. RAID Disks are physical, they fail. people put them in laptops and the laptops fall. friction eventually causes wear. power surge / bad power. air inside the disks -- disk head floats on top of the platter if that air gets dust, that changes the airflow, and can make the head crash. could also totally lose data by the software/os/whatever breaking. research by Dawson Engler on model checking filesystem implementations. (question on whether journaling avoids the troubling bits.) could also totally lose your data through various catastrophes too numerous and horrible to list. see fallout 3. Disks are cheap. Disks are slow. redundant array (inexpensive, independent) disks. Levels 0, 1, 5 know. others recognize. Level 0: stripe. take more than one disk (e.g., 2); each block appears on only one disk. useful for: throughput. we can stream write twice as fast, stream read twice as fast. not useful for redundancy. probably more dangerous. Level 1: mirror. take more than one disk; each block written to all/both. useful for: redundancy also: parallel reads. don't need to read from both disks to make sure data is correct level 2: error correcting codes (7 disks) write the actual data bitwise, then include error correcting codes on extra disks. 4 bits worth of data and 3 bits of error correcting info. Error Detection: parity bit. can know that there is an error, don't know where. Error Correction: can tell where that error is. 2-dimensional parity. (neil's favorite) 1101-1 1011-0 0110-0 1001-0 ----- 1001 0 level 3: almost always, disks know when they break. have the additional information of which bits are broken (the disk told us), can use the parity to reconstruct. level 4: takes one disk, makes it the parity disk, but records block-level parity. to write, read from the disk that stores the block we're about to overwrite, read from the parity disk the parity for the block we're overwriting recompute the parity (flip any bit that we changed in overwriting the block) then write both back. hammers the parity disk. Level 5: block level parity scattered on different disks. rotate parity duty to each disk in the set. still need to read/modify/write, but the overload is shared. level 6: a bit more redundancy, enough to handle two-disk failure. improve write performance? level 0. (maybe 5, could write some stuff in parallel, but you'd have to read first... and that's not fun.) Very end of disks 1) there are some really cool graphs in the textbook showing the technology trend of cheap, large disks. cost per megabyte keeps dropping. 2) the story of the physically shrinking hard disk is the first and possibly most compelling story in the "innovator's dilemma" (christensen). briefly retold. NFS : network file system we would like to share. we would outsource our backup (disk fail) could read about it in the text... paraphrased from Based on a Sun implementation of *remote procedure call.* main idea is that you want to provide a function-like interface because it's easy to program. don't expose the messages that are being sent, only functions that return. Machine independent XDR (external data representation) some machines use LSB, some MSB (and there are other differences); we have to agree on a representation for integers and such on the network. (could have chosen alternate schemes like receiver-makes-right) Server is stateless. server does not care what files you have "open". it only cares about the reads and writes. client can crash and restart and the server shall not care. How does NFS know whether to allow a, say write. Might like the idea of your file server authenticating users and allowing those users to do from their client anything they can do locally. NFS does not do that. NFS trusts the client to accurately report the user-id of the requesting process and assumes that the uid is the same on the server as on the client, and if the uid could write the file on the server, the write is allowed. How does an nfs server prevent me from here at this desk asking to write a file as root on nfs-server.example.org? a) list of allowed machines is whitelisted at the server b) pray that the network prevents other machines from impersonating a whitelisted machine. recent implementations allow indirection to change uid at client to uid at server binding root_squash! take any request on behalf of uid 0 (root, superuser) and pretend it came from uid -1 (nobody, unpriv). anecodotal: when the server goes away for a while, the client becomes very unhappy. RPC layer tries to paper over message loss. common practice in networks is to wait longer each time. if you sent a message and it didn't arrive, it might be because the network is overloaded. if so, we should all send more slowly/less often. why it matters: while NFS was designed for local workgroups, wide area network is expected to have more failures, delays, and losses. (could try to fix some of these issues, but ... not really in the model to expect failure in the filesystem-made-remote). hard and soft options to nfs. -- Lecture 26: Security day. Salt, password file. uids. Classic stories. Ken Thompson's hack. Morris Worm. Tenex hack. quick review of keys first. Burrows-Abadi-Needham Logic of authentication. How does a unix system know that you're you? 0) it doesn't because you hacked it not to care. * 1) password. 2) you have a key (ssh identity i.e., private key) 3) krb. kerberos. 4) secureid (calculator thingo) two-factor two-factor means something you own and something you know. In each case, there is some process in charge of performing this verification and giving you a shell (process) running as you. PCB has a uid. How do we convert from happy user name to uid? /etc/passwd file. Once upon a time, the passwd file stored passwords. Now, passwords typically stored in /etc/shadow Difference being /etc/shadow readable only by root (the superuser that needs to be able to access the passwords. Some idiot* decided to put lots of other info in /etc/passwd, which get used by other non-root processes. (*not necessarily an idiot; I just mock the decision to add it into passwd.). How did it store such passwords? 1) it would probably be a good idea not to store the passwords in plaintext. 2) store? MD5(password): if you know the password, you can compute md5(password) and know if you got it right. problem 1) the space of choices for passwords is kinda small. you can try them all. you might see the same password twice. 2) can pre-compute the f(password) here. and store instead of the dictionary of all words to try, the list of all f(dictionary word) outputs. 3) {hash(password, salt), salt} salt perturbs the algorithm in 2^12...2^32(?) possible ways in this scheme, if you compute hash(aardvark,12) it doesn't help you further down the list of passwords. (must compute the hash(password, salt) for each (dictionary word, salt) combination) Tenex. Before unix. Thought to be very secure. stored the password explicitly. (none of the hash stuff.) code for determining if you could get in was: for(i=0;i<8;i++) { if(entered_password[i] != real_password[i]) goto login_failed; } gave access to the source code to a team of people. give them a normal user account. scheme: force a page fault to occur at each boundary between characters, so that if the right character was chosen -> page fault. wrong character, no page fault. all you have to do to get a password, is choose the individual characters correctly. Thompson. "Reflections on Trusting Trust" assert: turing award lecture. first rootkit. want to build a backdoor into any system. obvious beginning: modify login.c if(!strcmp(user,"secretme")) { login as root without checking the password. } people might notice that this code is there and warn others. step 2: modify the compiler so that when it is compiling login.c, it inserts the code to let you login. problem: people might notice in the source to the compiler that there's a piece of code there to insert the backdoor. step 3: compile the compiler so that it inserts the backdoor insertion code when compiling the compiler. step 4: obfuscate the back door. small problem that your backdoor virus-spread code might break stuff and be detected or be too specific and die. be very afraid. Review of encryption keys. public / private key. standard assumptions apply. what happens if I encrypt something with your public key? it can only be decrypted with your private key. only you can read it. what happens if you encrypt something with your private key? a) everybody can read it. b) everybody (can, if your public key is among relatively few possibilities) knows you wrote it. what if I encrypt something with your private key? invalid question. signature: basic idea: message, { message }_Kprivate better scheme: message, { hash(message) }_Kprivate certificate: scheme: Kotherguy_public, otherguy_info, { hash(Kotherguy_public, otherguy_info) }_Kprivate IF neil and aaron share a secret S. can make slightly faster signature behavior by message, hash(message, S) Let's assume that we are evil. During the run of a protocol (some alices and trying to talk to some bobs) A1: brute force try to break the crypto. if you have a fast enough machine, people who configured the devices didn't use large enough keys. A2: attack the message. encrypted pieces of a message could be separated... you happened to see a message: daniel should get extra credit qnavry fubhyq trg rkgen perqvg nathan should never get credit anguna fubhyq arire trg perqvg could substitute pieces of still-encrypted but known-meaning messages to get something terribly wrong (but still looks good).. this is why there's block chaining, so that the individual parts of a message are tied together. A3: attack the keys - using the dictionary. A4: beat the protocol by replaying old messages. - BAN to address. Needham and Schroeder came up with a protocol for authentication based on symmetric keys. It became popular, and then somebody found a hole. Facing this shame, they set out to never let that happen again. Goal of an authentication protocol: A knows it's talking to B. B knows it's talking to A. optional: A knows that B knows it's talking to A. B knows that A knows it's talking to B. mutual authentication. The protocol at issue: A->S: A,B,N_a S->A: {N_a, B, Kab, {Kab, A}_Kbs}_Kas A->B: {Kab, A}_Kbs B->A: {Nb}_Kab *** A->B: {Nb-1}_Kab Problem: E can, having sniffed all of these packets, crack Kab. (long after everything's finished.) E can replay message 3. (just knows what it means, doesn't have to synthesize it) Can then answer the Nb challenge thing. B believes this is a good key, B believes "alice" has been authenticated by S. Lecture 27: -- The people in charge would like me to remind you about course evaluations. I will post a link very very shortly for an evaluation that has questions I care about. likely on the forum. please do that one too... but it has no consequences. (and you can complain in detail.) Plan: BAN, Lamport logical clocks, Byzantine, 2-phase-commit. Final review. Done. Enumerate assumptions: [A,B] believe S has jurisdiction over A<-K->B A can generate a fresh Na, B a fresh Nb. [A,S] believe A <-Kas-> S [B,S] believe B <-Kbs-> S Rules: message meaning rule: (X is a message) A sees X if X is associated with a secret (e.g. Kab between A and B, encoded with B's private key) that can convert into: A believes B once said X nonce verification rule: A believes B once said X if A believes X is fresh. A believes B believes X jurisdiction rule start from: A believes B believes X if we add: B has jurisdiction over X get: A believes X freshness is contagious: if something is fresh in a message, all things are fresh in that message (the sender is assumed to believe the things it sends.) "in the message" likely best defined to be protected/made an atomic message by encryption. A->S: A,B,N_a S->A: {N_a, B, Kab, {Kab, A}_Kbs}_Kas apply message meaning: A believes S once said {N_a, A <-Kab-> B} apply our freshness rule: A believes S believes {A <-Kab-> B} apply jursidiction A believes {A <-Kab-> B} A->B: {Kab, A}_Kbs apply message meaning: B believes S once said {A <-Kab-> B} we get no freshness. B->A: {Nb}_Kab *** B once said Nb B once said {A <-Kab-> B} A->B: {Nb-1}_Kab get nowhere (B shouldn't believe A<-Kab->B, because he doesn't know that S believes it.) Kerberos: A->S: A,B S->A: {Ts, B, Kab, {Ts, Kab A}_Kbs}_Kas A->B: {Ts, Kab A}_Kbs, {A, Ta}_Kab B->A: {Ta+1}_Kab Ts and Ta are timestamps. Kerberos requires synchronized clocks. List assumptions: [A,B] believe S has jurisdiction over A<-Kab->B [A,B] believe Ts is fresh. [B] believes Ta is fresh. [A,S],[B,S] believe [AB]<-K[ab]s->S (Kas and Kbs are good) S->A: {Ts, B, Kab, {Ts, Kab A}_Kbs}_Kas meaning: S once said ... freshness: ... message is fresh. nonce (Ts) verification: S believes ... jurisdiction: A believes ... S is the KDC, providing the TGT (ignore if you don't recognize the acronyms.) Diffie-Hellman is neato. Lamport clocks. == - collections of machines working cooperatively to provide some service. cluster of mail servers maybe. - try to debug when you have many asynchronous processes. create 'snapshots' or an image of what every machine was doing at the same time. rather impractical to get the obvious snapshot. (collect all state at once from all machines.) - key challenge in emulating a snapshot is that all machines have a different clock. - choose instead a model of consistency, of ordering, that is less rigid: causal ordering. if A caused B, make sure that A "happened before" B. otherwise, we don't care so much about the ordering. - Lamport scheme: 1. each message has a logical timestamp. 2. if you send another message, increment the timestamp. 3. if you receive a message with a higher (or equal to your own) timestamp T, your new logical time is T+1. - can presumably better understand the system's state by not having any effects without causes. Byzantine Generals == - two (not byzantine) generals problem: messengers can be caught and killed; only means of communication. hard enough. - byzantine generals: byzantine generals don't play the protocol: imagine we have three armies with generals say to one general: lets attack! and say to another: let's go home. - byzantine failure model assumes that faulty components are (pretty much) arbitrarily faulty. they don't just stop. - typical scheme: make them sign their messages, catch them in a lie, have alternate, redundant, likely-not-faulty spares. Two phase commit == distributed database (pieces of the database are on different machines.) want to construct an atomic transaction atop this database. can't just iteratively commit. 1. everybody "prepare": write the stuff to disk, write that you're "prepared", then respond. now if a node fails, when it comes back, it has to investigate, by asking all of the nodes involved in the transaction whether it did commit. 2. everybody 'commit': only if everybody is prepared. as soon as any of the database servers writes "commit" to disk, it's committed. === Review. 1. exhaust test. 2. give yourself a big enough disk. 3. review outline on-line Review what is an operating system? what illusions does the operating system provide to applications? how do devices tell the processor there's something to do? how does the processor get information from devices? what's a pipe? a named pipe? a socket? what's the difference between a system call and a library call? what's a microkernel? what are the challenges with a microkernel? what are the advantages of a microkernel? what's an interrupt? how does it differ from a trap? how do interrupts help with preemtive multitasking? what's a real time system? what characterizes the apps in a real time system? what characterizes the scheduling in a real time system? what do embedded systems lack? what's a file? a file descriptor? a file system? what is the interface of a file system? what's a shell? how does a shell start (not on geekos)? what's a process? contrast thread. what does a process have? what are its states? what is stored in the PCB? what's a pid? what's a zombie? how much state is needed by a zombie? where are page tables stored? how does a context switch happen? why not switch frequently? what's thrashing? what's a quantum? what are short, medium, and long term scheduling? what distinguishes dispatcher from scheduler? which processes live forever? what does "init" do? what is a daemon? how does a process become a daemon? how is system() implemented? pipe()? how does a shell implement the | operator? why separate processes by pipes not threads? why use shared memory? what is sigpipe? what's the diference between UNIX domain sockets and Internet protocol sockets? why use one over the other? (unix domain is not udp.) know the socket system calls. accept, bind, read, write, close, shutdown, connect, listen. user level threads. what that means. what process is required to create a user level thread, and what (arch-dependent) calls are involved in creating a new user level thread. distinguish from kernel thread. scheduler activations. what and why. thread pool. what and why cpu scheduler goals. types: FCFS, SJF (SRTF), MLFQ. adjusting quantum. detecting i/o bound. what it means to block on i/o or give up the quantum. is fairness good? reentrant functions. don't hold state across invocations. what does signal() do? contrast kill(). alarm(). multiprocessor scheduling: affinity, load balancing. SMT. memory stalls. synchronization, race conditions, locks. reader/writer locks. starvation for whom? the critical section problem: four goals/criteria. Peterson's solution. Applicability. Insight. Bakery algorithm. metaphor. insight. key mechanism. test-and-set/TSL. how to implement lock and unlock. swap/xchg/swp. how to implement lock and unlock. why these primitives aren't enough. semaphores. P/V proberen/verhogen wait/signal how implemented (you know this well.) implement producer/consumer bounded buffer. what does the producer wait for? what does the consumer wait for? implement reader/writer locks. who starves? condition variables and monitors condition variable operations motivation for monitors. implementation of semaphores with condition variables and monitors. implementation of monitors and condition variables with semaphores. producer/consumer with monitors. signal-and-(continue vs. wait) dining philosophers solution with condition variables. transactions: ACID. name each, what they mean, how they are ensured. two-phase locking. what it's for, how it works. when locks can be released. undo and redo logging. deadlock: four conditions. prevention, avoidance, and detection. mechansims for each. incl. banker's alg. how to decide which processes to kill == end of pre-midterm content == memory management: stall, protection, address space. overlays (old school) contiguous allocation, limitations. fragmentation: internal, external. segmentation vs. paging. process address space layout. page table. multilevel page table. page directories. page table bits. TLB. what it does. why page size matters. inverted page tables, hashed page tables. PAE. position-independent code. relative addressing, indirect addressing. why have PIC? what's a page fault? what's the task of te pager? page replacement strategies: FIFO, OPT, LRU, Second chance, Clock, MFU, LFU, working set contrast insight, goals, weaknesses. page buffering, how to decide when to write dirty pages out. page allocation, minimum pages required, global allocation, fairness, proportional, priority. memory mapped files. kernel memory allocation: buddies and slabs. three states of a slab. prepaging: what it is, how it helps. contrast medium-term scheduling. OOM. file systems: interfaces. disk geometry, performance properties. file metadata. berkeley study conventional wisdom about files. contiguous allocation, linked allocation, FAT, indexed allocation, multilevel indexed, early unix file system (very gfs2), fast file system and what made it faster. ext2. NTFS and extents. understand a File Allocation Table. Free blocks bitmaps. why they're good. preallocation. the buffer cache. contrast unified buffer cache, page cache. the log structured file system: motivating insight(s). implementation issues: cascading changes to update a block of data. the ifile and storage of atime. the cleaner. journaling file systems. insight. writeback, ordered, data. redo/undo distinction to log structured. i/o scheduling: FIFO, SSTF/SATF/SPTF, SCAN/Elevator/LOOK, C-SCAN, deadline, anticipatory. what each is good for. what can go wrong. does it help to have deeper queues? what makes more than one block show up on the queue? RAID: levels 0, 1, 5. cost. mirror, stripe, parity. NFS: RPC, XDR, statelessness. access control (uid reporting, exports file) operations. rfc 1094. Security: Thompson, Tenex, /etc/passwd salt. public/private keys, signatures, certificates. block chaining importance. authentication protocol. BAN logic ops: nonce verification, freshness, message meaning, jurisdiction, Two-phase commit. Byzantine generals. Lamport logical clocks. === From the vocabulary: ignore little's law ignore interrupt coalescing. ignore device drriver. [sic] Review process -> daemon. a) no attachment to a tty b) background if(fork() == 0) { close(0); close(1); close(2); /* do my stuff. */ } else { exit(); } PAE: scheme for allowing your page directory to point to a 2MB superpage. potential to use a hierarchical page table to support both large and small pages for that tastes great / less filling best of both worlds joy. slab allocation scheme: construct homogenous arrays of kernel-required objects just so that we dont have to malloc/initialize/free/bzero them frequently. slab allocator tries to maintain mostly-initialized objects for us to reuse. if a slab is *partial*ly full, we'll try to use it. if a slab is *empty*, we'll try to release it (if under memory pressure -- somebody else needs a slab. if a slab is *full*, we're happy. no fragmentation. hooray. if you need another object and all the caches/slabs are full. advantages of the slab: you can pack these things together. avoid fragmentation. avoiding extra reinitialization saves (but unclear how much -- depends on lifetime, difficulty in setting up the object). implementing pipe. remind yourself of popen