CMSC 412: Operating Systems your humble instructor: Neil Spring your awesome TA's: Vasileios Lekakis; Bo Han your excellent project guys: John Silberholz; Calvin Grunewald. Lecture 1: 9/1. vocabulary: review material will be in understanding, defining, stating the utility of different concepts identified by terms. these notes; everything I type, will appear on-line. - caveat: i may not type everything important. - i can't type the whiteboard. significant programming assignments in this class. - in C. - debugging can be difficult. - you will pass this class iff you complete these assignments. goals - modify; extend; understand... syllabus review. forum.cs grades.cs.umd.edu cs412xxx submit.cs.umd.edu linuxlab.csic.umd.edu (can login remotely) What's an operating system. - program that's "always" running. - interface between hardware and (other) software - the manager of resources, devices. - feature: has "complete" access to the hardware. - provides an illusion (abstract machine) to applications: - isolation: the app appears alone - incredibly large, linear space of memory. - files and directories - as if devices had buffers (e.g., character devices) - ?? - and also the means to break that isolation - shared memory, pipes, locks, signals. -> ignore: <= 2000 argument from ms antitrust case over ie being "part of the os" - we'll treat the "kernel" and os as pretty much the same thing. (no GUIs, no X, etc.) (no shells.) Is printf() a system call? Applications Libraries ------- <= system call interface Kernel Hardware Apps will invoke a system call, that will transfer the running execution into the kernel. printf() will instead format all the strings, and call write(1, buffer, length); <- the write() system call. ^^ STDOUT_FILENO there exists the syscall table, an array of function pointers, the index of which is placed in a processor register when the call is invoked. File descriptor vs. file pointer. fprintf(stdout, "blah\n"); (same as printf("blah\n");) - or - FILE *f = fopen("/tmp/x", "w"); << not a system call. fprintf(f, "blah\n"); f points to a structure that includes a buffer (I think) and the file descriptor number. Kernel and the application share a view of an array of files, the file descriptor is the index into this array. fopen will call open, which will return the next available index into this array. (alternate meaning of file pointer is the current position in a file). If you have a function in mind, how do you tell whether it's a system call or a library call? - man (function) -> if in section 2, it's a system call, if in 3 it's not. - strace (command) from the shell. Memory management: library vs. kernel: malloc (and friends) are implemented as either: if you've just freed space enough to return, you return it (no kernel involved) if it's small: brk() system call. - extends the top of the heap. if it's large: mmap() will grab pages at a time. C refresher exercise and homework (due thursday). -- Lecture 2. 9/3 refresher bit. scores on-line. F+2*2 do { functioncall(); otherfunction() } while(0) a a a a b b b b c c c c d e - - f f f f g g g g h h h h 0 1 2 3 4 find() return( (*(int *)elt == *(int *)storage ) ? 0 : 1); return( (*(int *)elt != *(int *)storage ) ); extern keyword -- 1. if you're not defining the function, declares it as being defined somewhere else. (useful) 2. if you are defining the function, for style, tells the reader that you know you don't want to make it static... that it is actually called externally. int y; int *x = &y; printf("%x %x\n", x, x+1); 12 16 printf("%x %x\n", x, ((char *)x)+1); 12 13 what's the type of the literal: const char *p = "hello" p[1] = 'i'; // badness. p[2] = '\0'; // because you're not supposed to mess with a const. #include main() { char *p = "hello"; const char q[] = "hello"; printf("%d %d %d %d\n", sizeof(p), sizeof(q), sizeof("hello"), sizeof(*p)); printf("%x %x\n", p, q); p[0] = 'H'; printf("%s\n", p); } /usr/include/sys/syscall.h the big list. virtual memory (not the same as paging) service that involves allowing the applications/processes to use addresses that aren't necessarily (likely) the same as the physical address. - then enables taking the physical pages and putting them on disk instead of keeping them in memory. having the illusion of larger memory. (paging) virtual machine what's similar? - still seeing an illusion. - memory thought to be physical is likely virtual again. - appearance of isolation what's the difference between the interface made available by an operating system and the interface exported by a virtual machine? - that of the appearance of hardware. - operating system steps in whenever the app generates a trap - virtual machine will step in whenever there's an "illegal" instruction that must be virtualized. microkernel - extreme point in a design space: - its features get incorporated elsewhere. - itself provides very little. the minimum. move all other functionality to special processes. advantage: might be runtime extensible (though loadable modules provide similar benefit) - faults are contained - one piece of kernel functionality being buggy won't break other parts. disadvantages: - cost to switching privledge levels. - these processes have to communicate with each other: message passing. (zero-copy things) interrupt asynchronous notification from the hardware. a device, associated with some number, can tell the processor that it's ready. timer can generate an interrupt, cause the operating system to take control in that interrupt handler, ultimately to invoke the scheduler and start running another process. processor will generate traps based on software events... you can consider the system call invocation of an interrupt as a trap. Lecture 3; homework return. hooray. the operating system is just a piece of code; -> all the things that it manages are just data structures the only thing that might be "magic" are the strange interactions with the processor: - special registers that the processor what to do on certain events (e.g. where the interrupt table is) process program (code), address space, context (thread(s)), resources example resources: devices that are open context: set of registers that represent where the running instance is: instruction pointer, all the other registers. "running instance of a program" data structure: Process Control Block (PCB) fun reading: kernel-source-2.6.8/include/linux/sched.h - instruction pointer - registers - memory management info (address space) - set of open files (file descriptor table) - process id - identifies the process - process id of the parent - parent cares about your fate - user id that owns the process. - device the process is blocked on. - exit status - resource limits - threads, memory, cpu time, file table size. - priority: - static stuff (e.g., what happens when you run "nice") (nice doesn't make the background task completely unobtrusive because this affects only cpu (because of memory limitations) - can go the other direction for stuff that is super important. - stuff that the os knows is important somehow ( - can also be a dynamic property - user input - generalize to having recently received input / stopped waiting for a device; database or web server process is probably waiting on either: disk to return a block or a request to come in. - anything that has been waiting gets bumped up in priority - anything that consumed its entire time quantum last time gets bumped down. -> goal is to reduce perceived response time and to get a mix of active processes. i/o bound and cpu bound at the same time. process state machine running, (on the proc) ready, (ready to go onto the processor, from being put to sleep by the OS or from being awakened by the device) waiting, (need input) stopped (need SIGCONT) ^Z to suspend in the background terminated (zombie) (not considering non-existent) how can a process leave the running state? pre-empted by the scheduler. timer fires, process has used the quantum, another process is "ready". done. calls exit. (or exit_group() which is a distinction that doesn't matter) sleep()/nanosleep() wait for some amount of time to pass. accidental death through memory fault, divide by zero, unhandled exception, killed by another process act of god (sysadmin) blocked on a device could be on a page fault waiting for the disk to give us the page of memory or less obscure (kbd, disk, etc.) three granularities of process scheduling: long term: which processes can start. (more associated with batch processing) medium term: which processes can occupy memory (or be swapped out) short term: which of the runnable processes will get the cpu. (more associated with time sharing) dispatcher vs. scheduler scheduler is the brains of the operation. dispatcher is the brawn: does the task of switch. policy / mechanism separation os people reinventing modularity design an interface so that: mechanism can be made fast. (simple instructions to follow) (implemented where readability does not matter) policy can be made smart. (configurable) (modifiable) the first user process geekos: shell.exe unix: init launches daemons what makes a process a daemon? in the background, waiting for something to do having no association with a tty; spawns getty (something that grabs tty's, then spanws login to presents login/password prompts on associated devices) tty is a representation of a console with display/keyboard init also has responsibility to wait() to collect the exit status of orphan zombies. ---- Homework 2 posted. Due tuesday. Project 0 due thursday. Project 1 due oct 1 (I think) not yet official/posted. Homework 1 to be returned in section (if you care) grades on grades.cs; generally high scores. pieces of the operation of the shell the system() function. is system a system call? no. system() takes a command (with the full line and everything) to run, and runs it with sh, and returns the exit status of the new process when the it is done. how might you implement this function? exec() replaces the current program with a new one. (keeps the pid, replaces the address space, starts at the top.) (there is an exec "family" of functions) if something goes wrong: -1 fork() takes the current process and "duplicates" it into one "parent" (that retains its pid) and a child (that gets a new one) returns for the parent: > 0 (the child's pid) returns for the child: 0 when something goes wrong: -1 wait() set of functions (we'll use waitpid) will wait for a child to finish, and collect the exit status. int system(const char *command) { pid_t child = fork(); if(child == 0) { /* we are the child */ const char *arguments[4]; arguments[0] = "sh"; arguments[1] = "-c"; arguments[2] = command; arguments[3] = NULL; execv("/bin/sh", arguments); /* should replace the program here */ /* could put error checking code in here... */ exit(EXIT_FAILURE); /* if ever we got here, we should be dead. */ } else if(child > 0) { /* we are the parent */ /* wait until the child is done and */ /* collect the exit status from the child. */ int exit_status; waitpid(child, &exit_status, NULL); /* block */ return exit_status; } else { // < 0 /* we are screwed */ return 127; /* dunno quite why */ } } int do_you_really_understand() { int a = 4; pid_t child = fork(); if(child == 0) { a = a+1; /* we are the child */ printf("%d\n", a); } else { printf("%d\n", a); } } now that we can isolate processes, we have to allow processes to communicate. tasks that are accomplished through coordination between processes. - e.g., daemons. terminal/x programs. clipboard. multi-process web server. databases anything at least as big as mysql (and possibly others slightly smaller) that require a separate process to handle the db. implementation of "man" is something like "nroff -Tman system.3 | less" (coordination between the text formatter and the pager) why not implement the entire task inside a single process? in some cases, you could take the processes and run them on different machines. (requires one of a few interprocess communication mechanisms) isolation (different from modularity) services provided by privledged daemon to unpriv. clients. two processes have different uids; potential concurrency: individual processes can block at different times; counterargument: could use threads instead. users can compose these processes at runtime. software engineering benefits of breaking up sub-tasks (not that important here) different people can implement different processes, wire them together way after compile time. What are our options for communicating across processes? sockets: unix domain sockets (internal to the machine) ip sockets (look pretty much the same) expect two-way communication. may have message framing. multicast/broadcast functions, explicit destinations, etc. shared file. (almost always works unless order matters. or multiple writers.) shared memory. segment of memory mapped into the address space of more than one process. not necessarily at the same virtual address. something might be there already. (I think you can rely on them being page aligned) arguments/environment on exec() of a child (not quite counting here) anonymous pipe. pipe metaphor implies: 2 ends. one way flow. "descriptor pair for interprocess communication" pipe() - provides both file descriptors. fork() - gets those file descriptors to both processes. nroff | less popen() named pipes. "named" part puts the pipe in the file systems example of when it makes sense to use a named pipe is question 8 of homework 2. typically, you'd rather use a socket on a well-known port instead of a pipe at a somewhat well-known name. signals. lots of signals that stop short of -9. sigpipe. locks: semaphores, monitors, ---- Homework 2 please. Homework 3 on the web page. Midterm: Tentatively October 20, unless there is significant revolt. Thursday. Sorry, I can't cancel class. Jeff will hurt me. I will not be here tuesday; may trade with the wednesday section. (or may skip class)... I'll know on thursday. PA1 later today or tomorrow morning. Plan: shm calls, popen implementation, socket interface calls. Shared memory: a region of memory (segment) that is mapped into the (virtual) address space of two or more processes for storing data. (text segments can be "shared" across processes, but since they're not (really) writeable, no one notices and that's not called shared memory) shmget - allocate or fetch a shared memory region - if it's not already there, it allocates shmctl - sets the permissions associated with the memory region - don't want just anyone writing to our memory. - also involved in freeing the shared memory region. shmat - attach (like open) takes a hint of where in the address space to place it. shmdt - detach (like close) To pass data from process P1 to process P2 using shared memory: write it. tell him: signal (valid, probably a bad idea, since it interrupts the receiver) sockets pipes locks (e.g. semaphore) etc, (but we were trying to avoid at least some of that) poll - keep checking a value (like a sequence number) incremented by the sender and checked by the receiver. poll() the system call allows the caller to check on the ready status of several file descriptors (similar to select()) acknowledgement: same thing in reverse. popen(). int popen_w(const char *command) { // not the real popen /* step one: get the pipe */ int fds[2]; pid_t child; pipe(fds); /* two file descriptors: the read end, other write */ child = fork(); if(child == 0) { /* we are the child */ /* replace stdin, because we're doing the write */ close(STDIN_FILENO); /* 0 */ dup2(fds[0], STDIN_FILENO); close(fds[1]); /* book says important. guess so. */ const char *arguments[4]; arguments[0] = "sh"; arguments[1] = "-c"; arguments[2] = command; arguments[3] = NULL; execv("/bin/sh", arguments); /* should replace the program here */ /* could put error checking code in here... */ exit(EXIT_FAILURE); /* if ever we got here, we should be dead. */ } else if(child > 0) { /* we are the parent */ /* wait until the child is done and */ /* collect the exit status from the child. */ // int exit_status; // waitpid(child, &exit_status, NULL); /* block */ close(fds[0]); return fds[1]; } else { // < 0 /* we are screwed */ return -1; } } Socket calls. Work for INET (IP) sockets, UNIX sockets, probably many more (that no one cares about) socket (like open) two options for the type of communication: PF_INET, SOCK_STREAM -> TCP PF_INET, SOCK_DGRAM -> UDP socket(PF_INET, SOCK_xxx, 0) socket(PF_UNIX, SOCK_xxx, 0) I can't think of a case when a datagram would be chopped by the system and it would be okay to receive it in pieces. (Streams allow reading smaller (or larger) chunks than sent) what is the return value of socket()? a file descriptor. bind - endpoint of the socket will be fixed (UDP port, or TCP port optionally requested, definitely assigned) (in UNIX sockets, it's identified by a path in the file system) sockaddr_storage (the generic) sockaddr_in (for ip) sockaddr_un (for unix) listen - puts the socket in a state where it can, in future, accept() connections. accept - actually fetches the new connection. - input parameter (existing, listen-ing fd) - return value: new file descriptor for the new connection. connect - given the address of an endpoint that has called bind(), connect the socket. can you call connect() after bind()? (on the same socket) - bind fixes the local endpoint, connect fixes the remote endpoint. you can fix both. - examples where you'd want to: peer-to-peer application (bittorrent) you might want your connection requests to come obviously from the same process on the same machine. (maybe) if you want the application to convince others that it's running with priviledge (IP ports under 1024 are reserved for system processes (running as root)). can you call bind after connect()? (on the same socket) - no, when connect()ing, the bind() is implicit. read / recv / recvmsg write / send / sendto close shutdown allows you to half-close; tell the kernel that you will write no more or read no more. can be used to indicate the end of a request. SHUT_WR argument: no more writing causes the other side to read EOF. SHUT_RD argument: no more reading why bother? how would the other side learn that it is ignored? SHUT_RDWR argument: no more reading or writing, but somehow not closed. select / poll we went over last time: for all the sockets that are open or that you're interested in, which ones are "ready" for reading - read wouldn't block. - something in the buffer. or "ready" for writing - write wouldn't block. - buffer isn't full (or in an exception state) could treat it: fcntl allows you to set non-blocking options on all files) useful but not really here gethostbyname (not a system call) Agenda: Homework notes - if there's a yes or no question, answer yes or no in the first sentence. the answer is almost never "both" or "it depends" - many questions from the book, digesting the information is distinct from paraphrasing the book. Project stuff Finish sockets (check) Finish processes process defined: running instance of a program things that make up a process: code, address space, context(s), resources what state does the kernel keep about a process (in the PCB)? execution context(s): registers (especially when it's not running) pid ppid user it's running for device queues it's on. priority address space open files. Start threads what does "thread" mean: a context within a process, having its own stack. why threads consume less resources (somewhat) share address space EXCEPT the stack. share fd tables (don't just get a copy of the parent's as in fork()) share other kernel resources. inter-thread communication is cheap. creation and destruction are (somewhat) cheaper threads provide (and processes do to) use of many cores - aside from "processor affinity"; the idea that a process is best kept on one processor so that you don't thrash in the caches how can use threads to segregate responsive parts of the app (UI) from the hard-working parts of the application. what kinds windows-16 model java model user-level model can implement threads at the user level: get to do your own scheduling. need the ability to allocate space for stacks need the ability to store and restore a context getcontext/setcontext or setjmp/longjmp don't have to cross user/kernel boundary. (good) don't get to use many processors/cores. (bad) ultimately, system calls have to be made non-blocking (or block the whole process) pthread model interactions with signals scheduler activations -- HW4 Due Oct 6. Threads models matching user-level stuff to kernel level threads many to one - "green" threads, gnu pth one-to-one - typical pthreads (as in the assignment) many to many - more user level threads on top of some kernel level threads. why? -- expect that the resource requirements for a kernel thread are high (prohibitive to run as many kernel threads as you have concurrent ideas) why not many-to-one instead? -- multiple processors/cores. want enough kernel threads to keep all the processors busy. (get enough kernel threads to keep the processors busy, but not as many as there are potential user threads) two ways of achieving this goal: thread pool. (this is what you'll see) threads created in advance or recycled for new requests. decide how big the pool is only makes sense when thread creation is at least a little expensive might limit the size of the pool to avoid thrashing (oversubscribing another resource into performing more poorly than it would with fewer requests) a design like a "free list" scheduler activations. (this is something my advisor worked on) imagine solving the problem of extracting concurrency out of programs from the bottom up perspective of the kernel's processor scheduler rather than from the top down software engineering perspective of writing many concurrent threads. change the interface so that the process has the illusion being scheduled on a virtual core; when the kernel finds an idle processor (needs more concurrency out of the process), it gives the user-level scheduler a new "virtual processor" on which to schedule a user-level thread. thread stacks... https://developer.mozilla.org/en/Using_Addresses_Of_Automatic_Variables_With_NSPR CPU Scheduling Piece of code in the kernel that decides what runs next. Why should it be smart? utilization (keep the processor busy) prioritize important things important: low-level "priority" "nice" value high-level things the user cares about: interactive things. what's interactive? uses IO. If it returns from being blocked on IO, it might be "interactive" (unless it's a virus scanner, backup process, ...) metrics you could consider. turnaround time: execution plus waiting time. waiting time: time spent waiting for the processor. response time: time to first results. Strawman scheduling scheme FCFS. first thing to arrive gets to run to completion, then the next one to arrive runs. what's wrong with it? if the first thing that arrives takes much longer than the rest, then the interactive jobs (those with short run times) have to spend much more time waiting... relative to their run times, they're slowed down a lot. SJF. Shortest job first. assumes magic. "an oracle" - or users to promise an upper bound (for batch scheduling) - or a prediction of future job length (for scheduling time slices) advantage: reduce the amount of time short jobs have to wait for longer jobs to get out of the way. long jobs might starve, even though they're also important. long job might not be that much larger. "Fair"? maybe. depends. Multi-level feedback queue: reasonable balance between ease of explanation and usefulness. quantum: the time slice allocated to a process. between 1ms and 100ms. may even vary by process (permitted by this scheme). a few queues, processes that have short bursts of computation in the "top" queue. These are implicitly prioritized because they don't use their quantum. if a process exhausts its quantum, it descends to the next level. not as highly priority. at this lower level, may increase the length of the quantum, but the queue will be serviced less often. if a process doesn't exhaust its quantum before blocking on IO, wakes up at a level above. effect: IO bound processes have better response time. CPU bound processes keep the processor busy. lengthening the quantum for cpu bound processes: when they do get on the processor, they don't have to start up again and again; potentially a 40ms quantum would be more productive than 4 10ms quanta. current geekos scheduler: single queue. round robin-ish. except for Find_Best, which picks the highest priority process always. (except that I don't think priority is used.) how might you alter a scheduler of a server system (as opposed to a desktop system)? on a server, response time is less important. turnaround time a little less important. throughput is more important. (might involve reducing response time... but perhaps not.) how might you alter the scheduler on linuxlab? cs412013 processes first. schedule student processes at somewhat lower priority (protect the system) deadline-aware scheduling: cs412 over cs417 depending on week. we'll answer this better on tuesday. === 9/29 how might you alter the scheduler on linuxlab? cpu time quotas -- a limit on the amount of resources you could use as one person at a time. - could use limit or ulimit (depending on the shell) to ensure that your processes don't use more than some limit on resources. - doesn't actually require modifying the scheduler, but need to add a field to the PCB describing the limit. - if we have runaway processes on linuxlab, I might add calls to setrlimit / limit to the makefile... make the quantum small - permit more fine-grained sharing - big quantum -> less switching, more time making progress. - small quantum -> less wait for others to finish their quanta, more progress. first, schedule users, then processes within those users. - perhaps: one multi-level per user, round robin across users. (digress a bit into definitions of "time" in an OS) wall-clock time is time you're running plus time you're not. - if you were watching the clock on the wall distinction to user time, system time, - user time - time spent for that process at user level - system time - approximation time spent for that process in the kernel. - if almost all user time, probably little i/o, little other system-intensive operations... probably all math. - if almost all "system" time, probably the reverse... defrag. index. big database stuff. what is/are the goal(s) of cpu scheduling? max throughput (get stuff done) max utilization (keep the processor busy) max responsiveness (get short things done quickly, long tasks can wait if needed) follow the user's policy (run high-priority stuff with priority) (remember policy vs. mechanism separation) how does the scheduler achieve those goals? some set of policies / queueing schemes (FCFS, SJF, multi-level...) directives from above to follow keeps track of the prior behavior of the program as a prediction of what it will do in the future. - cpu burst time. whether the process has been using the whole quantum it has been given. whether we should think it will give up the processor quickly if the process is given the cpu. - might be encoded in the process's dynamic priority. Fig 5.12 pg 207 dynamically "high" priority (e.g., 59) -> - small quantum (20ms) - if you go to sleep, when you awaken you'll have high priority (59) - if you exceed the quantum -> priority 49 "low" priority (eg 10) - big quantum (160ms) - if you go to sleep, when you awaken you'll have high priority (50) - if you exceed the quantum -> priority 0 what considerations affect scheduling multiple processors? max throughput (get stuff done) max utilization (keep the processor busy) max responsiveness (get short things done quickly, long tasks can wait if needed) follow the user's policy (run high-priority stuff with priority) (remember policy vs. mechanism separation) * and * processor affinity a process would be happier not moving. keep the cache warm. pretty soon we'll have to worry more about synchronization because you might want to have two processors both in kernel code at the same time, or two processors running two threads of the same process at the same time. (not quite a scheduling problem) (also deadlock) SMT: insight, implementation simultaneous multi-threading put thread contexts into the processor so that they all "run" at the same time. try to ensure that the processor has stuff to do while during the memory stall (loads from memory) implementation of this is largely to appear as a separate "core" (as another processor) to the kernel / scheduler / etc. kinda like scheduler activations (interface for expressing the ability to run another thread simultaneously is of providing another "processor") little's law in steady state (arrivals \approx departures) : number of things in the queue = service rate \times wait time Synchronization: java has a "synchronized" keyword: implements monitors (friendly synchronization scheme) semaphores - binary and counting reader/writer locks (shared and exclusive) what problem do synchronization operations solve? race conditions. book: "when the outcome of execution depends on the particular order in which the access takes place" a flaw in which timing (luck) determines whether the outcome maintains system invariants (the consistency of your data structures). imagine the kernel switching contexts at the absolute worst possible time. digression: ACID (from databases / transaction processing) atomicity consistency isolation <- what locks will provide. durability (don't solve deadlock) why do/should we care about synchronization? (in OS; this isn't a databases class...) - we already talked about shared memory. if you wanted to keep data structures in shared memory consistent, you'd have to help applications guard access. - is it possible to implement synchronization operations at the user level? need an atomic operation. (test-and-set-lock, or an atomic swap (xchg)). if you don't get the lock, you'd like to rest while the other process finishes its critical section... for that you might want kernel help. - kernel might want to run on multiple processors at once. if you were to take current geekos or very old linux, how well would they run on a multiple processor machine. "one big lock" on all kernel data structures. disable_interrupts() call. Goals: 1) don't break data structures 2) run lots of stuff at once, subject to (1). Next time: critical section problem, peterson's solution, bakery algorithm. --- This time. PA1 due tonight. Minor C tricks. (comparisons involving unsigned, pointer/array differences, volatile keyword) volatile - stores go to memory, loads go to memory. presumably for globals. don't optimize out loads-after-stores. don't optimize it into a register (leave it in memory) contrast "register" keyword, which is considered only a hint. A few PA1 questions. Wait on not your child shouldn't really have to exit with -1 specifically if it was going to exit with -12, that's probably better. (as long as it's < 0) block device queues shell exits on getkey fail (print anything if you want. or don't.) use grep. find . -name "*.c" | xargs grep hmmm Reentrant and signal handlers. what you should and should not call from within a signal handler. exist functions not intended to be called twice at the same time. they have shared internal state. if unlucky and interrupted in the middle of execution, another invocation could clobber state. NOT reentrant: strtok interface is (a) clobber input by replacing token delimiter characters with '\0' (end of string character) (b) return the beginning of the next token (a pointer to the string). (c) *internally store* the beginning of the next token. ALSO: gethostbyname function that, given a hostname, returns ip addresses (and some other information); allocates its own buffer for the result and returns a pointer to that buffer. nice: don't have to allocate a buffer for it. crummy: if another thread were to call it, state goes poof. (even if not called simultaneously). use getaddrinfo instead. System calls that return an error, may leave (global) errno value in an unhappy state. Theme: don't store intermediate results as global variables internal to your interface. Goal : protect data structures from bad effects of concurrent access. : make forward progress. Critical section problem defintion. (a) only one process can be in its critical section at a time. (b) "progress" any process not in its critical section can't block those that want to be in. (c) "bounded waiting" can't wait forever (d) can't assume anything about relative CPU speed. Simple scheme: disable interrupts (as in GeekOS) why is this not general: can't let applications do it, lest they hose the machine. in the kernel, it is coarse-grained: you might only have to disable related interrupts and not "more important" interrupts (if such things are available) screwed if multi-core / multi-processor. (doesn't inhibit concurrency, unless your process is pinned to just one cpu (and then, what's the pont)) Simple (ineffective, useless) scheme #2: "lock" variable. if lock is 1, it's held by someone. if lock is 0, it's available. to lock: while(lock!=0); lock = 1; to unlock: lock = 0; what can fail? Thread A Thread B top: load lock bne lock, 0, top load lock beq lock, 0 store lock, 1 store lock, 1 without any help from the processor, both threads "get" the lock. Nifty version: strict alternation when thread A is done with its section, tells thread B it can be his. initial state: turn = 'A' Thread A lock: while (turn != 'A'); unlock: turn = 'B' Thread B lock: while (turn != 'B'); unlock: turn = 'A' busy waiting issue (which we'd prefer not to have). if we released, we have to wait for the other thread to acquire and give up before it's our turn again. Peterson's solution: me, him = ( 0, 1 ) depending on who i am. shared state: int turn bool flag[2]; to lock / enter critical section: flag[me] = true; turn = him; while(turn == him && flag[him]); to unlock / leave critical section: flag[me] = false Thread A (0) Thread B (1) flag[0] = 1 flag[1] = 1 % turn = 1 % turn = 0 turn == 1 && flag[1]? - should fail, enter critical section turn == 0 && flag[0]? should succeed. loop. [ in critical section ] flag[0] = 0 to unlock/leave turn remains 0, but we leave the loop because of the flag ---- 10 / 6 Project 2 posted BEWARE KATE. (and its lack of newlines at the end of the file.) add newlines at the end of the file. Better yet: use a proper editor. (emacs, vim...) #include #include struct foo { char name[128]; // 128 characters in the structure } f; /* vs */ struct bar { char *name; // a pointer to a character in the structure } b; int main() { printf("foo: %p == %p\n", &f, f.name); printf("bar: %p != %p == ??\n", &b, b.name); exit(EXIT_SUCCESS); } #include #include int main() { unsigned int f = -1; int j = -1; unsigned short us; if (f<0) { printf("hello\n"); } if (f==-1) { printf("promoted?\n"); } if(j==-1) { printf("this would be printed\n"); } us = j; if(us == -1) { /* compiler will tell you this can never happen */ printf("whee\n"); } exit(EXIT_SUCCESS); } Bakery. Hardware (tsl, xchg) Semaphores: interface and implementation Similar scheme: Bakery algorithm. (use wikipedia) analogy to a "bakery" - take-a-number style ordering. // declaration and initial values of global variables Entering: array [1..N] of bool = {false}; Number: array [1..N] of integer = {0}; 1 lock(integer i) { 2 Entering[i] = true; 3 Number[i] = 1 + max(Number[1], ..., Number[N]); 4 Entering[i] = false; 5 for (j = 1; j <= N; j++) { 6 // Wait until thread j receives its number: 7 while (Entering[j]) { /* nothing */ } 8 // Wait until all threads with smaller numbers or with the same 9 // number, but with higher priority, finish their work: 10 while ((Number[j] != 0) && ((Number[j], j) < (Number[i], i))) { 11 /* nothing */ 12 } 13 } 14 } 15 16 unlock(integer i) { 17 Number[i] = 0; 18 } 19 20 Thread(integer i) { 21 while (true) { 22 lock(i); 23 // The critical section goes here... 24 unlock(i); 25 // non-critical section... 26 } 27 } Hardware instructions: test-and-set (tsl) takes a memory address and a register as an argument atomically: (by locking the memory bus) register <- *memory *memory <- 1 lock: do { test-and-set(&lock, register) } while(register == 1) unlock: *memory <- 0 swap or exchange (xchg) takes a memory address and a register as an argument atomically: (by locking the memory bus) tmp <- *memory *memory <- register register <- tmp lock: register = true while(register) Swap(&lock, register) unlock: lock = false what are the downsides: (a) busy-waiting. both cases, we're stuck on a while loop. not necessarily bad, if you have reason to believe that the wait will be very short. -- if the time it will take to busy wait is less than the time it will take to set up the data structures to block, then yield (context switch), you might as well keep spinning. (b) a thread that just gave up the lock could reaquire it before a waiting thread can get it. (doesn't solve bounded waiting) Spinlock (also taken from wikipedia. :) lock: # The lock variable. 1 = locked, 0 = unlocked. dd 0 spin_lock: mov eax, 1 # Set the EAX register to 1. loop: xchg eax, [lock] # Atomically swap the EAX register with # the lock variable. # This will always store 1 to the lock, leaving # previous value in the EAX register. test eax, eax # Test EAX with itself. Among other things, this will # set the processor's Zero Flag if EAX is 0. # If EAX is 0, then the lock was unlocked and # we just locked it. # Otherwise, EAX is 1 and we didn't acquire the lock. jnz loop # Jump back to the XCHG instruction if the Zero Flag is # not set, the lock was locked, and we need to spin. ret # The lock has been acquired, return to the calling # function. spin_unlock: mov eax, 0 # Set the EAX register to 0. xchg eax, [lock] # Atomically swap the EAX register with # the lock variable. ret # The lock has been released. Semaphores: Two operations on a semaphore: wait, signal <- let's pick this terminology for now. decrement, increments down, up proberen, verhogen (dutch inventors) (P, V) Semaphore has a value, counter (binary semaphores as a special case) (when I say semaphore, I mean counting semaphore) counter cannot become negative (can have many waiting threads, but the counter doesn't become negative. counter can become positive and even large, typically to represent the availability of resources. spin-waiting implementation of semaphores: signal: counter++; wait: atomically: while(counter <= 0) {} while(not_done) { counter--; if(counter > 0) { atomically: not_done = false; counter-- } } /* this will be wrong. it will not compile. you will have to fix it. */ wait: (without busy waiting, but intended for synchronizing processes) Disable_Interrupts() (so that our operations are relatively atomic...) (probably not necessary since we're probably in a system call and they're already disabled...) if(counter > 0) { counter--; // we have the lock! hooray. Enable_Interrupts(); return; } else { /* counter <= 0 */ /* put ourselves on a thread queue that represents all the threads wait()ing on this semaphore */ Add_To_End_Of_Thread_Queue( semaphore -> waiting_queue, g_currentThread ); Enable_Interrupts(); /* maybe */ Schedule(); /* maybe */ } signal: /* could make a decision about whether to give up the cpu to the signaled process based on priority */ /* make sure that interrupts are disabled */ if(counter == 0) { /* we might have someone waiting */ if(there's_something_on_the_queue) { thread_to_wake = Pop(semaphore->waiting_queue); /* or Wake_Up_One ?? */ /* do we yield here to the other process or not? */ } else { counter++; ?? } } else { /* nothing waits. */ counter++; ?? } bounded buffer / producer-consumer problem. classic motivation for counting semaphores. two processes: one that produces data, one that consumes it. the producer may run faster than the consumer. if we don't watch out, the producer might format/generate so much data that we run out of memory. or the consumer gets too far behind. or the producer uses more than its share of cpu. between these two processes, might imagine something like a pipe. pipe has a buffer. that buffer has a capacity. goal: block the producer when the buffer is full. block the consumer when the buffer is empty. implemeted using two^H^H^Hthree semaphores: one to block the producer when he's filled all the buffers. one to block the consumer when he's consumed all the buffers. one to protect the buffer from concurrent access. (there are likely some pointers involved.) initial condition: consumer should be blocked. Semaphore_of_full_buffers.counter = 0; producer should be able to run. Semaphore_of_empty_buffers.counter = number_of_allocated_buffers. producer: i = produce_item(); /* need not be in any critical section */ wait(Semaphore_of_empty_buffers) wait(buffer_mutex) insert(i) into buffer signal(buffer_mutex) signal(Semaphore_of_full_buffers) goto producer; (loop) it is almost never the case that a lock will make sure that it's "owner" is the one who calls unlock(). be very afraid. if you believe that only the owner should unlock, write some code to verify that. consumer: wait(Semaphore_of_full_buffers) wait(buffer_mutex) i = get_from_buffer() signal(buffer_mutex) signal(Semaphore_of_empty_buffers) consume(i) goto consumer; (loop) -- HW5, the cpu scheduler. (it's fun, really it is.) posted. due oct 22. (it has to be due sometime.) Reader-writer locks: want to support concurrent reads, exclusive writes. readers to have shared access. writers to have exclusive access. hope that "readers" can be done quickly, or in bounded time. in transaction-land, a query will grab the read lock and hold it until it's done with the transaction (which could be a while if it's waiting for other locks) options: 1 readers can't be stopped by a waiting writer. 2 writers can't be stopped by a continuous flow of readers. implement this interface using semaphores; choose option 1. S_writing_mutex -- to be held by the writer, or (collectively) by the readers S_reader_mutex -- to be held by the readers when updating the count of active readers num_active_readers -- the number of readers that concurrently (collectively) hold the lock. as a writer: wait(S_writing_mutex) perform the write. signal(S_writing_mutex) as a reader: // don't want to do anything if a writer holds S_writing // if there are active readers, we're clear to proceed. // existence of an active reader -> readers are permitted. // "pretend" that all readers are one writer. wait(S_reader_mutex); num_active_readers++; if(num_active_readers==1) wait(S_writing_mutex); signal(S_reader_mutex); perform our read wait(S_reader_mutex); num_active_readers--; if(num_active_readers==0) signal(S_writing_mutex); signal(S_reader_mutex); to alter the scheme for option 2: have to queue/block readers who want to read if there is a writer waiting. try to maintain a counter of blocked writers or add a semaphore that will hold new readers back if a writer holds it. exercise. hooray. Condition variables very similar interface to semaphores: wait (just like a semaphore) signal (like a semaphore, except no counting; no effect if none are waiting) signal-all (another way of implementing signal that might require threads to go back to sleep if they didn't "get" what they waited for.) Monitors (closely tied to condition variables, but epsilon more advanced) protect a set of procedures and objects. an "object" can be guarded by a monitor: only one thread can be "active" in the monitor-guarded functions at a time. first-cut mental model: class foo synchronized { function a { ... } function b { ... } } compiler adds synchronization bits to these functions: class foo { function a { semaphore_wait(this->mutex) ... semaphore_signal(this->mutex) } function b { semaphore_wait(this->mutex) ... semaphore_signal(this->mutex) } } remember, that's a white lie. recall: only one thread can be "active" in the monitor-guarded functions at a time. role of condition variables in the context of monitors is to allow a thread inside one of these functions to block, and later, be reawakened by another one of these functions. an example of the use of condition variables in monitors: rehash the bounded buffer example. monitor OurBoundedBuffer { condition full, empty int count_of_buffers_used; int N; // total number of buffers in the buffer. insert(item) { // use "while" instead of "if" just in case some other // thread might sneak in after we've been signaled (made // ready to run) and take our buffer. while(count_of_buffers_used == N) wait(full); buffer_of_things.push item count_of_buffers_used ++; // conditional here is probably entirely optional, // aside from maybe performance if(count_of_buffers_used == 1) signal(empty) } remove() { while(count_of_buffers_used == 0) wait(empty); item = buffer_of_things.pop count_of_buffers_used --; // conditional sort of redundant. if(count_of_buffers_used == N-1) signal(full) return(item); } } producer: while(not_done) stuff = make_stuff OurBoundedBuffer.insert(stuff) consumer: while(not_done) stuff = OurBoundedBuffer.remove consume_stuff(stuff) A semaphore-based implementation of monitors: (reducing the lie-ness) Procedure() { semaphore_wait(mutex) body of the procedure if(next_count > 0) // count of things that are waiting on a semaphore // because they were signaled but can't take the // mutex. semaphore_signal(next) else semaphore_signal(mutex) } condition variable x: x.wait(): // precondition: "we" hold mutex. x_count++ // the queue will be kept by the sem. // have to allow another procedure to enter, so that we can be awakened. if(next_count > 0) // count of things that are waiting on a semaphore // because they were signaled but can't take the // mutex. semaphore_signal(next) else semaphore_signal(mutex) semaphore_wait(x_sem) // should go to sleep. x_count--; x.signal(): if(x_count>0) { next_count++; /* increment the count of threads that are "in" the monitor, but waiting to get back */ semaphore_signal(x_sem); semaphore_wait(next); next_count--; } else { /* nothing */ } Review Semaphores, Monitors. Using semaphores to implement monitors. java's scheme not *technically* a monitor unless "synchronized" covers all the methods of the class. Producer / consumer using semaphores. Semaphores represent the remaining resources explicitly Monitors use condition variables to unblock the other side. Upcoming Deadlock (intro) Dining philosophers Transactions. Deadlock: one guy waits for another guy who is actually waiting for the first guy and nothing happens. could have more than two... more operating systems-y: they're all waiting on resources, some have the resources the others need to finish. - maybe not providing the output that another needs as input - maybe holding onto a lock on a file FOUR CONDITIONS OF DEADLOCK: have all four, deadlock is possible. avoid/break any one: no more deadlock. 1. mutual exclusion - can't share the resource. if you could share the resource, no one would have to wait. 2. hold and wait - when you're acquiring a set of resources, you get one (some) at a time; collect them incrementally, holding onto the ones you started with. 3. no preemption - preemption: the OS can yank your resource away. might consider a transaction abort as a form of preemption. no preemption - no preemption. 4. circular wait - there's a cycle in the "waits-for" graph (waits-for graph means that there's an edge from a node (process) that waits for the incident node (process)) imagine if you ran popen("bc", "r+"); if you screwed up, you could call read() when you shoudn't have... i.e., when bc is waiting for another line from you... you're waiting on bc, bc's waiting on you.... deadlock. Remedies: Deadlock prevention - don't do the stuff that leads to danger. (break one of the four conditions by design) Deadlock avoidance - gain enough information about the processes so that they block potentially earlier than they would otherwise. avoid an unsafe state. Deadlock detection - if there is a cycle, break it. (if we detect it, then we'll kill a process, reclaim its resources) Prevention: circular wait: construct an order over the resources/locks, and only acquire those resources in that order. hold and wait: limit one resource. (while blocked, not inhibiting others) (unlikely to be practical). one big lock option. two big lock option. the lock that represents at least everything that your thread needs. if another thread needs one of those things, then the lock must cover all of his needs too. (at least all the needs that are concurrent) acquire all resources at the same time. (or a virtualization of that... i.e., grab grab grab if failed on anything, let go of all, repeat until it worked.). once you have the resources, you can run to completion (and give up those resources.) mutual exclusion: somehow make the resources sharable... no preemption: to permit preemption, the resource might be a committed but unused page of memory or block on disk. (material from the future:) priority inversion... if you're a low priority process, but you hold a resource that a high priority process wants... don't want the low-prio guy to slow the high-prio guy down... solution: elevate the priority of the low guy until he gives up the lock. (this assumes that the high-priority guy can't just steal the resource away). low guy "inherits" the high priority of the blocked process. Deadlock Avoidance: (the next class of scheme) Ensure that all the threads "can" finish. Ensure the system is always in a "safe" state. safe is defined as every thread can acquire all the resources it will ever need, and then finish. there's some order where the existing threads that have all the resources they have (and may need) can finish. Applied to the simple example: P1: acquires(r1) acquire(r2) .... release(r2) release(r1) P2: acquires(r2) acquire(r1) .... release(r2) release(r1) P1's needs may be one r1, and one r2 P2's needs may be one r1, and one r2 if P2 holds r2 and P1 holds r1 -> unsafe state. if P1 holds r1 -> we won't allow P2 to get r2. P1 and P2 must tell the OS in advance of their possible needs. Doesn't require a cycle in the waits for graph. (so may be conservative) Could also be crazy conservative if may someday (after unlocking/releasing r1 and r2) want r3 as well: your needs are r1, r2, r3, BUT better expressed as ((r1 and r2) or r3) not sure what happens: P1: acquires(r1) acquire(r2) .... release(r2) release(r1) acquire(r3) P2: acquires(r2) acquire(r3) .... release(r2) release(r3) altered: (no deadlock potential here... just action by the avoidance mechanism) P1: acquires(r1) acquire(r2) .... release(r2) release(r1) acquire(r3) P2: acquires(r3) >acquire(r2) .... release(r2) release(r3) if P2 had r3, P1 must be denied r2. if P1 had r2, P2 must be denied r3. (not surprising) Could potentially achieve something similar to the fine-grained scheme by breaking a process into smaller bits. (updating the operating system when you have diminished requirements) - would benefit from reducing those needs because: our application (and maybe the rest) are more likely to be permitted execution because the OS knows there's less potential for deadlock. BANKER'S ALGORITHM (book 7.5.3) - metaphor of a banker holding money so that everyone can be satisified. Available[resource_id] --- how much the machine has Max[process_id][resource_id] --- limit, for each proc, each resource. Allocation[process_id][resource_id] --- what each proc has of each res. Remaining_Need = Max - Allocation Is_Safe_State routine Work = Available // leave available constant. forall process_id: Finish_able[process_id] = false loop: find a process_id such that ( !Finish_able[process_id] && Remaining_Need[process_id] < Work[process_id]) Finish_able[process_id] = true; Work += Allocation[process_id]; // reclaim resources. goto loop; if we find no such process: if everything's finishable -> we are safe! return true. otherwise -> not safe. Given this routine, what do you do when a resource is requested? make sure that the process isn't exceeding the need (limit it told us about) make sure that the resource is available, (else block). "pretend" that the resource is allocated. invoke is safe state if it's safe, provide. if it's not safe, block the process. === Midterm Overview Structure: ~24 MC, ~3 short, ~2 long. Do not equate Multiple Choice with easy. I know how to game MC tests, which means I can write them so they require that you know the answer. Programming assignments 0,1,2 are fair game. Understand geekos to the extent you were supposed to to complete the assignments. If I ask, "Why do X?" do not answer "Why not do something other than X?" I am never looking for an essay, but if you must write a paragraph, paragraphs have topic sentences. Answer, then support. Don't bury the lede. I'll likely give you a bit of code (to modify) if you need it. I have asked for "code" that amounts to ~3 lines. It's in my interest to give you enough scaffold so that the set of possibly correct answers is small. Will everyone finish? I'd guess 5 people will be here to the end. You know how to triage (answer the ones you know, skip the rest until later). Deadlock detection Every once in a while, see if processes are stuck. Build the waits-for graph, find a cycle in the graph. Once you find the cycle: two options: a) kill a process. for example, if two processes are both waiting for the other to write something, killing one will cause the other to read EOF. maybe this will result in something good happening. b) confiscate resources. (may not be feasible.) The semantics of semaphores in geekos mean you could acquire a semaphore, and die holding that semaphore. - have to believe that the killed process will give up whatever resources others are waiting on. - probably can't build a waits-for graph here. (might be able to if there were only mutexes, and the mutex had an owner, and when the owner died the mutex would be auto-released.) Deadlock detection follows the principle of optimizing the common case. Dining Philosophers (covered in the book, look in the index) round table, n philosophers, n forks, each fork placed between each philosopher. need two forks to eat pasta. (if you don't believe that you need two forks, there's the chopsticks and rice variant) each philosopher is either eating, thinking, or hungry. find a scheme to acquire the forks without deadlock, without starvation. strawman scheme: treat each fork as a resource, when hungry, acquire the mutex for each fork, while holding both, you're eating; release them both to think, repeat. philosopher behavior is: acquire right fork acquire left fork eat release right release left if every philosopher does this at the same time, deadlock. could alter the circular wait problem by having just one philosopher use the alternate scheme: acquire left fork (I swapped these) acquire right fork eat release right release left a scheme for "solving" this is to find a way to acquire the forks at the "same" time, potentially by putting fork acquisition in a critical section... Monitor-using solution: each philosopher in the array has a state (thinking, hungry, eating) chopsticks / forks are tracked implicitly (not as variables on their own) initial condition: all are in state thinking. test(philosopher_idx) if philosopher_idx+1 is not eating and philosopher_idx-1 is not eating and philosopher_idx is hungry then set philosopher_idx to eating. signal (wake up) philosopher_idx (someone else calls test on our behalf) end pickup() // go from state thinking to hungry (and hopefully to eating) set my_idx to hungry. test(my_idx) // then, either we're still hungry, or we're eating. if eating, return if still hungry, wait putdown() // go from state eating to state thinking. set my_idx to thinking test(my_idx-1) test(my_idx+1) might "starve" because the hungry philosopher cannot stop immediate neighbors from grabbing forks. (there's no queue.) Transactions. Two-phase locking. Ordering. (6.9.4) ACID. atomicity - either a transaction happens or it doesn't. consistency - invariants of the database are maintained. isolation - related to serializability, looks as if each transaction runs alone. durability - once committed, the effects last. Want to provide isolation: locks. Serializability: the effects of a set of transactions are equivalent to *some* sequential ordering of those transactions. individual transactions don't see the effects of partially-completed transactions. T1: read(balance) balance+=1 write(balance) T2: read(balance) balance*=2 write(balance) Serializability means the outcomes (assuming T1 and T2 commit) are either: balance = (balance + 1) * 2 balance = (balance * 2) + 1 not: balance = balance + 1 balance = balance * 2 Two-phase locking is the mechanism for ensuring that this happens. phase 1: growing phase phase 2: shrinking phase Once you release a lock, can't acquire any more. T1: lock(balance reading) read lock(balance writing) write commit T2: lock(balance reading) read lock(balance writing) write commit not two phase locking! : T1: lock(bal read) read lock(bal wri) write release(bal) lock(bal2 read) read lock(bal2 wri) write release(bal2) commit two phase ensures that the transaction holds all the locks for all the stuff it needs to read at the same time or write at the same time. In the two-phase protocol, if we're doing transactions (we have the ability to "abort" the transaction) where does the commit go? options are: before releases, after releases. if done after release, and you don't commit, anything that read stuff you had written also has to abort. if done before release, no one else can see the results of the stuff you wrote, so no cascading aborts. Midterm Review Content System Calls Interrupts Process address space Process states Contents of PCBs Scheduling: long, medium, short term Daemons fork, exec, system, dup2, popen, clone socket, bind, listen, accept, connect User-level, kernel-level threads; advantages of each. setjmp/longjmp or getcontext/setcontext context save and restore functions Thread pools, scheduler activations Scheduling approaches: FCFS, SJF, Multi-level feedback; limitations Little's law Race conditions Critical section problem Petersons solution. Bakery algorithm tsl, xchg instructions; limitations. Monitors and semaphores. Binary and counting semaphores. Bounded buffer. Reader/writer. Deadlock conditions Stuff from today. === Agenda Midterm administrative seeing grades. seeing the exam. Midterm Review Memory Stuff you already know: von Neumann architecture. (not harvard: code and data separate) code and data are in the same memory. code can be rewritten as if it were data. (protection bits that prevent writing doesn't break this architecture) Memory hierarchy. fast and large are expensive. registers (super fast, once cycle access) very few. cache (L1, L2, L3...) takes the stream of memory accesses and tries to satisfy them without asking: main memory, dram, ram, (whatever you want to call it) ... disk or other paging device... tapes ... google. etc. Goals: Protection Constrain memory references to each process's address space. Avoid fragmentation (internal and external) Fragmentation: (review) internal fragmentation: some piece allocated, larger than needed, rest is wasted. in a file system, you'll have blocks, some parts of blocks will go unused. external fragmentation: unusable space that results from allocations and deallocations of different sizes. space between allocations may be too small or too separated for a contiguous allocation. "fragmentation" in your disk utility is a combination of things where the "defragmenter" also tries to make files contiguous. Generic performance stuff. Simple base+limit Observe mem.c, userseg.c, userContext->memory, Init_Code_Segment_Descriptor Linker and Loader behavior ldd nm relocation: internal references can be relative addresses. external references must be fixed. program independent code: (alternate scheme for shared libraries) external references are indirect. Swapping distinction from paging. --- Segmentation (in the context of memory) Memory divided into regions based on function "code" segment, data segment stack segment ... the location in memory being referenced is a combination of segment and offset within the segment GeekOS segmentation: uses the hardware to put user code at 0x8000 0000 is the base address of the data segment different? no. is the base address of the stack segment different? no. Paging (definition) Separating the linear addresses in a process's address space from the physical addresses that are split into pages. linear address: split that into a page number p and an offset d p of 20 bits d of 12 bits -> 4 KB page. some sort of table will map p in the linear address to a physical page. we translate p, we can get the physical address. What are the advantages of paging: only have to commit physical memory for space that's used, not just for space that's in the space but not used. (can have a sparsely-populated address space) eventually (chapter 9) can send pages to disk. (much smaller than segments) find a less-used page, drop. can be easy to share how might you implement fork()? copy the page table, and set "copy-on-write" bit The TLB. "Translation Lookaside Buffer" Fully-associative cache that keeps recently-used bindings from linear page number to physical. Limitations on size. In the MMU. Hierarchical Paging Not-hierarchical paging would consume 4MB just for the page table, even if the process used almost none of it. Split page number (p) into page table number, followed by a page number in that table. Intel PAE - skipping the inner page table so that you can make a 4MB "page". Hashed Page table hash(p) to get an entry in the table, chase pointers to find the physical page number on a collision Inverted Page table For each physical page, we can tell what logical page is kept there Relatively small. Big limitation: difficulty sharing. --- Oct 29 PA3 due tonight. scheduler test clarification release tests are (at the moment) 1 and 3 on the spec. behavior of g_needReschedule in MakeRunnable HW6 due Nov 5. mmap exercise will be useful for PA5. Shared libraries. Why? save space: in memory. save space: on disk. reusing code... (could do that with a static library as well) ability to update (bugfix) a shared library - but MUST retain the same interfaces. Where they go. cat /proc/[pid]/maps cat /proc/[pid]/smaps How do shared libraries get "loaded" (dynamic part of linking) how does the executable know where the functions are in the library so that the calls happen correctly? How they work. PIC. procedure linkage table. Demand Paging Goal: Illusion of more phy memory than actually present. Large, unused address spaces, buffers, libraries, etc. Potential to load from disk only what is accessed. Prehistory: overlays Process itself would start, would load initialization routines, run them, and then replace that code in memory with the code it would use later. - no mmu for doing the paging. - 8" floppy would have been backing store. Memory mapping "load" stuff into memory by altering the page table to have a reference to disk. when that page is accessed, "present" bit won't be set, page fault will occur, the OS will then have to load the page into memory. (page table can have an entry) Actors: The "pager" brings pages in from disk, marks them present in the page table. The "swap daemon" which takes dirty pages and writes them to page file. "pdflush" in linux (TODO(find generic name) takes dirty pages and writes them to the file system. (for mapped files). (similar to swap daemon) -- HW6 due thursday PA4 part 1 due a week later (but get it out of the way soon). The states of a physical page unallocated, zeroed page allocated to a process, mapped into that process's page table. not dirty not referenced for epsilon time (once the instruction is restarted) not referenced, dirty -> referenced, dirty: not referenced, not dirty -> referenced, not dirty: process read from that page. (processor does this for us) referenced, dirty -> not referenced, dirty: referenced, not dirty -> not referenced, not dirty: the algorithms we'll talk about will use the bit to tell which pages are important and which ones are not, will likely clear the bit after in some (algorithm dependent) way remembering that it was set. the kernel's job. not referenced, not dirty -> not referenced, dirty: no way. (a write is a reference) not referenced, not dirty -> referenced, dirty: referenced, not dirty -> referenced, dirty: how? processor sees the write. referenced, dirty -> referenced, not dirty: take the page, write it to disk copy the page, clear the dirty bit, schedule the write might just get dirty again. could be a good idea to clean if the page is part of a memory mapped file. not referenced, dirty -> not referenced, not dirty: take the page, write it to disk copy the page, clear the dirty bit, schedule the write preferable. more likely to provide a not-referenced, not dirty clean page ready to be reclaimed. not referenced, not dirty -> reclaimed. part two of the page replacement algorithm. reclaimed -> zeroed (and ready to use) this is a task... processor takes care of these bits: dirty bit, set on a write: clean / dirty bits in the page table. referenced bit, set on read or write: referenced / not referenced bits in the page table. Page replacement algorithms (some of these will work with the referenced bit, some of them don't need it, so likely existed before the referenced bit was kept by the processor) Policy 1: FIFO. That which we paged in the longest, can be removed. Good: recently brought in stuff is probably useful. Bad: older stuff might be highly referenced. Three-page example with Page references: 1 2 1 3 4 1 3 1 2 time -> : 1 2 1 3 4 1 3 1 2 Frames: 1 |.1 1 1 .4 2 | .2 .1 1 3 | .3 3 .2 6 faults. the "which frame am I going to use next" counter. increment, mod, done. This scheme may actually do worse (have more faults) with more pages available. (given a bad sequence of page references) Policy 2: OPT. - replace the thing that will not be used again, or be used again furthest in the future. - what would we change from the fifo example? time -> : 1 2 1 3 4 1 3 1 2 Frames: 1 |.1 1 1 2 | .2 .4 .2 < 2 could go anywhere (there is no future) 3 | .3 3 OPT for this stream: 5 faults. Policy 3: LRU - least recently used - replace what has not been used for the longest time. implementation option 1: global counter (time). increment on each reference, assign that counter to the page on each reference. to find eviction candidate find the page with the lowest stored counter value. implementation option 2: pointers. (happy linked list. move to front on each access.) - "stack" algorithm idea: the set of pages in memory of size S is a subset of the pages in memory of size S+1. (so no Belady's) : 1 2 1 3 4 1 3 1 2 Frames: 1 |.1 1 1 2 | .2 .4 .2 < for LRU, 2 must go there. 3 | .3 3 Policy 3.9: FIFO + second chance - do most of fifo, except don't actually evict if the Referenced bit is set. - clear the referenced bit and place back in line for fifo eviction. (as if you had just brought it back in) Policy 3.99: Random. - there are processors/operating systems that just kick stuff out at random. Policy 4: Clock - have the referenced bits - clear or eject while advancing the hand of the clock. - can also have a hand involved in cleaning pages. - can also adjust the space between hands that will clear the bit vs evict the page. - mostly a "visually cooler" version of fifo + second chance, (subject to revision) Policies 5 and 6: LFU, MFU - periodically scan all the referenced bits, move those bits into some sort of counter. maybe you increment, maybe you shift onto an integer and count the bits in that integer - the stuff that is not often used is a better candidate for eviction. - (or you could believe the opposite: that we're done with stuff that was used extensively already. Policy 7: Working Set - anything that hasn't been referenced in time \tau (made up constant) isn't really being used, and is equivalent... anything that old can be evicted. - if \tau was chosen poorly, might be in trouble. (but you consider all pages older than \tau equivalent, so performance won't drop cliff-like, because you're likely to choose things much older than \tau as well.) (assuming the randomness is mostly random) Policy 8: Working Set + Clock. (current linux scheme) - clock advancing tells when to clean dirty pages - working set determines which clean pages to evict - and if working set fails to find a page not referenced longer than \tau ago, use clock. -- no more page replacement policies. when should you clean a dirty page? a) need the memory. might lots of read-only pages out there... e.g., code. make up a threshold of the number/fraction of dirty pages we're willing to tolerate. b) the page is a good candidate for eviction (because it hasn't been referenced) c) (potentially) the disk has some free time. ---- ==== HW6 please. PA4 part 1 due thursday. Consider: Have: read-only bit (or write-able bit), valid (present) bit. Also have: referenced bit, dirty (modified) bit. Do you need a referenced bit provided by the hardware? the processor does not set the bit for you. try to use an algorithm that does not use the referenced bit: e.g., random could: mark the page protected/absent/(anything that would cause a page fault on an access), then keep a referenced bit in extra memory. forces the kernel to take control on an access of a page that doesn't have the fake-referenced bit set. Do you need a dirty/modified bit? no, version 1: assume all writeable pages are dirty. no, version 2: set the page read-only, on a write, you'll run the page fault handler, can then if not supposed to be writeable, kill process. bad process. if supposed to be writeable: mark as dirty in our extra data structure mark it as read/write so that all further writes don't need to trap. (until made clean) a) difference between fifo+second chance and clock justifiable answer #1: they're the same, just that clock has a single circular list, while fifo+2C implies messing around with pointers to add to lists. justifiable answer #2: they're the same, but it's obvious how to give a page a third or fourth chance in the fifo+2C option. (maybe also obvious in clock how to add per-page counters if you wanted) unjustifiable, unsupported answer that seems compelling to me: fifo+second chance involves two "queues" and doesn't need a referenced bit. "replace" the page, but really just mark it "invalid" and move it to a list of pages that are really in memory, just not known to the processor as being still in memory (marked not valid). when a page fault happens, move the page back onto the active list. b) difference between ws and wsclock neat overview: http://cs.nyu.edu/courses/spring02/V22.0202-002/wsclock.html (it works, but perhaps because it combines many ideas each individually too small to study) authoritative: http://portal.acm.org/citation.cfm?id=806596 Allocation problem. (allocate as in choose how to distribute finite resources, not how to track free frames and respond to malloc() requests remember: No premium on contiguous frames in memory. How do we allocate CPU to each process? How do we allocate physical page frames to each process? Potential problems that page allocation should solve: - Perhaps we'd be better off swapping a process out entirely to avoid thrashing. - Each process that intends to make any progress will need at least some minimum number of pages in core to execute some instructions. (can't have more processes than pages) - A process of unclear or low priority could interfere with a high priority process by consuming too much physical memory. Let's assume my process needs another page: Options in how to alter the number of pages assigned to a specific process: Option 1: global, indifferent allocation. Find the oldest page in the whole system, kick it out. Potentially unfair to any process that has only a few pages in memory. May not prioritize. Option 2: (impractical) each process states its in-core memory requirements up front, works within that constrained allocation. Option 3: Fixed allocation scheme everybody gets the same. Option 4: Proportional allocation scheme everybody gets in physical memory some fraction of what they asked for (that might not be in core) Option 5: Priority allocation when a high priority process wants another page, take a page from any lower-priority process. How might the kernel tell that it's thrashing? Thrashing: unhappy state where much more time is spent paging stuff out/in than getting work done, which could maybe be avoided by medium-term scheduling. page replacement algorithm running a lot. lots of page faults. the clock is spinning. (lots of referenced bits are being set) in the working set scheme: all pages appear to be in the working set, accessed more recently than \tau ago. (maybe \tau is too large) many processes (not just one) taking faults. very little user time, mostly kernel time. (probably false positives) Virtual memory trick number 1: Memory mapped files What did mmap() do? Makes a set of virtual pages backed by a file (on disk). Some set of data structures will be initialized so that when a page fault occurs in that region, the page will be loaded from disk. When does a page get written back to disk? msync() OOM killer: killing big not(?) long running 1) done a lot of work, don't want to waste that. 2) or, might be the real problem not running as root not running high priority not init not xlock ==== Nov 10 Task: as soon as P4-B's deadline passes, be prepared to complete P5-A, due less than one week later. Postponing, for now: * performance tricks in demand paging: prepaging, page caches * for application writers: how not to fault on every instruction * memory allocation in the kernel, including slab allocator * buffer cache Directly to files (p5 relevant) * Project 5-A: format a file system, 5-B: read, 5-C: write. Postponing in files: * file system "types" and formats. What is a file? to a woodworker: finer than a rasp, more reusable than sandpaper. general: a named array of bytes, persistently stored, likely with some metadata / access restrictions / type. to the file system: blocks on a disk, metadata about the file metadata stored by the filesystem: size, atime (access time, when last read) ctime (creation time), mtime (modification time), mode (permissions), ownership, group ownership. metadata that can be stored by the OS in the filesystem "icons", .DS_Store, thumbs.db (cache of the images in a directory to make your windows explorer look pretty). to an application: an object that has a name and supports the file operations: read, write... What is a directory? A file with a special format. <- (nearly) everything that we want to have persistently stored will be a file. - list of other files (and directories) Operations On a file: open read write close seek On a directory: opendir readdir write^H^H^H^H^H - no writedir. interface to modifying a directory involves creating, removing, or renaming files, not arbitrary write. closedir seekdir What does a directory look like on disk? 1. its a file 2. contents are concatenated "directory entries" (dirents) (inode number, filename) pairs. (inode number, length of the name, filename) (catch: very likely to word-align the inode number) Two ways of having more than one name for the same contents: Hard link: two identical references to the same inode (using the same inode number), incrementing a reference count. Cannot cross file systems: inode is unique only within a file system. Soft link: a file that stores the name of another file; the os is in charge of opening that other file instead. "symbolic" link. There exist file locks (man flock) Information the kernel keeps about open files open returns a file descriptor file descriptor is an index into a process-local array of open files open file likely points to a structure that represents the file Per process: "pointer" or position within the file conditions of the open (read/write or not) Per file: contents, location in memory, lock state. Disk: Physical device: cylinders, tracks, sectors, heads, circular platters that spin, bringing a sector under the heads 5400 rpm on laptops, 7200 on cheap desktops, 15000 on servers. 11ms for the entire cylinder to run under a head at 5400rpm ~3.7ms at 15000rpm heads attached to arms that seek to a cylinder seek performance varies by distance to travel What does this mean for performance: sequential access (if you arrange adjacent blocks on the same cylinder) is fast leads to my favorite thing: the log structured file system. stuff that will be fastest to access is in the middle: part of the "fast file system" (second unix file system) moved the inodes to the middle. less distance to seek. What does the structure of the physical disk mean for caching? May want to just read the whole cylinder into memory, expecting the application to want to read sequentially. File system layout: Mirrors memory allocation strategies Contiguous allocation (much like segmentation in memory) record where it starts, and how long it is. advantages: sequential access is necessarily fast. know exactly where to read on a random access. disadvantage: external fragmentation: can't use some space because bits too small to be usable are stuck between allocations. internal fragmentation unchanged: still have blocks that won't be fully used. appending to a file might fail or require big copying. Linked allocation: Store in the inode (on-disk description of the file) the location of the first block in the file. Each additional block is reached through a link. Conceptually at the end of the block; In reality all of these pointers are in a table. advantages: get rid of external fragmentation; your "defragmenter" is more of a compaction tool. can keep all the "free" blocks in a file. disadvantages: sequential access might be a problem (require seeks) random access might be miserable (unless the table is entirely in memory) vulnerable to corruption in the table of block pointers. possible to have more than one pointer to the same block Next: Indexed allocation, multi-level indexed allocation. (these approximate project 5 schemes) --- Basic file systems; Aaron guest lecture. --- P4 - ensure that user code fails on a null dereference. (release test) ignore the segment that has both zero length and zero as its desired address. Review indexed allocation, multi-level indexed allocation Indexed vs. FAT: FAT: File's blocks are represented by the first block number from the first block number, can walk the array of blocks to connect the allocation. if the FAT gets big: might not want to hold it all in memory (have to go to disk to get it) might want to have had larger blocks (internal fragmentation) Indexed: file's blocks are represented by entries in an array in the inode, overflows into an indirect block, then overflows into a doubly-indirect blocks. How large a file: FAT: as large as the file system. Indexed: as large as the indirect blocks permit Where on disk are the inodes or FAT stored? initially: wherever convenient for the programmer (outer cylinder) intermediate: move to the middle What makes the fast file system fast (1) Fast File System: scattering the inodes into block groups. rather than put all the inodes together, make little block groups where some inodes and some blocks would live together. allocation goal: for an inode, if it needs another block, try to allocate within the group or within a nearby group. What makes the fast file system fast (2) Stores "fragments", taking the last part of the file, and storing it in a block shared with other files. What does a directory look like? File with specific formatting ( inode, record length, name length, name) struct gfs2_dirent { gfs2_inodenum inum; /* *must* be word-aligned */ unsigned char entry_length; /* after the inum. i.e., the min entry length is 4. max is 252, for alignment */ unsigned char name_length; /* may be < entry length - 2. */ char name[2]; /* where overrun provides the rest of the name; NOT (necessarily) null terminated. */ }; What's the difference between "ls" and "ls -l" -l needs to look at the metadata about the file, which means reading their inodes. What does an empty directory look like? two files: . and .. what are the name_lengths? 1 and 2 what are the entry_lengths? 4 and 4 (not including the 4 for the inode number, which would be redundant) what are the inode numbers? who knows. The superblock (and its replicas) store the parameters of the file system, e.g., block size. How do you find free blocks in the file system? Store an explicit bitmask of the used or free blocks. Where? not the superblock (might not be big enough, don't want to keep writing it in case we write it in a corrupt way) IN A FILE. though if the file system dies unexpectedly (without sync) that bitmask may be inconsistent with reality. EXT2 magic: EF53 preallocation. set aside some blocks of space for a file you're actively writing to. expect: if the machine is writing two files concurrently, it won't want to read them that way. keep the blocks from files together. claim blocks before they're needed so that later writes will construct sequential files. The buffer cache: read; if in cache, no disk read involved. no sleeping, no context switch, happiness. write; if you write, anyone who reads will see it immediately. but the disk might not. cache might absorb all changes. the task of sync is to flush this cache. ensure that every dirty buffer becomes clean. Tracking free space The buffer cache Log-structured file system Journaling file systesm