** begin lecture 2

man 2 socket
   ;; the 2 is for system calls:
   ;;    open, close, read, write
   ;; a 3 would be for library functions:
   ;;    printf, fgets, fopen

Create a socket:
  socket(AF_INET, SOCK_DGRAM, 0);
         ^^^^^^^ "internet": IP addresses, specifically IPv4 addresses
                  ^^^^^^^^^^ datagram -> UDP (messages that fit in a packet)
                     (as opposed to SOCK_STREAM -> TCP: text, ordered bytes, reliable)
                              ^ switches between datagram socket implementations,
                                of which there's only one.

Fragmentation ; to be covered a bit more later.
  A large IP packet, typically larger than 1500 bytes, would be split into
  fragments to be reassembled at the destination.
  For long-distance links, we use Path MTU discovery.
    MTU == Maximum Transmission Unit ; for Ethernet, 1500 bytes.
    This informs the sender what maximum size packet won't experience
    fragmentation.
  Fragmentation does happen; typically for NFS (network file system) traffic.
    Servers tend to be local because they trust IP addresses; you want to
    send 4096 bytes at a time.
  The maximum size of an IP packet is 65,535 bytes.
    (deduct 20 for the IP header, 8 for the UDP header: 65,507 bytes of payload)
  Jumbo frames (gig ether, -> ~9000 byte MTUs) to be loosely standardized.
  Small MTU -> multiplexing (other people get a chance)
               ability to detect bit errors. (2-bit errors, 3-bit errors)
  Large MTU -> efficiency, perhaps fewer headers wasting bandwidth,
               fewer times to look up where a packet goes.

Ports
  1. Ephemeral - allocated to clients, doesn't matter what they are,
     allow the kernel to tell which conversation a packet belongs to.
  2. Bound - allocated to servers, often well known ( 41710 ),
     allows clients to contact a specific service.

  Conversation is identified by the 5-tuple:
    Source port, Destination port,
    Source IP address, Destination IP address,
    Protocol (TCP or UDP)

  Implication is that you can send a message to one of our servers
  (at dest port 41710) from a socket (bound to) a different port.

  /etc/services lists all the IANA-allocated ports.
    ssh = 22, http = 80, ntp = 123.
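A minimal sketch of the 5-tuple idea from the Ports notes above: two packets belong to the same conversation only if all five fields match. The struct and function names here are hypothetical, chosen for illustration; a real kernel's lookup structure is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one conversation as identified by the 5-tuple. */
struct five_tuple {
    uint32_t src_addr;   /* IPv4 source address */
    uint32_t dst_addr;   /* IPv4 destination address */
    uint16_t src_port;   /* often ephemeral, chosen by the kernel */
    uint16_t dst_port;   /* often well known, e.g. 41710 */
    uint8_t  protocol;   /* IANA protocol number: 6 = TCP, 17 = UDP */
};

/* True only when every one of the five fields matches. */
static bool same_conversation(const struct five_tuple *a,
                              const struct five_tuple *b)
{
    assert(a != NULL && b != NULL);
    return a->src_addr == b->src_addr &&
           a->dst_addr == b->dst_addr &&
           a->src_port == b->src_port &&
           a->dst_port == b->dst_port &&
           a->protocol == b->protocol;
}
```

Two senders on the same host that both target port 41710 differ only in their source ports, which is enough for them to count as separate conversations.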
assignment: send/broadcast a message, then wait for messages and print them.

man 2 bind
   ;; all the include files you need are listed here.
bind - will give us a port. we can ask for one, or let it come.
bind - the addresses and ports are in network byte order.

#include <netinet/in.h>
struct sockaddr_in {            /* internet endpoint address, IP address and port */
    sa_family_t    sin_family;  /* AF_INET ; PF_INET */
    in_port_t      sin_port;    /* ...sin_port = htons(41710); */
    struct in_addr sin_addr;    /* ...sin_addr.s_addr = "224.0.50.112";  (wrong: not a string) */
                                /* ...sin_addr.s_addr = inet_addr("224.0.50.112"); */
                                /* ...sin_addr.s_addr = htonl(0xc000.....); */
                                /* ...sin_addr.s_addr = htonl(INADDR_ANY); if you were a server */
                                /* ...sin_addr.s_addr = htonl(INADDR_LOOPBACK); */
                                /* ...sin_addr.s_addr = inet_addr("127.0.0.1"); */
                                /* ...sin_addr.s_addr = htonl(0x7f000001); */
};
** that was an address, how to fill in the fields

setsockopt
  allow more than one process to bind the same port.
    --- for most servers, this would be bad.
    int one = 1;
    setsockopt(socket, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    // the ability to attach more than one process (socket)
    // to the same address and port
    // on BSD -- there is a SO_REUSEPORT option.
    before the bind.
  allow us to configure the multicast membership bits.
    --- subscription to a multicast address.
    as a sender, you're broadcasting; only interested receivers
    will pick the packet off the wire.

** begin lecture 3

  struct ip_mreq mreq;
  memset(&mreq, 0, sizeof(struct ip_mreq));
         ^^^^^ treated as a character pointer
                ^ set bytes to zero
                   ^^^^^^ that many bytes.
  // bzero(&mreq, sizeof(struct ip_mreq));
  // another "optional" field might get added
  mreq.imr_multiaddr.s_addr = inet_addr("224.0.50.111"); // use the right address.
  mreq.imr_interface.s_addr = htonl(INADDR_ANY);
  // setsockopt will return -1 on error; 0 on success.
  if(setsockopt(receiving_socket, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                &mreq, sizeof(struct ip_mreq)) == -1) {
    perror("setsockopt!");
    /* or, equivalently: */
    fprintf(stderr, "%s: %s\n", "setsockopt!", strerror(errno));
    exit(EXIT_FAILURE);
  }

  // first parameter is the socket.
  sendto(sending_socket, buffer_to_send, length_of_buffer,
         0 /* flags */,
         struct sockaddr *destination_address,
         socklen_t length_of_dest_address );

  struct sockaddr {        int family   <--- subclass.
  struct sockaddr_in {     // 16 bytes long on Linux
                           int family == AF_INET
  struct sockaddr_in6 {    // 28 bytes long on Linux
                           int family == AF_INET6

HOW to CONSTRUCT a BUFFER

  struct packet {
    struct header {
      // all those fields
    } hdr;
    char data[0]; // instead of zero, be 0xffff - 28 - sizeof(header) (or so)
    char *data;   // big mistake.
  };

  char *packet;
  ((struct header *)packet)->version = 1

  [ header ][ data ] in some region of memory you can point to.

  don't call sizeof(struct packet);
    the size of the packet is in the length field.
    * you keep track of it. not the compiler.
  one call to sendto per packet. (no fragmentation).

  sizeof(struct packet *) => 4 (on a 32-bit machine).
    -- why not to use the pointer scheme.

  struct packet *the_packet = malloc( 0xffff );
  the_packet->data
    if done the right way == (char *)the_packet + sizeof(header)
    if done the other (bad; the pointer) way, it'd be zero or some
    uninitialized bytes.

  struct header *the_header = malloc( 0xffff );
  memcpy(the_header + 1, "hello", 5);
  printf("%p %p\n", the_header, the_header + 1);

  recvfrom(receiving_socket, buffer, maximum_size_of_buffer,
           0 /* flags */, * address, * length);
    => returns the number of bytes read.
    address is an out parameter; length is in/out.
      length in   => how much room there is at *address (sizeof the struct).
      length out  => how many bytes of address were filled in.
      address out => from which source.
         source address is an ip address and UDP port;
         ip is end-to-end; original source (as IP believes)
         recvfrom does not filter on the source; to accept packets from
         only one source you would connect() the UDP socket to it.
         most likely, you don't want that here.
         for us, we *could* send a unicast response right back.
         the buffer won't have this information for us.

Debugging:
  printf.
    inet_ntop will convert IP addresses to strings.
    __FILE__ __LINE__ macros can print where you are.
  gdb.
    break one.c:50
  strace - what system calls your code invokes, with what parameters.
    << anyone can run it.
  ltrace ... dunno if it's useful here... might not be installed.
  tcpdump/wireshark - what bytes are in what packets being sent out.
    << you need to own the machine.

man 2 sendto
  ;; takes the struct sockaddr_in as destination

Back to course content, no more programming assignment stuff.
=====
Reliability.
  build a network
    from cheap components you can get at Fry's.
    from the overpriced components you can get at ...
      -- cheap oscillators (telling where a 1 and zero is).
      -- cheap wires. (original ethernet used catv trunk wire...
         twisted pair used copper telephone wiring)
  events outside our control.
    backhoes dig up wire.
    "baltimore tunnel fire" -- reasonably large outage on the east coast.
    reboot routers.
    power failures
  laws of physics.
    noise from external interference, speed of light,
    attenuation, fading, multipath

  Phy - encode the bits with enough redundancy that we recover them all.
  Data Link - retransmissions. if you believe it didn't get received, send again.
  Transport - retransmissions in TCP use "cumulative acknowledgement"

Phy - Problem: put bits on a wire.
  Subgoal 1: High bandwidth (throughput) (to have short bits)
  Subgoal 2: Low latency (delay): the first bits should get to the other side quickly.
  NRZ, Baseline Wander, Clock Recovery.

** begin lecture 4

if you're sending "hello"
  [ header, with all integer fields in network byte order ] [ hello ]
  not [ lleh\0\0\0o ]
  only things to convert are the 32-bit integers, and 16-bit integers.
    htonl      htons
  small trick when calculating the checksum. (but not for this one!)

Weird issues:
  ** some csic machines act lame.
  ** "nauseated" does not seem to be among them. <- login there.
  ** send me mail with which ones work and don't.
  ** if you can explain it correctly.... I will be impressed.

CLASSROOM ETHERNET
  -- we will eventually talk about how to share it.
  -- for now, use it as a motivating example for why we care about
     clock recovery.
  -- clock recovery: have to find the middle of the 1 or the zero
     and be able to tell when there are several consecutive 1's or 0's.
     (stadium concert video example)

Encoding schemes:
-- NRZ encoding : non-return to zero
   direct: 1 is high, 0 is low.
   RS-232   -12v => 1, +12v => 0 (pretty resilient)
   internal PC bus (but architects get to cheat, they have a clock)
   downside: no help for the clock.
     clock recovery: repeated 1's or repeated zeroes could get lost.
     baseline wander: spends too long at some level, average tracks
       up (or down) into the noise.
   what we need: transitions. 0's and 1's in equal parts, and
     frequently changing.
     (transitions help with both problems (clock recovery and baseline wander))

-- NRZI
   to encode a 1: change (from low to high or high to low)
   to encode a 0: no change
   11111111111 (original signal)
   HLHLHLHLHLH (H = high, L is low)
   solves half the problem (consecutive 1's)
   doesn't solve the other half (consecutive 0's)

-- Manchester Encoding
   to encode a 1: ( high/low )
   to encode a 0: ( low/high )
   1 1 1 1 1 1 1 1 1 1 1   (original signal)
   HLHLHLHLHLHLHLHLHLHLHL
   1 0 1 0 1 0 1 0 1 0 1   (original signal)
   HLLHHLLHHLLHHLLHHLLHHL
   Good: each bit gets a transition.
   Bad: send at half the rate. (many transitions encode nothing)
   this is in all ethernet before 100 Mbit.

-- 4B/5B
   4-bit sequence in your message, turn it into 5 bits on the wire.
   Have a table. This table will ensure that there are never more
   than three consecutive zeroes.
   16 possible 4-bit sequences.
   32 possible 5-bit sequences.
     not gonna use 00000
     not gonna use 00001
     not gonna use 00010
     not gonna use 00011
     gonna use 10001 (maybe... not in the list)
     not gonna use 11000
     not gonna use 01000
     could totally use 11111
     could totally use 11001
   catch is you put two of them together.
   all valid codewords start with at most one zero, and end with
   at most two. (verified!)
   rule eliminates less than 16. rest can be used for framing / control
   there is a table in the book, you could double check my rule
   generated by no notes.

-- 4B/5B + NRZI
   recall 4B/5B: Have a table. This table will ensure that there are
     never more than three consecutive zeroes.
   recall NRZI: solves half the problem (consecutive 1's)
   0100  1000  0110  1001   (original signal 0x4869)
   01010 10010 01110 10011  (to 4b/5b) spaces not transmitted
   01100 11100 01011 00010  (add nrzi)
   scheme in fast (100mbit) ethernet.
   wikipedia pages are fairly good...
   (it's not important to me that you could match the encoding
    scheme to the ieee standard.)

-- scrambling
   SONET (optical, serious metropolitan or long-distance networks)
   0100100001101001 (your bits)
   0110010111010010 (*random* signal everyone has agreed on)
   0010110110111011 (xor, and pray)
   advantage: no extra bits.
   disadvantage: could be unlucky.
     (ideally, you would not get unlucky more than once)
   devices that implement this are expensive.
     squeeze out as much performance as possible.
     might mean that you don't need as many transitions most of the time?
     not fighting three consecutive bits, maybe fighting 20 consecutive bits.
   based on not assuming random data.
     * if you believed all traffic to be encrypted, you wouldn't need
       scrambling.
     * that doesn't happen.

FRAMING - mark the beginning and the end of a string of bits (frame)
  Options:
    (i) 4b/5b codewords (symbol outside the vocabulary)
    (ii) sentinel (like double-quote in programming language)
    (iii) fixed-size frames (using timing)

  4b/5b scheme: 10001 (valid codeword, not in the table)
    stick that at the beginning and end of a frame.

  HDLC as used in some PPP.
    append 01111110
    I believe there is both a bitwise version (counting 6 1's after a zero)
    and a byte-wise version (seeking 0x7e as a frame delimiter)
    For HDLC, just like a \" in a string, you have to escape:
      adding an extra zero after five 1's
      (regardless of whether a zero or 1 follows the five 1's.)
      [ check with the text ]
    If the end-of-frame marker is in the message, the sender will add
    that zero to make sure that it's not in the message. The receiver
    will remove the zero.
    What if we want to send 011111[0]0 somewhere deep within the frame
    (so it would have frame markers)?
      Still have to stuff!
      [011111]{0}[00]    where [original] {stuffed bit}
    Just like in the quoted string... you have to be able to recognize
    the escape (as in "c:\\" where you want the escape character)
    Means you have some probability of lengthening the frame. (not free)

  (iii) fixed size frames
    not all that interesting.
    ATM cell is 48 bytes long (plus 5 of header).
    if everything is of this fixed size, don't need to waste any bytes
    delimiting the beginning and end of a frame.
    may be some periodic clock-synchronizing signal.

ERROR DETECTION.
  Version 1: Parity Bit
    count the 1's, if odd, parity bit is 1, if even, 0.
    add the 1's.
    chance of detecting a single bit error?
      100% (the error might be in the parity bit too)
    chance of detecting a two-bit error?
      0%.
    if you have only 8 bits to send (a character over a serial connection)

  Version 2: Checksum.
    instead of adding the 1's, add the 16-bit words.
    "internet checksum" is the ones complement of the sum of the
    16-bit words.
    if you store the checksum as the negative sum of 16-bit words...
      to check, just *add* the 16-bit words and hope you get 0
      (0xffff, which is zero in ones complement).
    an aside about handling the carry when adding 16-bit words for a
    checksum: the carry out of the top bit wraps around and is added
    back in at the bottom.
        0xffff
      + 0x0002
        ------
       0x10001  ->  0x0001 + 0x0001 = 0x0002
    I will bring in some code.
    chance of detecting a single bit error?
      100% (the error might be in the checksum too)
    chance of detecting a two-bit error?
      depends on which two bits.
      I'd guess it's still 80%-like...
    considered to be rather weak.
      (relatively few bytes to protect, have another scheme at your disposal.)

** lecture 5

  Oscilloscope: delay, noise, collisions, modulation, preamble.
  CRC: like checksum, but with division.
  2-D parity: simple error correcting code.

Submit server woes:
  "could not run test process" - that's the submit server being lame.
  don't turn in a compiled version of "zero". it can cause the build
  to fail so that your code doesn't execute.

Assignment hints:
  PA1 due friday.
  use two sockets; you'll need them for PA2.
  *Don't* listen to only your packets. Interoperability is the goal.
  *Don't* crash on a bad packet.
  Don't print bad packets.
    Bad means invalid length, newer version, unknown protocol (maybe more)
  "Robustness principle" <- be conservative in what you send,
    liberal in what you accept.
    (though in this set of assignments, don't print invalid packets; it
     just means, don't reject packets for having a source address or dest
     address, or for any other reason I think is arbitrary)
  There's no need to fork.

Oscilloscope. fun for me.

CRC
  we end up adding to the packet the *remainder* of a division.
  Division over a binary field with no carry.
  Agreed upon divisor.
    As an example: 10011 -- x^4 + x + 1
    * important that the first and last bits be 1.
    * different links can require different polynomials
  Message: 1101011011
  Divide the message (with four zeroes appended) by 10011 to find
  the remainder.
  Only other trick is that there is no carry.
    1-1=0, * 0-1=1 *, 1-0=1, 0-0=0.

            1100001010
        --------------
10011 ) 11010110110000
        10011
        -----
         10011
         10011
         -----
              10110
              10011
              -----
                10100
                10011
                -----
                  1110  <- that is the number we want.

  transmit 1101011011 followed by 1110.

More fun: 2-D parity.
  16 bits to send: 48 (H) 69 (i)
    0100-1
    1000-1
    0110-0
    1001-0
    ----
    0011 0 <- not the diagonal, just the row or the column
              (both are the same)
  25 bits to send 16.

    0100-1  v
    10I0-1  x  not good.
    0110-0  v
    1001-0  v
    ----
    0011 0  v
    vvxv v
  if just one bit is broken, can correct it.
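The single-bit correction just shown can be sketched in C. This is an illustration under stated assumptions, not the course's code: the 4x4 block size matches the "Hi" example, the struct and function names are made up here, and the corner bit that checks the parity bits themselves is left out, so a lone corrupted parity bit is reported as uncorrectable rather than fixed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROWS 4
#define COLS 4

/* Hypothetical layout: one bit per array element, plus parity bits. */
struct parity_block {
    uint8_t data[ROWS][COLS];
    uint8_t row_parity[ROWS];
    uint8_t col_parity[COLS];
};

/* Sender side: fill in the row and column parity from the data bits. */
static void compute_parity(struct parity_block *b)
{
    assert(b != NULL);
    memset(b->row_parity, 0, sizeof(b->row_parity));
    memset(b->col_parity, 0, sizeof(b->col_parity));
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            b->row_parity[r] ^= b->data[r][c];
            b->col_parity[c] ^= b->data[r][c];
        }
    }
}

/* Receiver side.  Returns 0 if every parity check passes, 1 if exactly
 * one row and one column fail (the data bit at their intersection is
 * flipped back), and -1 for any other pattern of failures. */
static int check_and_correct(struct parity_block *b)
{
    int bad_rows = 0, bad_cols = 0, bad_row = -1, bad_col = -1;

    assert(b != NULL);
    for (int r = 0; r < ROWS; r++) {
        uint8_t p = 0;
        for (int c = 0; c < COLS; c++)
            p ^= b->data[r][c];
        if (p != b->row_parity[r]) { bad_rows++; bad_row = r; }
    }
    for (int c = 0; c < COLS; c++) {
        uint8_t p = 0;
        for (int r = 0; r < ROWS; r++)
            p ^= b->data[r][c];
        if (p != b->col_parity[c]) { bad_cols++; bad_col = c; }
    }
    if (bad_rows == 0 && bad_cols == 0)
        return 0;
    if (bad_rows == 1 && bad_cols == 1) {
        b->data[bad_row][bad_col] ^= 1;
        return 1;
    }
    return -1;
}
```

Flipping four bits that form a 2x2 rectangle leaves every row and column parity unchanged, so check_and_correct returns 0 for that case even though the data is wrong.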
    0100-1  v
    1000-0  x
    0110-0  v
    1001-0  v
    ----
    0011 0  v  <- the zero can check the parity bits.
    vvvv x
  can correct the parity bit

  2x2 rectangle of corruption can go undetected.
    0100-1
    1II0-1   killed four, specifically chosen bits.
    0OO0-0   and can have an undetected error.
    1001-0
    ----
    0011 0

Playing with errors (aside)
  Burst errors - many errored bits, all at once.
    - many potential reasons for burst errors.
  Single bit, random errors - should be easy, cksum, crc.
  Convert burst errors into fake single bit errors.
    taking "rows" of bits and turning them into columns.

** lecture 6

  Homework 1 return. not in alphabetical order. my ta is slacking!
    (four left, after class)
  PA1 questions, I'm sure.
  ARQ
  Sequence Numbers and ordering.

Debugging PA1:
  I'm sending but not receiving.
    * if you copied and pasted the header structure, you may be missing
      source_address, because it was gobbled by the comment that went
      off the edge of the page.
  I'm receiving but not printing.
  It works fine for me, but not on the submit server.
    * you didn't put the port back to 41710.
  bind() fails on nauseated. eventual public shame threatened.
    * don't have to reuseaddr on the socket you don't bind.
  remind neil to un-secret-ize the currently secret tests.

Reliability:
  phy layer; ensure that bits are reasonably likely to be recovered.
    clock recovery, baseline wander, redundant bit encodings - 4b/5b.
  data link layer; often crc. ARQ.
    (checksums, even correcting codes like the 2-D parity)
  transport, network; often checksum. ARQ.

ARQ:
  What do we do if the CRC doesn't match, if we don't see the end of
  frame, or if there are two-bit errors in 2-D parity? (or any of
  several other errors we haven't talked about yet)
  "automatic repeat request"
  acknowledgements and retransmissions.
    "please send that again"
    "I've received 1, 2, and 4" (implicitly saying "3" is missing)
    "I've received everything up to 4" (which might imply "5" is missing)
    "I've waited and waited for her to call. I will call her again."
  Assume for the moment that we have several packets to send.
  (more than just 1500 bytes).
    * if we know that the destination can take ten packets at a time,
      send ten packets at once, and wait for the receiver to tell us
      which ones they're missing or if they received all of them
      -- if they don't say anything, wait... 5 seconds and then try again.
    [[ push this idea on the stack, so we can describe a
       simple-as-possible scheme ]]

  We can send only one packet "at a time".
    -- if we have only one wire, only one radio medium, there's no
       reason to send more than one packet without getting feedback.
  Label these packets "0" or "1".
    Send packet 0. Get ack for packet 0.
    Send packet 1. Get ack for packet 1.
    Send packet 0. Get ack for packet 0.
    Send packet 1. Get ack for packet 1.
  (( in the layer diagram, the network layer helps us cross different
     links, potentially of different types, transport layer sits atop
     that, so the constraint about having only one wire would be
     inappropriate for transport layer (i.e., TCP) stuff. so TCP has a
     sequence number much larger than one bit.))

    Send packet 0. Get ack for packet 0.
    Send packet 1. *LOST permanently, dropped* not gonna get an ack!
    Resend packet 1. Get ack for packet 1.
    Send packet 0. Get ack for packet 0.
    Send packet 1. Get ack for packet 1.

  Another situation
    Send packet labeled 0. Get ack for packet 0. *LOST permanently*
      not getting the ack. don't know whether it was the packet that
      got lost or the ack that got lost.
      No one suggested retransmitting the ack! Great!
      Sender responsibility is common.
    Resend packet 0. Get ack for packet 0.
    Send packet 1. Get ack for packet 1.
    Send packet 0. Get ack for packet 0.
    Send packet 1. Get ack for packet 1.

  If packets can go off into some corner of the network, and so, when
  retransmitted, there are two copies of the same packet labeled "0",
  an older 0 can sneak in before the next zero, causing badness.
    Send packet labeled 0. disappears.
    Resend packet 0. Get ack for packet 0.
    Send packet 1. Get ack for packet 1.
    Send packet labeled 0.
      Prior packet 0 is resurrected, arrives before the new packet 0.
      Get ack for packet 0.
    Send packet 1. Get ack for packet 1.

  We can use a one-bit sequence number IF packets can't hang out and
  appear at inconvenient times. On a single wire.
  One-bit sequence number will not work well on a large network because
  of the potential for a delayed packet to sneak in.
  If the round trip time is large, a one-bit sequence number keeps you
  from taking advantage of the bandwidth by sending more than one
  packet at once.

How big a sequence number space do you need? if not over a single link.
One bit is not enough, how many do you need?
  * fill the pipe. product of the bottleneck bandwidth and delay.
    (multiply by two... I think)
    -- this is enough to get the performance available
    ** bytes or packets per second times round trip time times 2.
  * large enough that no previously-transmitted (potentially duplicated)
    packet can sneak in.
    TTL; time-to-live; supposed to be decremented by 1 every second that
      a packet lives in the Internet, in practice also decremented on
      every router it passes.
      - *mechanism* for keeping packets from living a long time.
      - ensuring that packet doesn't consume infinite network resources.
        (doesn't loop forever)
    MSL "maximum segment lifetime" we assume that all the packets will
      be cleared in at most this long. 2 minutes (120 seconds).
      TTL is an 8-bit field, so at most 255. In all likelihood most
      implementations set it to 64.
    ** bytes or packets per second times MSL
       (I'm leaving it without the times 2, but I'm not certain)

TCP numbers bytes.
  * why?
  ** if you get an out-of-order TCP segment, you know exactly where in
     the buffer these bytes will go based on the sequence number of the
     first byte (you don't have to guess based on the sizes of the
     missing packets).
  ** [ could imagine instead of IP fragmentation, allowing TCP-level
       fragmentation so that the TCP pieces would arrive. Similar thing
       with NATs ]
  ** "telnet" people would actually type. telnet over TCP.
     what happens when you type?
     each character gets sent in its own packet. (small white lie.)
     what happens if a character-in-a-packet gets lost?
       you'll retransmit the character.
     what if you type fast?
       you can retransmit all the characters at once -- TCP can
       re-segment previously transmitted data.
         3 n
         4 s <-lost
         5 p
         6 r <-maybe also lost?
         retransmission: 4 spr
  ** or for path mtu (mtu==maximum transmission unit) discovery
     (alternative to fragmentation)
     send a large packet, get an error, split it into smaller packets,
     each of which will fit without fragmentation.
     mtu associated with your interface. 1500, aside from tunnels.
     path mtu, smallest mtu along the path to the packet's destination.
     as link bandwidth increases, less reason for a small mtu
     [ 1500 byte IP | GRE | 1460 byte IP | TCP | something ]

  You could calculate the maximum performance of TCP knowing that it
  has a 32-bit sequence space. 4 billion bytes.
    - "actually might be a problem" ; physicists stumble upon it.

** lecture 7

  PA1 Review.
  PA2 Distribution; discussion embargo lifted! techniques you'll need.
  HW3 Distribution.

  #include
  #include
  #include
  #include
  #include
  #include
  #include

  struct msgFrame {
    uint8_t version;              /* must be 1 */
    uint8_t ttl;                  /* must be 1 */
    uint16_t payload_length;      /* bytes following the header */
    uint32_t account_identifier;  /* digits of your account name */
    uint32_t source_address;      /* unused for now */
    uint32_t destination_address; /* unused for now */
    uint16_t checksum;            /* unused for now */
    uint16_t protocol;            /* must be 1 */
    char* msgString;              /* stores the string to be transmitted */
  };
      struct msgFrame *f = malloc(0xffff);
      f->msgString = ?
      memcpy(f->msgString, "hello", 5); /* would barf! */

  const int DEBUG = 0;
  const char* HOST_ADDR = "224.0.50.111";
  const int HOST_PORT = 12345;

  int main(int argc, char* argv[]){
    char* str_to_send = NULL;          // stores message from cmd line
    int sd = -1, sd2 = -1;             // socket descriptors
    struct sockaddr_in addr_local;     // to send to the multicast server
    struct sockaddr_in addr_multicast; // to recv from the multicast server
    struct ip_mreq multicast_request;  // request to join multicast server
    const char* multicast_ip = "224.0.50.11";
    unsigned int multicast_port = 12345;

  #define PORT 12345
  #define IP "224.0.50.111"
  #define MAX_BUFFER 65535

  struct header {
    uint8_t version;
    uint8_t ttl;
    uint16_t payload_length;
    uint32_t account_identifier;
    uint32_t source_address;
    uint32_t destination_address;
    uint16_t checksum;
    uint16_t protocol;
  };

  struct packet {
    char * message;
    struct header hdr;
  };

  struct header{
    uint8_t version;              /* must be 1 */
    uint8_t ttl;                  /* must be 1 */
    uint16_t payload_length;      /* bytes following the header */
    uint32_t account_identifier;  /* digits of your account name */
    uint32_t source_address;      /* unused for now */
    uint32_t destination_address; /* unused for now */
    uint16_t checksum;            /* unused for now */
    uint16_t protocol;            /* must be 1 */
  };

  //set up the header
  struct header h;
  h.version = 1;
  h.ttl = 1;
  h.payload_length = sizeof(argv[1]);
      h.payload_length = 4; /* same as (sizeof a pointer, not strlen). */
                            /* also not htons'd */
  h.account_identifier = 99; /* not htonl'd */
  h.protocol = 1; /* not htons'd */
  h.account_identifier = htonl(036); /* octal */
      h.account_identifier = htonl(30); /* same as. */

--
Programming Assignment 2: maintain a neighbor table.
  ad hoc network with wireless nodes, they like to track who their
  friends are.
  just above the physical layer, but not yet routing.
  want to learn who we can reach directly; can later advertise this
  information to our neighbors so that everyone can reach everyone else.

  "hello" message.
    (sort of as in OSPF).
    we just send it.
    receiver will track who he receives it from.
    build a table of nodes from which he's received a hello message
    recently.
  "soft state". information that is essential, but not persistent.
    other examples include the arp cache.
  If the neighbor dies, he won't send hello.
  If the link dies, we won't see hello.
    both cases, we should drop the neighbor.

  Table for this assignment includes:
    IP address and port of the neighbor. out of the recvfrom.
      I would like for the port to not be 41710.

  second "sending" socket -- we can send unicast responses only to your
  process later.
    (the "sending" socket's port should be not-reused; your own; exclusive.)

    unicast_socket = socket(AF_INET, SOCK_DGRAM, 0);
    sendto(unicast_socket, message, length, 0, dest_addr,
           sizeof(struct sockaddr_in));
    now unicast_socket has a port, arbitrarily chosen

  Two examples:

    /* will work well; will presumably have only one instance in your
       neighbor's neighbor table. you're holding on to the global socket */
    unicast_socket = socket(AF_INET, SOCK_DGRAM, 0);
    sendto(unicast_socket, message, length, 0, dest_addr,
           sizeof(struct sockaddr_in));
    sendto(unicast_socket, message, length, 0, dest_addr,
           sizeof(struct sockaddr_in));
    both messages have the same source port.

    /* will lead to suffering; just make the sending socket a global.
       it's okay. because we want to be able to read from it later. */
    int send_hello_message() {
      /* don't recreate the socket every time...
         that is, this fragment is bad. */
      unicast_socket = socket(PF_INET, SOCK_DGRAM, 0);
      sendto(unicast_socket, message, length, 0, dest_addr,
             sizeof(struct sockaddr_in));
      close(unicast_socket); /* don't do this */
    }
    while (1) { send_hello_message(); }
    each message will have a different source port. /* don't do this */

  IF you want to know what port you bound to, getsockname.
    ... after call to bind or sendto.
    BUT this is not necessary.
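A minimal sketch of the getsockname idea above, for anyone curious which port the kernel picked. The function names are hypothetical, and binding explicitly to port 0 is one assumed way to get an ephemeral port; as the notes say, none of this is required for the assignment.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket and bind it to port 0, which asks the kernel to
 * choose an ephemeral port.  Returns the descriptor, or -1 on error. */
static int open_unicast_socket(void)
{
    struct sockaddr_in any;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1)
        return -1;
    memset(&any, 0, sizeof(any));
    any.sin_family = AF_INET;
    any.sin_addr.s_addr = htonl(INADDR_ANY);
    any.sin_port = htons(0);
    if (bind(fd, (struct sockaddr *)&any, sizeof(any)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return the local port of fd in host byte order, or -1 on error. */
static int local_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    assert(fd >= 0);
    memset(&addr, 0, sizeof(addr));
    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1)
        return -1;
    return ntohs(addr.sin_port);
}
```

Because the socket is kept open, repeated calls to local_port report the same port, which is the property the neighbor table relies on.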
    /* this is fake syntax */
    bind(multicast_socket, { AF_INET, 224.0.50.111, 41710 }, number);
    /* unnecessary: */
    bind(sending_socket, { AF_INET, INADDR_ANY (0) , 0 }, number);

    /* I think this is close to legal syntax for struct stuff. */
    struct sockaddr_in m = { .sin_family = AF_INET,
                             .sin_port = htons(41710),
                             .sin_addr = { .s_addr = inet_addr("224.0.50.111") } };
    bind(multicast_socket, (struct sockaddr *)&m, sizeof(m));

  source address: function of (getpid(), account_id, 12), htonl'd

  send "hello" every 25 seconds.

  Table for this assignment includes:
    IP address and port of the neighbor. out of the recvfrom.
    network address from the header (source address in our special header)
    account_id (from the header)
    is alive? (have we heard from this guy less than 2 minutes ago).
      you can just omit, discard any entry that is not alive.
      intent: support debugging.
    time remaining (how much longer we'll think him alive for).
      0 if nothing left.

  see another message, update the existing entry.
    (ip,port) <- unique.
    (net address from header) <- unique
    MAY update on either/both, whatever, I won't screw with it.

  when stdin is closed, die.
    ^D will close stdin.
    return value from fgets is NULL... read returns zero.

  "print neighbor table" on stdin. (without quotes)
    you must respond. quickly. no sleeping for 25 seconds

  What is the file descriptor number of stdin? 0.

  If you print a value, convert it from network byte order.

  The two-minute expiry.
    when you receive, note the time.
    when you print, do the comparison.

  NO BUSY LOOPS!
  NO SLEEPING FOR 1 SECOND ALWAYS.
  How? you may ask? "select" "poll"
    you can choose.
    select is traditional, portable, ugly, not very scalable.
    poll is new and exciting, and uses an array of structures,
    scales better.

  Problem: multicast socket, unicast socket, stdin, and we have a timer
  (we have to wake up every 25s).
  "select" allows us to punt this problem to the kernel.

  select takes:
    first param: maximum file descriptor number you're using + 1.
      max( unicast_socket, multicast_socket) + 1.
    three bitmasks: readable, writable, exception(?)
      we only care about the "readable"
        that is, that won't block if we call read or recv.
      on input, we set the bit if we're interested.
      on output, kernel sets the bit if there's something interesting.
    last param: how long (duration) to wait before waking us up.
      will it always be 25? NO!!!
        right after you sent a hello, probably 25 (25 minus epsilon)
        but not if you're interrupted.
      will it always be 1? NO!!!
        (because you know when you have to send a new hello.)
      struct timeval *.

  when you send the first hello, call
    gettimeofday(struct timeval *tv, NULL)

    /* globals */
    struct timeval now;            /* has only two fields: tv_sec, tv_usec */
    struct timeval next_hello;
    struct timeval sleep_interval;

    hello_transmission() {
      sendto(unicast_socket, hello message, len, 0, multicast addr, maddrlen)
      gettimeofday(&now, NULL)
      next_hello.tv_sec = now.tv_sec + 25;
      next_hello.tv_usec = now.tv_usec;
    }

    timersub( &a, &b, &c )    /* c = a - b */

    while() {
      /* why you need to handle the carry/borrow from the usec */
      /* what if next_hello.tv_usec < now.tv_usec? */
      gettimeofday(&now, NULL)
      sleep_interval.tv_sec = next_hello.tv_sec - now.tv_sec
      sleep_interval.tv_usec = next_hello.tv_usec - now.tv_usec
      select(..,..,.,., &sleep_interval)
    }
    take next_hello and subtract now. (assuming positive)
    macro: timersub() handles the borrow.

** lecture 8

  Moving from basic reliability to performance, heterogeneity.
    * we mostly have the ideas of reliability.
    * now we just want to make it work faster.
  Early ARPANET transport and Sliding Window startup.
    * send more than one packet at a time.
  Cumulative and Selective ACK.
  Flow control.

  Phy and data-link layer tend to have only one packet outstanding
  at a time.
  Most of these (concurrency) issues are TCP related.
    < three layers up from the phy.

PA2 Sockets:
  "multicast socket" -> "multicast receiving socket"
    only receives. only multicast.
  "sending socket" to "unicast socket", renamed to "something"
    sends to multicast address,
    future: send and receive unicast.
    present: the address information (IP, port) are remembered by your
      neighbor table.
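The sleep-interval subtraction from the PA2 select notes above, with the borrow from the microseconds handled by hand, might look like the sketch below. The function name time_until is made up for illustration; on systems that provide the BSD timersub() macro in <sys/time.h>, that macro performs the same subtraction, though it does not clamp a deadline that has already passed.

```c
#include <assert.h>
#include <sys/time.h>

/* Returns deadline - now, suitable for passing to select() as the
 * timeout.  If the deadline has already passed, returns zero so that
 * select() returns immediately instead of receiving a negative time. */
static struct timeval time_until(struct timeval deadline, struct timeval now)
{
    struct timeval d;

    assert(deadline.tv_usec >= 0 && deadline.tv_usec < 1000000);
    assert(now.tv_usec >= 0 && now.tv_usec < 1000000);

    d.tv_sec = deadline.tv_sec - now.tv_sec;
    d.tv_usec = deadline.tv_usec - now.tv_usec;
    if (d.tv_usec < 0) {
        /* borrow one second's worth of microseconds */
        d.tv_sec -= 1;
        d.tv_usec += 1000000;
    }
    if (d.tv_sec < 0) {
        /* deadline already passed: do not sleep at all */
        d.tv_sec = 0;
        d.tv_usec = 0;
    }
    return d;
}
```

In the main loop this would be called with next_hello and a fresh gettimeofday() result just before each select(), so the wait shrinks whenever stdin or a socket wakes the process early.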
Early ARPANET transport.
  Start from stop-and-wait (one packet at a time)
    * well defined performance: 1 packet per RTT
      RTT = round trip time.
    * nothing to do with the bandwidth of the wire.
  Run in parallel! if you run four stop-and-waits concurrently,
    get 4x the performance
  End-to-end connection between sender and receiver.
  If you can have one stop and wait session from S to R, very little
  stops you from having four.
  In a sense, netscape/mozilla do a similar thing. open many
  connections to the same server at the same time.

  Catch (or at least one of the catches)
  * four concurrent streams
    sender's perspective (send pkt, recv ack)
      time ... ->
      A pkt ack;pkt ... wait for timeout ... rxpkt
      B pkt ack;pkt ack;pkt ack;pkt
      C pkt ack;pkt ack;pkt ack;pkt
      D pkt ack;pkt ack;pkt ack;pkt

  Aside: simplex, half-duplex, full-duplex links.
    full-duplex: (cat 5 100 Mbit: one pair for up, one pair for down)
    half-duplex: (only one guy can send at a time, but both ends can send)
                 (old school ethernet, 802.11)
    simplex: (satellite like links; one direction only.)
    (one type of channel to send one direction, a different type for
     the other.)
  Speaking about stuff at the transport layer, these issues kind of
  don't matter; we are up toward the level where there are devices
  (routers or switches) connecting many wires, these devices have
  buffers (can queue packets if the medium/wire is busy); the
  connections are long-haul (many milliseconds), making performance
  dominated by latency, not bitrate of the wire.

  concurrent streams:
    * complicated putting the original stream back together.
    * loss of packet detected only by timeout.
    * (don't know) what happens when some streams get way ahead of others?

*Sliding Window Transport*
  Decide on a "window" size.
  Send "window" packets into the network, unacknowledged, at a time.
  Acknowledgements are *cumulative*.
    (not selective, can't ack out-of-order data.)

  IETF RFC 793 (TCP) - RFC's are the standards documents for the Internet.
if the number of the RFC is less than 1500, it's probably good, well written. RFC 791 (IP) Imagine you have an infinite stream of bytes to transmit. * In TCP, bytes have sequence numbers, acknowledgements apply to bytes. could send the same bytes in two different packets. At the sender, we have the following bytes to transmit, and we have a window of 5 bytes. and a packet size (MSS) of 1 bytes. -------------------------------------------------------- bytes already transmitted and acknowledged. (pre-window) bytes transmitted but not acknowledged. (in the window) bytes we're allowed to transmit, but haven't yet. (in the window) bytes we're not yet allowed to transmit. (beyond the window) bytes we (as the kernel) haven't even been given. At the receiver, ---------------------- bytes that have been delivered to the app. bytes that we've received in order that are ready for the app. (been acknowledged) bytes that we've received, but not in order. non-existent space to store bytes beyond the window. ---=----= - -- - | Send Sequence Variables | | SND.UNA - send unacknowledged | SND.NXT - send next | SND.WND - send window | SND.UP - send urgent pointer | SND.WL1 - segment sequence number used for last window update | SND.WL2 - segment acknowledgment number used for last window | update | ISS - initial send sequence number | | Receive Sequence Variables | | RCV.NXT - receive next | RCV.WND - receive window | RCV.UP - receive urgent pointer | IRS - initial receive sequence number | | [Page 19] | | September 1981 |Transmission Control Protocol |Functional Specification | | The following diagrams may help to relate some of these variables to | the sequence space. 
| | Send Sequence Space | | 1 2 3 4 | ----------|----------|----------|---------- | SND.UNA SND.NXT SND.UNA | +SND.WND | | 1 - old sequence numbers which have been acknowledged | 2 - sequence numbers of unacknowledged data | 3 - sequence numbers allowed for new data transmission | 4 - future sequence numbers which are not yet allowed | | Send Sequence Space | | Figure 4. | | The send window is the portion of the sequence space labeled 3 in | figure 4. | | Receive Sequence Space | | 1 2 3 | ----------|----------|---------- | RCV.NXT RCV.NXT | +RCV.WND | | 1 - old sequence numbers which have been acknowledged | 2 - sequence numbers allowed for new reception | 3 - future sequence numbers which are not yet allowed | | Receive Sequence Space | sliding window sender numbers will represent sequence numbers on the one-byte packets. transmissions : 1 2 3 4 5 6 7 8 9 A acknowledgments:0 1 2 3 4 5 <- each ack includes everything up to and including* that value. * not actually TCP transmissions : 1 2 3 4 5 67 89 acknowledgments:0 (acks 1 & 3 lost) 2 4 <- * "real" TCP uses "delayed acks" - intentionally sending only every other ack for in-order receipt of packets. ((almost) doesn't hurt anything). transmissions : 1 2 3 4 5 (2 lost) 6 acknowledgments:0 1 1 1 1 1 <- scheme for cumulative is to send an ack on every received packet. an ack says what you have. transmissions : 1 2 3 4 5 (2 delayed) acknowledgments:0 1 1 1 1 5 <- 2 arrived ^ Retransmission schemes in a sliding window transport protocol: * timeout. some (long) time went by. retransmit * RTO > RTT, and RTO > 1 second. * "fast" retransmission: three duplicate acks. * "janey hoe-style" retransmission: any duplicate ack if ack is overdue (rtt). (won't talk about it much) overdue > RTT. transmissions : 1 2 3 4 5 (2 lost) 6 2 acknowledgments:0 1 1 1 1 1 6 <- transmissions : 1 2 3 4 5 (2 & 3 lost) 6 2 acknowledgments:0 1 1 1 1 2 <- when packet received 1 4 5 6 2 difficult to recover using only fast retransmission. 
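The duplicate-ack traces above can be turned into a small sender-side rule. This is only a sketch under the same simplification as the traces (one-byte packets, an ack names the last packet received in order, which is not what real TCP does) and it ignores sequence number wraparound. The struct and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sender-side state; names are illustrative. */
struct ack_tracker {
    uint32_t last_ack;   /* highest cumulative ack seen so far */
    int      dup_count;  /* how many times last_ack has been repeated */
};

/* Feed one incoming ack. Returns true when the caller should
 * fast-retransmit packet last_ack + 1 (on the third duplicate). */
static bool on_ack(struct ack_tracker *t, uint32_t ack)
{
    if (ack > t->last_ack) {       /* new data acked: the window slides */
        t->last_ack = ack;
        t->dup_count = 0;
        return false;
    }
    if (ack == t->last_ack) {      /* receiver is still missing last_ack + 1 */
        t->dup_count++;
        return t->dup_count == 3;  /* fire exactly once per loss */
    }
    return false;                  /* stale ack, ignore */
}
```

With the "2 lost" trace, the acks 1 1 1 1 trigger a retransmission of 2 on the fourth ack (the third duplicate). With "2 & 3 lost", the same rule recovers 2, but the ack that comes back is only 2, so 3 still waits for more duplicates or the timeout.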
[ 12345 [ 6789A 12345 ] 6789A ]

Two main ways to alter the size of the window:
  FLOW CONTROL - manage the receiver's buffer.
     if the receiving app is slow, don't want to send packets through
     the network that would be dropped at the destination.
  CONGESTION CONTROL - manage the network's buffers.
     if you're using all the bandwidth, and someone else wants to share.

RTT estimation...

** lecture 9

Sliding window introduced and described in Peterson S2.5.2.
   Cumulative acks (as well as the alternates, "selective" and "negative")
Sliding window in TCP reintroduced in Peterson S5.2.4.
   Adds receiver's advertised window.
S5.2.5 extends with two means of preventing "tinygrams"
Then! connection setup 5.2.3.

Aside: Negative ack - receiver somehow figures out that a packet is missing
  and asks for that one to be retransmitted.

PA2 issues:
  select returns 0 -> no file descriptors were ready.
  -1 means error.
  positive means the number of file descriptors that are ready.

  if(FD_ISSET(socket_fd, &readfds)) {
  }
  /* style: don't use "else": select might give you more than one socket.
     independent if's to handle many being ready at once. */
  if(FD_ISSET(0, &readfds)) {
     // read from stdin.
  }

Flow control time.
  controlling the amount of information going into *receiver's buffers*.
  (contrast congestion control, which protects the network's buffers.)

  we would like to not have packets dropped at the receiver when the
  receiver has no more space to store out of order packets or data for
  slow applications.

  devices that can deal with data slowly (printers, audio thingys) will
  find some way to slow the sender down. (protect their buffers.)

  In TCP, this is handled by a field in the acknowledgement that expresses
  the size of the window. (the remaining space in the buffer allocated to
  this transfer.)
  offset from the ack sequence number to the right edge of the window.
.<- beginning of this connection's seq space --------|-------|-----|------ ******** received, given to the application ******** received in order, but the app hasn't read. ***** not yet received, or out of order ****** sender better not have sent. ^ sequence number we've acked. -----| ack will include both: a) the sequence number of the ack (th_ack) (^) b) the number of bytes after the ack that (-----|) the receiver is prepared to receive (th_wnd) TCP: the ack value is that of the next byte expected. (i.e., not yet received byte) (I don't really know why this decision was made.) If the receiver's application hasn't called read(), what happens as new, in-order packets arrive? a) advance th_ack -- the ack number to the next byte expected. b) keep the right edge from moving: subtract from the advertised window (th_wnd). Before: ^-----| ack 5 wnd 5 After : ^--| ack 8 wnd 2 Key: ^ - represents the ack sequence number. - - represents a byte in the advertised window. | - represents the right edge of the window. Sequence space in TCP applies to bytes. TCP is a bytestream-oriented transport protocol. If the sender sees "ack 8 wnd 2" Sender can send "8" and "9" After After: ^| ack 10 wnd 0 <--- closed window. How does the sender learn when to send. [[ push this on the stack ]] What happens when the receiver finally calls read()? Before : ^--| ack 8 wnd 2 After : ^----------| ack 8 wnd 10 If the application calls read, we get to send an ack to open the window. *(not precisely true) Let's receive out-of-order data "9", "10" "11" (not 8.) Before : ^----------| ack 8 wnd 10 (completely "empty") After 9: ^-=--------| ack 8 wnd 10 (one is used...) After10: ^-==-------| ack 8 wnd 10 After11: ^-===------| ack 8 wnd 10 NOT !!!: ^-===---| ack 8 wnd 7 Key: - sequence number available in the buffer. = sequence number we've received out-of-order The advertised window is a committment to provide the memory resources. Begin description of tinygram pathologies and fixes! 
Let's say the application is telnet.
  Before:      ^|    ack 10 wnd 0
  the telnet application might getchar.  read(socket, &c, 1);
  After:       ^-|   ack 10 wnd 1
  Sender gets the ack. Sends "10" (one byte only).
  At the dest: ^|    ack 11 wnd 0
  A bit later: ^-|   ack 11 wnd 1
  and the process repeats.

What happens if the receiver's window-opening ack is lost?
  Before: ^|            ack 10 wnd 0
  Then  : ^----------|  ack 10 wnd 10  *lost*

  The answer is *not* "receiver retransmits the ack".
  Because the sender is the responsible one. The sender knows if there's
  more to send. And there's no ack of an ack.

  Sender allowed to send a "window probe"
    After a timeout-sized period (seconds), sender can send the next byte expected. (#10)
    Identical to all other data packets *except* that its sequence number
    is at (or just over) the right edge of the window.
    Tries to figure out if the window has opened, since there's some
    reasonable chance that the window-opening acknowledgement was lost.

"Silly Window Syndrome"
  How might we avoid this?
  (-) could forbid the app from reading characters at a time.
      (not expected to work... those pesky application writers. can't trust them.)
  (a) receiver: send one of these window advertising acks only when there's
      a substantial growth in the advertised window.
      (i) substantial: a whole segment* (much larger than one byte)
          or a half of the buffer (might have made sense for very
          low-resource devices).
          * maximum segment size (mss) \approx MTU - header size.
  (b) sender: don't send into a very small window.
      (unless all remaining data to send fits.)
  (c) receiver: delay the acknowledgements: hope is that the application
      will have emptied the buffer by the time the ack goes out.
      (either ack every other packet, or wait ~200ms for the second.)

The book might make it appear as if Nagle's algorithm is a technique for
silly window syndrome avoidance. Don't be fooled.

** lecture 10

Midterm: 3/25.
SWS Review
Nagle
TCP header fields
Connection setup and the TCP state machine.
Connection trace with some flow control.
PA2:
   socklen_t addrlen = sizeof(struct sockaddr_in);
   bytes_read = recvfrom(s, buf, buflen, 0,
                         (struct sockaddr *)&sender_address, &addrlen);

   sender_address is in network byte order.
     sin_addr.s_addr is in network byte order.
     sin_port in network byte order. (ntohs)
   ...
   table[i].ip_addr = ntohl(sender_address.sin_addr.s_addr);
   table[i].udp_port = ntohs(sender_address.sin_port);
   ...
   struct in_addr a;
   a.s_addr = htonl(table[i].ip_addr);   /* back to network byte order for inet_ntoa */
   printf("%s, %d, ...", inet_ntoa(a), table[i].udp_port);

   inet_ntoa expects network byte order (in a struct in_addr), returns the string.
     (probably deprecated and bad since not thread safe).
   inet_addr("127.0.0.1") provides network byte order.

PA2: continue to check for protocol, version. don't compare the payload to anything.

Silly window syndrome review:
  Send many tinygrams because the receiver's window is only opened by
  one (or very few) byte(s) at a time.
  Avoid by:
    (a) don't send into tiny buffers,
    (b) don't advertise tiny buffers,
    (c) delay acks expecting the app to empty out the buffers.

Another way to get tinygrams, even with empty receiver buffer.
  If the sender produces one character at a time. My slow keystrokes!
  Or very lame applications that call
     for(i=1; i
                            (let's say this takes a while. long distance network)
     <----ack-----
     --- "ello" ->          can send this only after ack of "h"
                            (say, a while later... same network)
     <----ack-----

  if the bizarre code above (for(...) write(... 1))
     --- byte 1 --->
     --- 2:501 --->
     --- 502:1001 --->

  corner case / neat trick: on timeout, you could send bytes 1:499.

Times when you really don't want Nagle's algorithm delaying your transmissions:
  - real time!
  - "sending an emergency self-destruct command".
  - voip -- you kinda don't want to use TCP to begin with.
    it's okay to lose stuff, it's not really okay to delay stuff.
  - games -- might also use something non-TCP-ish.
  - "canonical example" is mouse movements in VNC, remote X.
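For the cases above where Nagle's algorithm hurts, the sockets API has a per-socket switch, TCP_NODELAY. A small sketch; the wrapper name disable_nagle is made up, the setsockopt option itself is standard.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_NODELAY */
#include <sys/socket.h>

/* Hypothetical wrapper: turn Nagle's algorithm off on a TCP socket so
 * small writes go out immediately instead of waiting for an ack.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
static int disable_nagle(int tcp_socket)
{
    int one = 1;
    return setsockopt(tcp_socket, IPPROTO_TCP, TCP_NODELAY,
                      &one, sizeof(one));
}
```

Call it once after socket() (or after accept() on the server side). It only makes sense on SOCK_STREAM sockets; the UDP sockets used in the PAs have no Nagle to disable.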
TCP Header Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                            TCP Header Format

refer to rfc793 for detail.

                              +---------+ ---------\      active OPEN
                              |  CLOSED |            \    -----------
                              +---------+<---------\   \   create TCB
                                |     ^              \   \  snd SYN
                   passive OPEN |     |   CLOSE        \   \
                   ------------ |     | ----------       \   \
                    create TCB  |     | delete TCB         \   \
                                V     |                      \   \
                              +---------+            CLOSE    |    \
                              |  LISTEN |          ---------- |     |
                              +---------+          delete TCB |     |
                   rcv SYN      |     |     SEND              |     |
                  -----------   |     |    -------            |     V
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------
   |                  x         |     |     snd ACK
   |                            V     V
   |  CLOSE                   +---------+
   | -------                  |  ESTAB  |
   | snd FIN                  +---------+
   | neil sez: close   CLOSE    |     |    rcv FIN
   V for write        -------   |     |    -------
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |
   | --------------   snd ACK   |                           ------- |
   V        x                   V                           snd FIN V
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |
   |  -------              x       V    ------------        x       V
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

                      TCP Connection State Diagram
                               Figure 6.

7.456042 IP 10.0.1.2.59655 > 10.0.1.1.6000: S 470725384:470725384(0) win 65535
7.457239 IP 10.0.1.1.6000 > 10.0.1.2.59655: S 475384637:475384637(0) ack 470725385 win 8192
7.457292 IP 10.0.1.2.59655 > 10.0.1.1.6000: . ack 1 win 65535
7.710674 IP 10.0.1.2.59655 > 10.0.1.1.6000: P 1:457(456) ack 1 win 65535
7.710763 IP 10.0.1.2.59655 > 10.0.1.1.6000: . 457:1905(1448) ack 1 win 65535
7.710798 IP 10.0.1.2.59655 > 10.0.1.1.6000: . 1905:3353(1448) ack 1 win 65535
7.710826 IP 10.0.1.2.59655 > 10.0.1.1.6000: . 3353:4801(1448) ack 1 win 65535
7.713372 IP 10.0.1.1.6000 > 10.0.1.2.59655: . ack 4801 win 41540

likely to describe some of the rest later.

** Lecture 11

PA3
RTT Estimation, Karn's (or Karn/Partridge) algorithm
RTO calculation
   ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z
Hubs, Bridges, and Routers.

PA3
  to rate-limit unicast transmission.
  "safety" - after all, we don't want you hosing any campus networks.
  ensure that no more than one packet (1000 bytes max) to a dest goes out
  in any 0.1 second.  <- service rate.
  burst queue of 10 packets.
  if I send 12 all at once, I expect at least one to be dropped.

  much like "print neighbor table" and "quit"
  main "new" command: "sendmsg %u %s"
     <- our network address (getpid <<..) followed by a string message.
  okay nevermind.... %s will end on spaces, maybe that's not good.

  fgets(buffer_from_stdin, sizeof(buffer_from_stdin), stdin)
sscanf(buffer_from_stdin, "sendmsg %u %n", &dst_addr, &msg_begins_at_index) // fscanf(stdin, "sendmsg %u %n", &dst_addr, &msg_begins_at_index) strcpy(message_data, &buffer_from_stdin[msg_begins_at_index]); //roughly // also figuring out the length.... use dst_addr to find the neighbor entry. -> gives us the ip and port. copy and paste.... sendmsg 11331 hello sendmsg 11331 how are you sendmsg 11331 what's happening PA2: you send on socket A to the multicast address. you receive on socket B the messages sent to the multicast address. PA3 adds: you receive on socket A the messages sent directly to you. for the per-neighbor rate limiting,... how? add a field to the neighbor structure saying when the last packet will have been sent. (tricky) When you're given the first packet for a destination, send it! it's away. When you're given the second packet for a destination, a) if it's been at least 0.1 seconds since packet #1, send. b) queue. when will it get sent? 0.1 seconds after the last sent unicast packet to that address. When you're given the third packet for a destination, a) if it's been 0.1 seconds since packet #2 was already sent, send. b) else queue, until 0.1 seconds after #2 was (or will have been) sent. You *can* maintain one queue per neighbor. You can also maintain just one big queue of events. :) to know when the last packet would have been sent, schedule the next packet for 0.1 after max(that last guy, now) which means, you don't have to grub around in every (even idle) neighbors to find the next packet to send and when it will get sent. Be careful that you do not give select a negative number of seconds to sleep. Error messages: "ERROR:NOBUFF" that is, don't block. "ERROR:NOROUTE" the guy is not my neighbor. print to stdout. select([ ], nil, nil) Back to the non-PA course content... 
RTT Estimation, Karn's (or Karn/Partridge) algorithm
RTO calculation
   ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z

TCP reliability stuff: RTT & RTO

Why track how long it takes to get a reply from the receiver? (The RTT)
  Will let us set the RTO (retransmission timeout.)
  If RTO >> RTT -> slow. if we lose a packet, it will take unnecessary
     time to retransmit it.
  If RTO \approx RTT -> might retransmit packets that don't need to be
     retransmitted.
  Initial RTO in TCP is 3 seconds.

Measuring RTT.
  Send a packet. Get the ack. Subtract the times!
  RTT can vary, especially if you choose a packet that's gonna be queued
  in the network for a while.

  what's the catch?
    could lose that packet. can still retransmit.
    if the ack for the original transmission arrives just after you sent
    the retransmission, your estimate of RTT will be way too low.
    VERY bad: when your RTT is large (larger than your estimate)
    (meaning your estimated RTT makes you retransmit early.), you decide
    that the RTT is way small. (because the ack of the original
    transmission ends the timer that the retransmission started.)

  Karn/Partridge algorithm - don't take RTT samples while retransmitting.
    correct answer: "don't do that"

RTO estimation: combine "mean" RTT and deviation to find a good RTO
  (good means longer than most all RTT's but not by much).

  Calculating the "mean" uses "stochastic gradient"
     mean = (1-\alpha) mean + \alpha * sample

  m := measured value
  a := average RTT estimate
  v := mean deviation (stands in for the variance)

  no divide. integer arithmetic only. shifts are okay.
  sa := "scaled" average RTT estimate (multiplied by 8)
  sv := "scaled" deviation estimate (multiplied by 4)

  // gets us the average
  m -= (sa >> 3)        // m is now the "error"
  sa += m               // added 1/8 the error to a
  // gets us the deviation.
  if (m<0) m=-m;        // error = abs(error)
  m -= (sv >> 2)        // what's the difference between the
                        // error and the expected error
  sv += m               // added 1/4 of that difference to v.
  // finally
  rto = (sa >> 3) + sv

For next time: Read S 3.2; we should start bridges next time.

** lecture 12

Ruby fragments for PA3.
TCP states review in system calls.
Addressing.

3 tricks for Ruby in PA3
  1. fool the makefile
  2. fool the compiler warning test
  3. setsockopt for multicast.

# so that "make" just runs.
# so that executable bits get set.
three: three.rb
	cp three.rb three
	chmod +x ./three

three.c includes:
  #include <stdlib.h>
  int main(int argc, char *argv[]) { exit(0); }

===============
#!/usr/bin/ruby
# Other notes: remind neil to check version ruby on submit server
# IPAddr not necessarily present.
require 'socket'

addr = '224.0.50.112'
# 4 bytes of multicast address followed by 4 zero bytes (INADDR_ANY)
mreq = (addr.split('.').map { |octet| octet.to_i } + [ 0 ] * 4 ).pack('C*')
raise "oops" unless mreq.length == 8

multicast_socket = UDPSocket.new
# use the literal 12 in place of the constant if Socket::IP_ADD_MEMBERSHIP is missing
multicast_socket.setsockopt(Socket::IPPROTO_IP, Socket::IP_ADD_MEMBERSHIP, mreq)

readable, dummy, dummy = select([$stdin, multicast_socket], nil, nil, 0)
# select returns nil when the timeout expires with nothing ready
(readable || []).each { |rdfd|
  case rdfd
  when $stdin
    puts "read from stdin: %s" % $stdin.gets
  when multicast_socket
    puts "got a packet"
  else
    raise "argh"
  end
}
__END__

TCP States and system calls.

to create a TCP socket:
   call socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)
   could possibly call bind before the connect (though I don't know quite
   why. "getting past stupid firewalls" suggested.)

to create a client (active open):
   call connect(s, destination_addr/port, addr_len)
   call send, write,
   call read, recv, select,
   to close the connection: close(s)
   if we are done writing but not reading:
      shutdown(s, SHUT_WR);   // generates the fin.
                              // it would be nice if web clients used this.
   if we are done reading but not writing:
      shutdown(s, SHUT_RD);

to create a server (passive open):
   bind(s, (the port we want to claim "80"), addr_len)
   listen(s, 5);   // parameter is the backlog of not-yet-accepted connections.
   can also call select on s (in rdfds) to tell if there's a connection waiting.
connection_socket = accept(s); // will block until there's a new connection to be accepted. after the accept() there are two sockets: one that is the new conversation, and one that is still listening. read(connection_socket, ....) write(connection_socket, ....) eventually close(s); never read(s, ...); write(s, ...); shutdown or close(connection_socket); Design of Apache: create the accepting socket. fork 8 times. each forked child process would call accept. the kernel would give a new socket to one of the children. *********** Addressing: Why do packets, datagrams, etc. need addresses? A) For routing. If I send a packet with an address of a specific destination, I want the network to carry it there (and nowhere else) <-- IP addresses are like this. wikipedia NickServ for a nice little story. B) For the recipient to decide if they're interested in the packet. <-- ISA bus, or "Classroom" Ethernet. All the interfaces receive the packet, but only those addressed wake up and pass it along. To send an IP datagram across an Ethernet, what destination Ethernet (MAC) address does the sender put in the frame? a) if the destination is on the same subnet (i.e., is nearby, on your Ethernet segment), want: the destination ethernet address. b) if the IP destination is on a different subnet, (i.e., is far away), want: the ethernet address of the gateway. That is, IP will tell us what the next hop across the ethernet is. *) For example, destination IP address is 128.8.128.8, .... [ Ethernet header: src mac address <- the "hardware" address used by the data-link layer underneath (in this case, IP) dst mac address protocol field (ip) [ IP header: .... src IP address dst IP address [ data ] ] ] Goal: find the ethernet address corresponding to an IP address. Scheme: ARP (address resolution protocol). Two messages: "who-has", "is-at". How/who do we ask "who has (what is) the ethernet address corresponding to IP address 128.8.128.8?" ... ask everyone. 
Request:
[ Ethernet frame:
    src mac: my address.
    dst mac: broadcast. ff:ff:ff:ff:ff:ff
    protocol: ARP (there's a different number for ARP than for IP)
    [ ARP message:
        who-has
        ip: 128.8.128.8
        ether:
    ]
]

Response:
[ Ethernet frame:
    src mac: that guy's address (hooray?)
    dst mac: my address.
    protocol: ARP
    [ ARP message:
        is-at
        ip: 128.8.128.8
        ether: 00:17:c1:00:84:23
    ]
]

From these responses, we build an "ARP table" that maps IP addresses to
ethernet addresses. Run 'arp -na'

Alternative to setting up IP routing properly: Proxy ARP.
  Two ways:
  1) have a machine send arp responses for an additional IP address that
     is actually someplace else. (modem attached, in vmware)
     machines don't complain when more than one IP maps to the same
     ethernet address.
  2) have a router answer arp for every IP address in the universe
     (or at least not on your segment)

** lecture 13

Review: ARP, Ethernet addresses, IP addresses
PA3 things? or are you done? of course you're done...
Hubs, Bridges, Switches.

C refresher quiz
  #include <stdio.h>
  int main(int argc, char *argv[]) {
     int i = 10000;
     printf("%d\n", i * 0.001);   /* the argument is a double, but %d expects an int */
     printf("%d\n", 0.001 * i);
     /* be afraid, be very afraid mixing floats and ints. */
  }

when sending a unicast to a destination neighbor, include that neighbor's
integer address in the message header.

Ruby's pack and unpack will take care of the byteswap if you use the
right characters.

No secret tests. (or release tests.)

ARP review:
What does an ethernet address look like?
  6 bytes, usually formatted 00:00:16:28:AE:23
  Structure of an ethernet address:
    first three bytes - allocated centrally, given to a manufacturer.
    last three bytes - expected to be unique, since they're allocated by
                       the manufacturer.
  Broadcast: ff:ff:ff:ff:ff:ff
  Despite the two-level hierarchy in assignment of the address, these
  addresses are "flat" -- would need an entry for every device address
  if you wanted to reach it.

What does an IP address look like?
4 bytes, usually formatted dotted decimal: 128.8.128.8 Structure of an IP address: There are "classes", classes are identified by the first few bits, and class-"ful" addresses are largely deprecated. New scheme: explicit prefixes with lengths: the number of bits in the "network" part of the address is given with the address. 128.8/16 <- represents all the machines that share those first 16 bits. 12/8 is shorthand for 12.0.0.0/8 in which only the first eight bits are significant. subnet mask is a slightly more flexible (but useless and verbose) way of getting the same thing. 12.0.0.0/255.0.0.0 is another way of expressing 12/8. Imagine the router at the boundary of Computer Science Dept. Knows about: 128.8.126/24 <- one of the subnets (prefix) allocated by UMD to CS. 128.8/16 <- rest of maryland These rules (prefixes) overlap! How do we choose the right one: longest prefix match. 128.8.200/24 <- a subnet (prefix) allocated by UMD to Chemistry. Purpose of ARP: for an IP on your subnet, get the ethernet address. I drew boxes on the board. Each of these boxes was assumed to carry traffic... to direct it smartly. It turns out, that's not always possible... or how it works. - we've already seen the "classroom ethernet" broadcast scheme. 10 Mbit hubs... - make large ethernet segments using "hubs" which is a euphemism for a repeater (with more than one port). repeater means "signal" repeater. - "dumb". could cause errors... would repeat broken packets. - wasteful or not scalable. (more stations -> more traffic -> effectively less capacity.) - aside: a somewhat interesting catch: there's a limit: you can only have 4. okay: ---segment---[hub]----[ ]----[ ]----[ ]------- not : ---[ ]----[ ]----[ ]----[ ]----[ ]----- if you have 5 or 6, collisions can go unnoticed. - other damage: [ ]--->[ ] \ / \ / [ ] signal will go around and around and around. there's a way to add a wire and break your network. rather poor interface. Hub / repeater is a physical-layer device. 
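For the longest-prefix-match rule in the CS department router example above, a minimal sketch with a linear scan. The struct layout, the function names, and the next-hop labels are assumptions for illustration; real routers use tries or hardware tables, not a loop like this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry; addresses are kept in host byte order. */
struct route {
    uint32_t    prefix;    /* e.g. 0x80087E00 for 128.8.126.0 */
    int         length;    /* number of significant leading bits, 0..32 */
    const char *next_hop;  /* illustrative label only */
};

/* Netmask with `length` leading one bits (/0 handled separately,
 * because shifting a 32-bit value by 32 is undefined in C). */
static uint32_t mask_for(int length)
{
    return length == 0 ? 0 : (uint32_t)0xFFFFFFFFu << (32 - length);
}

/* Return the matching entry with the longest prefix, or NULL. */
static const struct route *lookup(const struct route *table, size_t n,
                                  uint32_t dst)
{
    const struct route *best = NULL;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = mask_for(table[i].length);
        if ((dst & mask) == (table[i].prefix & mask) &&
            (best == NULL || table[i].length > best->length))
            best = &table[i];
    }
    return best;
}
```

With entries for 128.8.126/24 and 128.8/16, a packet for 128.8.126.5 matches both and takes the /24; a packet for 128.8.200.1 matches only the /16.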
Next step: "learning" bridge, data-link layer device. data-link layer: frames. also are going to have ethernet addresses. segment A -------{ bridge }------- segment B bridge to read frames off of one segment and: a) leave them there. b) copy them to the other segment. How does the bridge decide? consult a table. Forwarding table. (switch forwarding table) How does the bridge build the table? Take the source MAC address of any frame you see on the segment. Ex. See frame from Alice on segment A, add to table: segment A -------{ bridge }------- segment B Destination | Segment Alice | A When we see a frame for alice: if from segment B, forward to segment A if from segment A, do nothing. (update the forwarding table.) What happens if we "accidentally" create a loop? - [ ]--->[ ] \ / \ / [ ] Frames can still spin... Aside: why can IP packets not spin forever? TTL. do ethernet frames have TTL? no. Solution: a) don't do that. b) spanning tree protocol. Developed by Radia Perlman. basic scheme: find links to disable so that the effective topology is a tree. Spanning tree protocol: 1) Elect a "root": the root is the bridge with the lowest MAC address. Each bridge "claims" to be the root, the one with the most legitimate claim to the root (lowest address) stays the root. 2) Root sends out a message, all the bridges disable the ports that they don't need along the shortest path to the root. a) not the bridge that gets disabled. b) not the segment that gets disabled. c) it's the port not along the shortest path to the root. ** Lecture 14; midterm review. ** Midterm review sheet. PA3 We talked forwarding over bridges last time. Forwarding <- (optional: looking up in some table) and then dumping the packet on the next link toward the destination Routing <- setting up that table for potentially every destination possible. Involves sending messages. 
Can distinguish a "routing table" from a "forwarding table" Routing table might add "metric", "path" (diagram of router host processor vs. interfaces) Goal: the best path. "Best" means: - lowest hop count - lowest latency (least time). - least overloaded (less likely to have packets dropped) - cheapest (will most likely apply later) - highest bandwidth Could start with hop count, and then increase the "metric" on any overloaded link. Routing "updates" each router will sometimes "advertise" the destinations it can reach. Periodic updates = Say, once a minute, send a message to all neighbors describing the destinations that this router knows how to reach. - directly attached networks. - networks learned about from neighboring routers. Triggered updates = When you learn something new, propagate it quickly. New destination. Cheaper route. Goal is to speed convergence. Routing loops: can occur when routers have inconsistent information about where to go. Convergence: all the routers have a consistent view of the network. - as a result, no loops. Two main levels of routing: intra-domain routing (Interior Gateway Protocol) -routing within an organization (just your IP prefixes) -fun, easily understood, not complete hack protocols inter-domain routing (Exterior Gateway Protocol) -"BGP" routing across organizations (all IP prefixes) Two classes of routing algorithm: Link state - Updates consist of fragments of the topology. the links incident on a router. From these fragments, assemble a Link State Database that is the assembled puzzle. From that graph, Dijkstra's algorithm finds the shortest path to all destinations. OSPF (also IS/IS) are implementations. Distance vector - Updates consist of (partially limited) views of the routing table. (neighbor, metric) pairs. Distance Vector. Criteria for accepting a route: * If the new route is better, take it (obvious one) - "no route" is worse than any route * If the new route comes from the chosen next hop. 
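The two acceptance criteria above, as a sketch for a single destination. Everything here is illustrative: the struct, the function name, and the use of integer neighbor ids are assumptions, and the cap of 16 is RIP's "infinity".

```c
#include <stdbool.h>

#define DV_INFINITY 16   /* RIP's "infinity"; also stands for "no route" */

/* Hypothetical routing-table entry for one destination. */
struct dv_route {
    int next_hop;   /* id of the neighbor currently used, -1 if none */
    int metric;     /* DV_INFINITY when there is no route */
};

/* Process one advertisement for this destination from `neighbor`.
 * `advertised` is the neighbor's metric, `link_cost` the cost to reach
 * that neighbor. Returns true if the entry changed (a reason to send a
 * triggered update). */
static bool dv_update(struct dv_route *r, int neighbor,
                      int advertised, int link_cost)
{
    int metric = advertised + link_cost;
    if (metric > DV_INFINITY)
        metric = DV_INFINITY;

    /* Rule 1: strictly better route. Rule 2: news from the next hop we
     * already use, which must be believed even when it is worse. */
    if (metric < r->metric || neighbor == r->next_hop) {
        bool changed = (metric != r->metric || neighbor != r->next_hop);
        r->metric = metric;
        r->next_hop = neighbor;
        return changed;
    }
    return false;
}
```

Rule 2 is what lets bad news propagate: if the chosen next hop raises its cost (or withdraws the route by advertising infinity), the entry follows it rather than holding on to a stale metric.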
A <--> B <--> C example in which B -> C failure (and some bad luck)
causes a routing loop between A and B.

Obvious hack: "split horizon"
  Split horizon: Don't advertise routes back to the router you got them from.
  (of dubious value, "poison reverse" is: advertise them at cost infinity.)

Infinity = 16.
Count-to-infinity problem: routers dealing with a lost link may
  sometimes count to infinity.
  Can't have a network with diameter > 15 (and expect everybody to be able
  to reach everybody else).

Alternate scheme: "path vector" (in BGP)
  each router (entity) ensures that it's not already on the path.

Distance vector as described here is roughly "RIP", the Routing
Information Protocol.

** Lecture 15
(there was a midterm there.)

Pre-midterm, pre-break review.
  RIP, infinity, split horizon.

The next Project assignment.
  bencoding libraries are fair to steal (incl rubytorrent/bencode.rb)...
  other bittorrent client bits not fair to steal...
  gilligan's island rule over other implementation source code is fair.
  (absolutely no copy and paste).

  tcpdump (packet trace) of an ordinary client.
  - I will probably assign an exercise where, given a tcpdump-formatted
    output file, you'll annotate the messages with what they're for and
    what they contain.
  - a homework where you might write some code (merge pcap library with
    your bencoding library).

Back to routing:
distance vector routing:
  triggered update: learned something new, we spread it around quickly
  periodic update: nothing necessarily new, refresh it all the time.
  route withdrawal: lost the link/route, advertisement with infinity as the cost.
  infinity: 16 in RIP
  count to infinity: unlucky timing / loss causes routing loops
     resolved by each router incrementing toward infinity.
  split horizon: don't advertise a route to where it comes from.
  convergence: everybody has a consistent view of the routes (no loops).

  two reasons to accept a distance vector route:
  - cost is less than what you already know
  - it is advertised from the neighbor you're using anyway.
link state routing:
  link state packets (LSP): describe fragments of the network.
    I believe with TTL=1 so never forwarded.
  link state database (stored in): all the current LSPs.
  reliable flooding (propagated by):
    a) make sure everyone gets it
    b) try not to let an LSP traverse any link more than once.
    1) sequence numbers
      - accept and propagate if seq. num. is newer than what the node has in the LSDB
      - what if the sequence number is the same (already known)?  do nothing.
      - what if the sequence number is old?  send em the new information.
    2) acknowledgements

  what do you do when the sequence number wraps?  finite 32 bit field...
    static inline int before(__u32 seq1, __u32 seq2)
    {
            return (__s32)(seq1-seq2) < 0;
    }
    what that means is that "a < b" is decided by the sign of the 32-bit difference,
    so a number just past the wrap still counts as newer than one just before it
    (as long as the two are less than 2^31 apart).
  what happens if other routers have a "newer" version of A's LSP than A does currently?

  LSDB can grow very large.
    Sprint, et al., will typically run OSPF (new) or IS-IS (older) - very large networks
    OSPF Areas add a level of hierarchy so that parts of the network are summarized.
      - Hide changes to individual links.
      - Summary LSPs for each area are propagated instead of all the LSPs for routers inside the area.

  Know Dijkstra's algorithm.

Last routing protocol.
  OSPF and RIP are IGPs: Interior Gateway Protocols.
    everyone is cooperative, can centrally administer everything.
  Another class: Exterior Gateway Protocol.
    no central admin, maybe a little cooperation, no "metrics".
    Routing in the Internet is to make money.
  BGP is the "Border Gateway Protocol", the EGP of the internet.

You are a small ISP.  2,000 residential customers, 10 business customers.
  How many providers do we want?
    > 1.  One might be vulnerable.
    < 4.  Diminishing returns.
  When presented with a packet, which way should it go?
    Toward provider?  (which)
    Toward customer?  (which)
    What does it depend on?

** Lecture 16
Midterm stats: 32-74, mean 59.
Midterm Review
  q# 3, 5, 15, 18, 20, 23, 25, 26, 30, 31, 32, 38
  # 9 went poof with #10.
Grading review; April 11 is the last day to drop with W.
  - percentages for homeworks and projects are not precise.
BitTorrent hints
RouteViews dump
BGP: early-exit, localpref, MEDs, prepending, import filtering, export filtering,
     stub, transit, multi-homed, MAE, default-free
  maybe: bad-widget, route flap damping,
  http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094431.shtml
Maybe begin the "sharing" technologies.

BitTorrent
  updated handout online.
  - Single file torrents sufficient (multi-file if you like).
  - Resume download required (it's a good test to see if you can verify the complete file).
  - Easy to crash mainline client as seeder; to test, run a seeder of your own.
    * request too large, request off the end of the file...
  - If you see no peers, info hash may be wrong or seeder may have crashed.
  - Pieces in a test torrent are 32K or larger, max request size (block) is 16K.
  - May be sloppy about requesting blocks more than once.
  - May choose pieces randomly, sequentially, rarest-first, or by any other strategy.
  - I was able to download a test torrent from campus to my dsl.
  - Use seek().
  - Works on campus wireless.

BGP:
  Autonomous System (AS) - Participant in BGP.
    Each autonomous system originates many address prefixes.
    More than one (but typically very few) autonomous systems may originate the same prefix.
  AS Number - unique identifier.
    Each organization typically has only one AS number; some may have more than one
    for different areas (US or Europe) or markets (commercial or defense)
  AS prepending - attempt to increase the length of the path to discourage its use
    (inbound; outbound routes can be chosen explicitly).
  Tier-1's - top level of AS's (providers) that pay no one for service.
    209 (Qwest)  174 (Cogent)
  default-free - every BGP router has an explicit route to every destination.
  AS7007 incident - A small ISP customer of Sprint took all its routes, and re-advertised them
    as if each prefix was one that ISP originated (owned).
    - sucked traffic toward a tiny ISP that had no business getting that traffic.
    - motivates "import filtering" - not accepting a route from a neighbor that shouldn't advertise that route.

-- after class notes:
  CGI.encode okay not to do yourself.
  torrent s/url/filename/  (won't save you much).

** Lecture 17
Book chapters of the foreseeable future.
  4.1-3
  2.6, 2.8, 5.2, 6.1-4, 9.1.3
    (likely more of chapter 9; 9.4.2 describes some bittorrent)
  8.1 (likely more of chapter 8)

more bittorrent
  CGI.encode okay not to do yourself.
    at some point, you'll have to take a 20-byte binary value and convert it
    into a string that looks like %FFh%
  SHA1 don't implement yourself.
    Each "piece" is identified by its SHA1 hash.
    secure hash is a function from ( lots of bytes ) -> small digest that is hard to invert.
  torrent s/url/filename/  (won't save you much).

BGP vocabulary review
  AS
  AS number
  transit AS - provides "transit", where transit implies advertising non-customer prefixes.
  stub AS - has only one uplink connection (Peterson textbook terminology)
  multi-homed AS - has more than one provider (transit AS).
  service level agreement - governs the terms of a peering (who pays what, how much b/w is there, etc.)
  provider - up the hierarchy, you pay the provider (not really a hierarchy, but it's a nice picture)
  customer - pays you for service.
  peer - (1) neither customer, nor provider, but still connected.
             exchanging traffic (likely) without cost.
             (certainly not a transit provider: doesn't advertise non-customer prefixes)
         (2) generic: any adjacency in BGP (two connected ASes)
  advertisement - "I can reach this prefix via this path"
    BGP must have paths.  path of length 1
  withdrawal - "I can no longer reach this prefix"
  tier-1 ISP - pay no one.  at the top.
    connected to other tier-1's.
  Exchange points, metropolitan area exchanges, telco hotels - locations where many ISPs
    have routers, connect to each other.
  default-free - the routing table lacks a 0/0 catch-all route.

Less reviewish: BGP mechanisms for path selection.
  export filtering - not advertising prefixes learned from providers or peers to other providers or peers.
  import filtering - don't accept an advertisement if he shouldn't be advertising it (AS7007)
  localpref - assign a "preference" value to a learned route:
    routes from customers have higher localpref, from providers lower localpref.
    is "local" - not propagated to, or known by, other ASes.
  Multi-exit-discriminators (MEDs) - allow a peer (generic use of peer) to export a preference
    for which peering point should get a prefix's traffic.
    - downstream likely pays for this.
    - symmetric; happens, but not sensible.
      - you think the other guy can't run a network.
  internal path cost factors in -- choose the lowest cost exit point to achieve the early exit.
  Rule - you only advertise the path you're using.

Could use traceroute -A to see the autonomous system numbers on a path to a destination.

Unit of Sharing: CSMA, etc.
  Radio spectrum, time on a wire (ethernet segment).
  Fairness, Efficiency, Simplicity/cost.
  Spans data-link layer and some network and transport layer stuff.

  CDMA - Code division multiple access.
    - everyone can transmit at the same time, and the individual transmissions
      can be recovered through math.
  TDMA - Time - everyone is assigned a slot; in that slot, you may transmit.
  FDMA - Frequency - slice into channels.
    (GSM combines TDMA and FDMA)
  Token-ring - pass the token; while holding the token, you can send.
    - have to make sure that there's always a token.
  DQDB - ask for a slot, and then eventually get it.
  802.11 carrier sense - wait for the medium to appear idle, just sends.
  Ethernet - we'll discuss the differences.
  Aloha - ALOHAnet on wikipedia.

** lecture 18
BT; BT1 project on the submit server.
  currently in upload-only mode (no tests).
  try to handle non-existent peers:
    connect times out (long time)
    connection refused.

Neat question RE BGP
  - ~ 130,000 globally announced prefixes.
    2^32 addresses are represented by these aggregates.
    advertisements for prefixes smaller than (with longer prefix length than) /25
    (plus or minus one) are discarded.
  - potential for "route aggregation"
      128.8/16
      128.9/16
    both going the same direction, through the same ISPs,
    you could reformulate that to something else:
      128.8/15
  - HOWEVER: if someone else saw routes for 128.8/16 and 128.8/15,
    the recipient would choose the more specific route.
  - This leads to a trick when selectively advertising more specific routes,
    though neil can't remember it.  doh.

Ethernet-type sharing.
  Classroom ethernet
    -- thick coaxial cable,
    -- 500m segments with up to four repeaters for 2500m range.
    -- manchester encoding
    -- all of these things are no longer used....
    -- terminators.  (make the wire appear to have infinite length)
    -- addressing remains.
  Collision domain.
    -- all the hosts that, if they transmitted at the same time, would have their transmissions collide.
    -- transmissions that collide are not recovered.
    -- Bridge (or router, or anything smarter than a hub) can divide a collision domain.

Ethernet for the purposes of the "sharing" part of the class =
  1-persistent CSMA/CD with binary exponential backoff
    CS = Carrier Sense.
    MA = Multiple access (CDMA, FDMA, etc.)
    CD = Collision Detection (also, compact disc)
  What's carrier sense?
    Attempt to avoid collisions by: don't send if you hear someone else on the wire
    when you want to send.  Wait until the medium is idle.
  What's collision detection?
    Each station has the ability to recognize when a transmission they sent
    collided with another transmission.
    After detecting a collision, the station will "jam" the network:
    an extra signal to ensure everyone knows there was a collision.
Binary exponential backoff
  after collision, each station flips an independent coin, and decides whether to wait
  0 or 1 x 51.2 microseconds.
    with 1/2 probability, we're done (one station will transmit immediately, and the other will wait)
    with the other half, both collide again.
  roll a d4 (0..3) x 51.2
    with 3/4 probability, we're done.  (they won't wait the same number of slots.)

  Why 51.2?  round trip delay across the largest possible ethernet
    (512 bit times at 10 Mb/s).
    about 1.2 ms to send a full size (1500-byte) Ethernet frame at 10 Mb/s.
  Ethernet has no acks.  why?
    (a) not defined to be reliable.
    (b) pretty reliable anyway.
  Minimum frame size of 512 bits.
    Goal: ensure that the sender can detect any collision.
  Why binary exponential backoff:
  What does 1-persistent mean?
    if the medium is idle (and you have something to send) send without delay.

** lecture 19
Handouts: CSMA state machine types (Keshav), 802.11 mac scheme (Tanenbaum)
BitTorrent hints:
  Little fink tracker busted. :(
    stupid azureus either lets anyone post to the tracker or doesn't work (lame).
  SHA1
    sha1 implementations require proper byte ordering for C.
    implementation under ruby is by Steve Reid, may need to define a preprocessor variable.
    If you have the wrong hash (for Jay, it should start FC88), the tracker will give you no peers.
    Can test sha1 by hashing "abc", looking for something starting A9.
  Tracker (if implementing your own HTTP client, which is easy enough you can do):
    GET /announce?compact=1&downloaded=0&... HTTP/1.0
    Host: scriptroute.cs.umd.edu:11417
  Peer wire protocol
    Must send "interested" message before expecting to be unchoked. (I think)
    Must be unchoked before sending a request. (I think)
    Works *much* better if you can buffer partial messages:
      C: should work simply with read and select, as long as a partial read goes into a buffer.
      Ruby: read_nonblock, i.e.:
        @connection_socket.read_nonblock(@pending_message_length - @pending_message_body.length - 1)
        also: connect_nonblock( sockaddr ) is very neat.
      Other languages: I don't care.
:)
  Seeding (next milestone):
    Building a type 5 bitfield:
      [ @pieces.map { |p| p.complete? ? '1' : '0' }.join ].pack("B*")
    Socket.do_not_reverse_lookup = true

Ethernet review - Hub vs bridge forwarding delay
  Wires break.
  Evolution from single bus coaxial style network to a hub/switched network.

Wireless!
  RF acts much like a wire: noise, attenuation, delay.
  extra problems (or tricks): directional antennas,
    multipath signals (bouncing off walls, mountains, one of those spheres)
  Hidden station (hidden terminal)
  Exposed station (exposed terminal)
  Fig. 4.26 in handout shows both.

  Hidden terminal scheme: three nodes.
    A is talking to B,
    C (which might also want to talk to B) won't hear A's transmission.
    C sends.  Transmission clobbered.
    - carrier sense is not enough
    - C, A can't necessarily detect a collision.
    - likely really difficult to detect a collision in a wireless network.
      (while you're transmitting, that is).
  Exposed terminal: Let's use a different diagram.
      A - B - C - D
    B wants to transmit to A, C wants to transmit to D.
    C hears B and defers, even though C's transmission would not interfere at A.
    If a node is in a really good place, has a really great antenna,
    is at the top of a mountain, it might never send.
    counterintuitive!  fun!

  MACA (variant MACAW, I think, and I will be confused.)
    RTS/CTS.  request, clear to send
      A -> B "RTS"
      B -> A "CTS"
      A -> B "Frame"
      B -> A "ACK"

** Lecture 20
BitTorrent part two -- rate-limited seeding.
  (not that hard... unless the downloader-part was minimal.)
  if two clients from the same ip talk to you about the same file, "okay" to drop one.
  (drop the older.)
802.11 interframe spacing
CSMA review
need new ruby on nauseated for happy nonblocking calls
  I'll bug staff, or you may be able to install in your homedir.

In wireless transmission, collisions occur at the receiver
  (potentially without the senders having a clue)
RTS/CTS provides "virtual" carrier sense.
In 802.11 (CSMA/CA) there are acks.
  In ethernet, not needed because typically pretty close to reliable,
  and any collisions would be detected by the senders.
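The claim that wireless collisions happen at the receiver, without the senders noticing, can be checked with a toy reachability model. This is only a sketch: stations sit on a line and every radio has the same fixed range, both of which are assumptions, and the function names are made up here.

```python
RANGE = 1.0  # assumption: each radio reaches only its immediate neighbors

positions = {"A": 0.0, "B": 1.0, "C": 2.0}   # A - B - C on a line


def hears(listener, sender):
    """True if listener is within radio range of sender."""
    return listener != sender and abs(positions[listener] - positions[sender]) <= RANGE


def medium_busy(station, transmitting):
    """Carrier sense: does this station hear any current transmitter?"""
    return any(hears(station, t) for t in transmitting)


def collision_at(receiver, transmitting):
    """Two or more overlapping transmissions at the receiver clobber each other."""
    return sum(hears(receiver, t) for t in transmitting) >= 2


print(medium_busy("C", {"A"}))         # False: C cannot hear A, so C sends too
print(collision_at("B", {"A", "C"}))   # True: both signals overlap at B
print(medium_busy("A", {"C"}))         # False: A cannot tell anything went wrong
```

Only B sees the overlap, which is why B's CTS (heard by both A and C) is what does the useful work in RTS/CTS.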
Three inter-frame spacings of note (tanenbaum handout)
  (ignore (to the extent you can) "fragment bursts")
  Short IFS - after transmission, leave enough time for the transmitter to switch
    to receive mode to receive the ack.
  DCF IFS (DIFS) - after someone else's transmission (ack), how long to wait
    before I (another station) can send.
  PCF IFS (PIFS) - if you're the access point / base station and have some control traffic
    to send, then you get a slightly shorter IFS before the transmission.

  ability to loosely prioritize traffic by altering how long each station has to wait
  during a contention.
  there are *other* instances of devices on a network that need a little help
  (that are more equal than others):
    a file server, a web server, or any other sort of server.
    each client station has a request; the server then has response traffic
    contending with the requests.

Keshav figure review!

Starting promotion to TCP-land (transition!)
  Collisions - more than one transmission is on the medium at the same time.  bits go poof.
  Contention - more than one transmission wants to be on the wire, somebody has to wait.
  Congestion - (higher layer idea) packets are delayed or dropped because queues are filling up.

** lecture 21
BT grading
  2 points for something, anything.
  6 points for working, file intact.
  2 points for working the first time, pretty quickly.

  2 0 0 - if I couldn't run your code easily (typical with the java turnins, sorry.)
          those zeroes will turn into something less zero-ish as soon as you show me
          how to run or extract that version of code.
  2 6 0 - if it took another run or two.
  2 4 0 - if minor corruption.
  Percentages should be dwarfed (3:1) by the final (the whole thing) project grade.
  PLEASE include a partners file.
  If your turnin did not include a partners file, please talk to me now.
  (otherwise, it looks the same as theft.)
Ruby:
  resp, data = tracker_connection.request_get(tracker_url.path + "?compact
  if you use:
  resp = ...get ; resp[1]
  it triggers (in ruby 1.8.6) a backwards-compatibility mode that doesn't quite work right.

How to write a server in C...
  Issue is that the assignments 1-3 used multicast socket, so you could act much like a server
  with just one (or two) sockets.
  In the downloader, you might have gotten away with just one open socket.
  In the server, not so much.

  s = socket()
  bind(s, ... )   <- much like the multicast thing.
    # could choose a port 6998 (6881)
    # if 6881 is busy / used, try something else (that means for loop)
  listen(s, 5)    <- function that tells the operating system to receive connection requests
                     from others.
    ( 5 is the "queue" or backlog of incoming connections that the OS should buffer for you ).
    ( purely local ).
    I don't know what happens when the backlog fills.
  include s into the rd_fds in select; it will be set when there's a connection waiting....
  socket_to_a_client = accept(s, (struct sockaddr *)&peer_address, &sockaddr_len);
    ( sockaddr_len is a socklen_t that starts out as sizeof(peer_address). )
  ruby's scheme:
    socket, packed_address = s.accept
    @port, @ip = Socket.unpack_sockaddr_in(packed_address)
  now you have a socket representing a tcp connection in the established state.
  acts just exactly like a connection that you started using connect().

TCP congestion control
  Sec 6.3 in the text.  I'll likely have slightly more detail in the notes,
  and slightly different terminology.
  Avoid:
    capture (one station gets to transmit to the exclusion of others)
    unfairness (for some definition of fairness)
    pathological performance (high delay, lots of retransmissions)
  Additive Increase, Multiplicative Decrease.
    at each step, decision, a node tries sending just some small amount more, or a bit faster,
    and if that fails (losing packets), back off by half.
    (send half as fast, reduce the window to half its old size.)
    this definition is purposely loose, because there are a few implementation differences.
    the decisions aren't really clocked by RTT.
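The loose definition of additive increase, multiplicative decrease above is enough to produce the familiar sawtooth. A minimal sketch, assuming a single sender, a fixed link capacity, and that any window above capacity is noticed as a loss at the next step; the function name and parameters are made up for illustration and this is not how a real TCP is clocked.

```python
def aimd(capacity, steps, increase=1.0, start=1.0):
    """Return the window after each step for one sender on a link of given capacity."""
    window = start
    history = []
    for _ in range(steps):
        if window > capacity:          # assumption: overshooting capacity means a loss
            window = window / 2        # multiplicative decrease: back off by half
        else:
            window = window + increase # additive increase: try a little more
        history.append(window)
    return history


print(aimd(capacity=8, steps=12))
# [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 4.5, 5.5, 6.5, 7.5]
```

The window creeps up one unit per step, overshoots the capacity once, is halved, and starts climbing again.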
Adjust **** cwnd ****
  the amount of data that can be outstanding in the network at a time is the minimum of:
    what the sender has to send
    what the receiver advertised as buffer space
    cwnd
  Two phases for the adjustment of cwnd.
  **** ssthresh **** is the slow start threshold.
    ssthresh is initialized to the sender's buffer size (practically infinity)
  1) cwnd < ssthresh.   on every ack of new data, increase cwnd by one segment.
                        (cwnd roughly doubles each round trip.)
  2) cwnd >= ssthresh.  on every ack of new data, increase cwnd by 1/cwnd segments.
                        (cwnd grows by roughly one segment each round trip.)
  Can stop this increasing game when:
    1) done completely
    2) hit another limit (buffer size)
    3) see a packet loss.
  If the loss is detected by timeout:
    1) ssthresh = cwnd / 2
    2) cwnd = 1
  If the loss is detected by duplicate ack:
    1) ssthresh = cwnd / 2
    2) cwnd = ssthresh
  Intuition is that slow start creates an "ack clock":
    acks come back from the receiver at the same rate that your transmissions
    are leaving the network.
    if the sender transmits when receiving an ack, that ack is "proof" that a packet
    left the network and a new one wouldn't overload any device.
  Fast recovery -- maintain the ack clock during fast retransmission by delaying the reduction
    of cwnd and switching to a slow-start-like rule for *increasing* cwnd.
  ssthresh does not always fail.

** lecture 22
BitTorrent:
  http://broadband.mpi-sws.mpg.de/transparency/bttest.php
  http://broadband.mpi-sws.org/residential/bttest.php
    to test whether your connection has issues with the bittorrent protocol.
    I've tried it only from home.
  send() may return having only placed part of the data you asked for
  into the send (retransmission) buffer.
    - may return some value smaller than the length of what you told it to send.
  16K pieces are fine.
  Check if you can seed something other than Jay
  (using a tracker other than mine, which may suck.)
  Tracker may probe you.
  mkdir ruby
  cd ruby-1.8.6 && ./configure --prefix=$HOME/ruby
  character encoding (java people) "ISO-8859-1".
    My machine defaults to UTF-8, which breaks stuff (info hash, info hash encoding)

From Dr. Pugh.  Too late, I know.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Encode {

    public static void main(String args[]) {
        byte[] test = { 1, 2, 3, 4 };
        System.out.println(encode(test, 0, 4));
    }

    /**
     * Determine the URL encoding of the SHA1 hash of a range of bytes in a byte
     * array
     *
     * @param data -
     *            byte array holding the data in question
     * @param offset -
     *            the offset to the first array element of interest
     * @param length -
     *            the number of bytes to be hashed
     * @return the URL encoding of the SHA1 hash of those bytes.
     */
    public static String encode(byte[] data, int offset, int length) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("Can't get SHA algorithm implementation", e);
        }
        md.update(data, offset, length);
        byte[] bytes = md.digest();
        StringBuffer result = new StringBuffer();
        for (byte b : bytes) {
            char rawChar = (char) (b & 0xff);
            // values below 0x10 print as one hex digit, so pad them to two
            if (rawChar < 0x10)
                result.append("%0").append(Integer.toHexString(rawChar));
            else if (isUnsafeOrReserved(rawChar))
                result.append("%").append(Integer.toHexString(rawChar));
            else
                result.append(rawChar);
        }
        return result.toString();
    }

    static String unsafeOrReservedCharacters = " <>\"#%{}|\\^~[]`;/?:@=&";

    private static boolean isUnsafeOrReserved(char rawChar) {
        return rawChar < ' ' || rawChar > 0x7f
                || unsafeOrReservedCharacters.indexOf(rawChar) >= 0;
    }
}

Congestion control.
  Throughput as a function of window size.
           _______
          /
         /
        /
  Delay as a function of window size.
                _______
               /
              /
        _____/
  Bottleneck bandwidth
  Bandwidth delay product tells you the perfect window size to use.
  Sloping curve to the TCP sawtooth.
    rtt increases, so the rate of cwnd increase slows.
  if cwnd is not the limiting factor (perhaps receiver's advertised buffer is)
  then it doesn't get to grow.
    It would be unsafe to increment cwnd based on the false experiment of repeatedly sending,
    say 11 packets, to the point that cwnd could allow 32 packets to be sent at once,
    if the receiver was dumb enough to advertise all that extra space.
    similarly, wouldn't want to "halve" cwnd only to have it still be a value larger
    than the current window being used.

  being greedy at the sender: trivial, but potentially useless.
  being greedy at the receiver:
    ack division: take a 1500 byte packet, ack each byte individually.
      (or more practically, every 50 bytes)
      (reasonably easy to defend against, even if people don't).
    force/abuse fast recovery rule.
      send a duplicate ack, get the next packet.
      (eventually, I'll have to fess up to getting the packet; he'll retransmit the frame,
      but I don't care.  (I'm getting a faster transfer anyway) )
    optimistic ack: ack stuff you haven't even seen yet.
      1) lose reliability.
         - for http, and ftp, you can recover parts of files in separate transactions.
      2) ack before the packet was sent, likely would be ignored, then you might stall.
         so the greedy receiver has to be able to predict where the sender will be.

** lecture 23
See course web page.
  1. "new" tracker is "xbnbt"; appends this warning about using a "key"
     if your request says "&key=ABCDEF" where ABCDEF is a random string,
     it will want you to continue to use that key.
  1a. peers talking on nauseated can talk to each other.
  1b. "new" tracker is not verifying that it can talk to you before putting you in the list.
      --- good: you're listed.
      --- bad: so is everyone else, even if they can't talk.
  2. tracker running on nauseated itself.
  3. technically, you're supposed to respond to an incoming handshake message
     after the info_hash (potentially before the peer_id)
  3a. old tracker was testing this on you.
  3b. new tracker just lists you anyway, even if you cannot be contacted from any other machine.
  Bizarre scheme where event=started brings you no peers.  (so don't set event=started.
:)
  append to .bashrc:
    export PATH=$PATH:/home/me/ruby/bin

Sections 8.1, 8.3.4 and 8.4 for tuesday (diffie hellman, basic security stuff, kerberos)
You get to use openssl or gnutls (which you're not going to use because it's not as good).

Plan for today: tcpdump over tcp review
  with delayed ack, slow start doesn't double every RTT.
  window advertised
  server effectively keeps acking the request (acks are free)
  initial sequence numbers are randomly chosen
  TCPDUMP of an active TCP session.
  Sequence/ack plot

** lecture 24
Security
Future lecture 25: RED, ECN, Vegas

Two classes of encryption methods:
  Symmetric key (both sides have the same key, the same shared secret that no one else
    (aside from the government hehe) has).
  Asymmetric or public/private key methods.
    (one side has a very secret key, the other side has a widely-known piece).
    Likely: servers have such (private) keys.
      you'll receive their public key.  <- certificate
      certificate == public key signed by someone you trust (verisign)
      signed == (to first approximation) encrypted with a private key.

  If I encrypt a message with your public key
  (the public key corresponding to your private key), who can decrypt it?
    - only you.
  If I encrypt a message with my public key
  (the public key corresponding to my private key), who can decrypt it?
    - only me.
  If I encrypt a message with my private key, who can decrypt it?
    - anyone (who has the public key, which is assumed public).
    - what do they know?  I wrote it.
    - they don't (necessarily) know when I wrote it.

  Each public/private key is a pair.  Encrypting using the public key provides a message
  that can be decrypted only with the private key; encrypting using the private key
  provides a message that can be decrypted only with the public key.

Expected uses of public and private keys:
  PGP for mail.
    - loosely defines the "message" to be the email message
    - encrypted
    - signed
  SSL / TLS ("secure sockets")
  SSH
Typically, these operations are used only to bootstrap a symmetric key exchange.
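The pairing of public and private keys described above (what one key locks, only the other unlocks) can be tried with textbook RSA on tiny numbers. RSA is not named in these notes; it is used here as one concrete public/private scheme, and the primes, exponent and helper name are assumptions chosen for illustration. Numbers this small are trivially breakable.

```python
# Textbook RSA with toy primes: for illustration only, not secure.
p, q = 61, 53
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent: (e, n) is the public key
d = pow(e, -1, phi)          # private exponent: (d, n); needs Python 3.8+


def apply_key(exponent, value):
    """Encrypt or decrypt: the same operation, with one key of the pair."""
    return pow(value, exponent, n)


message = 65

# Encrypt with the public key: only the private key holder can read it.
ciphertext = apply_key(e, message)
print(apply_key(d, ciphertext) == message)   # True

# "Encrypt" with the private key: anyone with the public key can read it,
# and learns that the private key holder wrote it (a signature).
signature = apply_key(d, message)
print(apply_key(e, signature) == message)    # True
```

In practice the value being signed is a hash of the message rather than the message itself, which matches the signature layout given later in these notes.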
  SSH: when you connect to a server, the public key operations are used only to ensure that
  you and the server share a secret key for the rest of the session.  Then you get to use
  the symmetric key operations for the rest of the session; these operations are faster.
  [[ I believe PGP will use the same sort of operations for encrypting a larger message. ]]

Why might you trust a public key?  How do you know it belongs to your recipient?
  - an option is that some "trusted" entity "vouches for" the idea that the key belongs
    to a specific user.  (verisign scheme)
  - PGP scheme: either you personally verify the key at some "party" or trust that someone
    you trust will vouch for the key's ownership.  (web of trust)
  = the keys are signed.
    - signed key + owner information -> certificate.
    - if signed by somebody you trust, you can believe that the pair belong together.

Signature :
  [ message ] [ ENCRYPT( private_key, HASH( message ) ) ]
  - guy with private_key said message.
Certificate :
  [ Bob, public key of Bob ] [ ENCRYPT( private key of authority, HASH( [ Bob, public key of Bob ] ) ) ]
  - authority with private_key said Bob's public key is public key of Bob.
Chain of certificates: one true verisign key, that signs more temporary keys, that signs,
  say OIT's key, which then signs the department's key, which then signs https://www.cs.
  (exaggeration)
Cert: just a signature where the stuff being signed is of a specific type,
  and having a specific meaning.

SSH's implementation -- a little bit of protocol negotiation (which algorithms are supported)
  followed by Diffie Hellman, followed by encrypted stuff.

Diffie Hellman
  - Allows you to establish a shared secret with another node without transmitting
    the secret across the network.
  - an eavesdropper seeing g^a and g^b cannot derive g^{ab};
    however, with a and g^b, alice can compute g^{ab},
    and bob can compute g^{ab} using g^a and b.
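The exchange is short enough to run. A minimal sketch of the arithmetic only, assuming a demo-sized prime modulus p (the Mersenne prime 2^127 - 1) and base g = 3; both are assumptions for illustration, far too small for real use, and nothing here authenticates either side.

```python
import secrets

# Assumptions: demo-sized public parameters.  Real deployments use
# standardized groups with much larger primes.
p = 2**127 - 1
g = 3

a = secrets.randbelow(p - 2) + 1   # alice's secret, never sent
b = secrets.randbelow(p - 2) + 1   # bob's secret, never sent

A = pow(g, a, p)   # alice sends g^a
B = pow(g, b, p)   # bob sends g^b

alice_secret = pow(B, a, p)   # (g^b)^a
bob_secret = pow(A, b, p)     # (g^a)^b

print(alice_secret == bob_secret)   # True
```

Only A and B cross the network; both ends arrive at g^{ab} mod p without either secret exponent being transmitted.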
  - we would like to believe that if we sent g^a to the server, that g^b came from the server
    and all messages encrypted with g^{ab} can only be decrypted by the server.
  - why should we not yet believe that?
    - because g^b didn't necessarily come from the server.
    - man/monkey in the middle can pretend to be the server when talking to you,
      pretend to be you when talking to the server.

Burrows Abadi Needham Logic of Authentication
  - realized that authentication protocols they (and others) came up with had flaws
    (unstated assumptions).
  - bad guys can: (and you might forget there's such a danger if writing the protocol yourself).
    * store and replay old messages.
      - don't necessarily know the details of the message, but sending it again later
        might do something useful.
    * break keys after the protocol runs.
      - the difficulty of encrypting messages is related to the length of keys,
      - the difficulty of enumerating all the keys (to try em against a message)
        also depends on the length of the keys.
      => "can't" encrypt forever.
    * run pieces of the protocol to solicit messages.

  Method: translate all the messages of a protocol into what they're supposed to mean.
    What they intend to convey.
    Then to logically prove, for example, that a specific key is a good key to use
    to talk between a specific user and a specific server.

  P and Q are participants in a protocol.
    P sees m      : P sees a message m.
    Q said X      : Q once (at some point) said X.
    P believes s  : P believes a statement s
    P controls s  : P has jurisdiction over s (say, that a user has an account)
    fresh(X)      : X is a new statement.
  Example:
    P sees {X}_K          : a message of X encrypted with K was seen by P.
    P believes Q <-K-> P  : P believes that K is a shared key between P and Q.
    therefore: P believes Q said X.

  In order to promote "said" to "believes", we have to know that X is fresh.
    Option 1: timestamp.  (would have to be in the encrypted part).
      difficult to securely synchronize clocks.
    Option 2: a nonce.
      (a random, used-only-once value that the recipient of the message has made up recently.)
    Freshness is contagious: if the message includes something that is fresh,
    everything else in it is fresh too.

  In order to promote
    P believes Q believes X
  to
    P believes X
  need to know that Q controls X (that Q has authority or jurisdiction).
    usually an assumption.

Kerberos is an example of a system that can be analyzed using this scheme.
  * uses only DES (shared key crypto)
  * Key Distribution Center (KDC) generates a Ticket Granting Ticket (TGT) for a user,
    (roughly) encrypted using the user's password.
    - User's password is a secret shared between the user and the server.
  Push until later.

Nonce <- you know this.

Unix password file:
  there is a salt for each password.
  salt: random value (12 bits)
  stored password in the passwd file is:
    salt + Encrypt ( password (as the key), salt, known string )
  in the absence of salt:
    some nefarious user might take all the likely passwords and encrypt them,
    then store the list of encrypted likely passwords and compare.

** lecture 25
Security Vocabulary
RED, ECN, Vegas (Congestion control cleanup)

Security vocabulary bits
  * authentication - proving you're you.
  * authorization - proving that you're supposed to have the access.
  * privacy (confidentiality) - ensuring no one else can see:
    * what you're doing (applications, messages)
    * who you're talking to
  * secure hash (MD5, SHA1) or cryptographic checksum
    - compared against CRCs (or checksums): really really one way.
      can't tell how changing a bit in the message will perturb the hash.
  * zombie
    - aside from the movies
    - aside from the orphaned process in a unix system
    - computer under the control of an evildoer.
  * botnet
    - collection of zombies.
    - often for rent.
    - behavior of a botnet is diffused and hard to identify inside the network.
    - "flash crowd" - 'slashdot effect' ("good" not botnet effect)
    - useful for Denial of Service (dos) attacks.
    - useful for the spam.
  * denial of service
    - no attempt to break into a machine, make stuff public, just slow everything down.
      (or crash a box)
    - try to exhaust resources (memory, bandwidth, cpu)
  * spoofing
    - stick a false source address on your packets.
    - difficult to do if you need a TCP connection to complete the request.
  ** syn attack (denial of service)
    - send a TCP SYN from a spoofed address
    - in a normal connection setup, the recipient of the SYN allocates state.
      buffers, some mapping in a table.
    - if you're evil, you can cause a server to allocate state, and hold onto it,
      potentially for minutes.
    - if you tried from your own source address, the good guys would find you.
      would call oit, or comcast, and oit might shut down your machine.
      not effective: might not be hard to filter you out locally.
      -- you'd get a bunch of syn acks, but you wouldn't care.
    - spoof different source addresses: destination can't tell the difference between
      a legit request and one of yours that you have no intention of finishing.
  ** syn cookie
    - encode the important pieces of the SYN_RCVD state into the sequence number
      we send back to the client.
    - if the syn/ack reached the source address (and it's a legit source address),
      its ack is what sets up the connection.
  * ingress filtering  % (reverse path filtering)
    - look at the source address, and try to ensure that the source address makes sense.
      (belongs to a network in that direction).
    - "makes sense" - you would forward a packet to that source via the port it was received on.
    - OIT can protect the internet.
    - can't protect itself.
      -> requires universal deployment.
      -> could still spoof the addresses inside your network.
  * sniffing
    - reading other people's traffic.
    - if you want your traffic to be private, don't send plaintext.
    - security conferences, some guy will sniff all the passwords they can find,
      then post a slide on the projector with all such passwords.
  * smurfing
    - broadcast ping with a spoofed (victim) source address.
    - everybody on that subnet sends a response back to the victim.
       *) don't respond to a broadcast ping (or at least not one from off-subnet)
    - example of "amplification"
      my tiny little evil packet can cause hundreds of packets to be
      sent to a victim.
 * scanning (port scanning)
    - port scan:
       - send your SYN packet to each tcp port on a destination machine.
       - if you get a SYN/ACK, there's something there.
       - if you get a RST (reset, one of the TCP flags), there's nothing there.
       - if you get nothing, likely firewalled.
    - might also send udp messages looking for "unreachable" messages.
    - why might people think this is a bad thing?
       - maybe something is broken and certain people believe that
         hiding brokenness is security.
 * firewall
    - keep the "fire" on the other side of the firewall. only good stuff inside.
    - a few high profile attacks have been from
       - compromised laptops behind firewalls.
       - compromised machines VPN-ing behind the firewall
         (VPN being a means to tunnel into the space behind the
          firewall as if the machine is back there.)
    - deny access it doesn't understand.
 * IDS (intrusion detection system)
    - try to permit as much as possible, but recognize any attacks or intrusions.
 * phishing   % (why don't ebay and banks sign their regular spam?)
    - trying to direct grandma to your web server as if it's a real bank.
 * fail-closed vs. fail-open.
    - firewall crashes -> fail-closed -> safe.
    - IDS crashes -> network still operates -> possibly less safe. (fails open)
 * IWF (idiot with firewall)
    - zonealarm. will report incoming pings.
    - might generate actual abuse reports for protocol compliant things.
 * optimistic ack
    - possibly for speeding up your connection.
    - possibly for hosing the network completely.
 * man-in-the-middle attack (monkey in the middle)
    - especially in the context of diffie hellman.
    * ways to avoid:
       - encrypt one of the pieces.
       - sign the key (and encrypt the key with the key)
       - sign g^a.
    * all require having some key material associated with the other side.
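aside: a toy sketch of the man-in-the-middle attack on diffie hellman
named above. Everything here is an assumption made for illustration:
the prime and base are far too small for real use, and the names
(modpow, dh_public, dh_shared, dh_mitm, struct mitm_result) are
hypothetical. Mallory replaces g^a and g^b on the wire with her own
g^m, so each honest side ends up sharing a key with her instead of
with each other.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters, assumed for illustration only. */
#define DH_P 2147483647ULL   /* 2^31 - 1, a prime */
#define DH_G 7ULL            /* base g */

/* Square-and-multiply: base^exp mod m.
 * Operands stay below 2^31, so every product fits in 64 bits. */
uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;

    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* g^secret mod p: the value that is sent in the clear. */
uint64_t dh_public(uint64_t secret)
{
    return modpow(DH_G, secret, DH_P);
}

/* (peer's public value)^secret mod p: the key each side derives. */
uint64_t dh_shared(uint64_t secret, uint64_t peer_public)
{
    return modpow(peer_public, secret, DH_P);
}

struct mitm_result {
    uint64_t alice_key;      /* key Alice derives */
    uint64_t bob_key;        /* key Bob derives */
    uint64_t mallory_alice;  /* key Mallory derives for the Alice side */
    uint64_t mallory_bob;    /* key Mallory derives for the Bob side */
};

/* a, b, m are the private exponents of Alice, Bob and Mallory. */
struct mitm_result dh_mitm(uint64_t a, uint64_t b, uint64_t m)
{
    struct mitm_result r;
    uint64_t A, B, M;

    assert(a > 1 && b > 1 && m > 1);
    A = dh_public(a);
    B = dh_public(b);
    M = dh_public(m);

    r.alice_key = dh_shared(a, M);      /* Alice received M in place of B */
    r.bob_key = dh_shared(b, M);        /* Bob received M in place of A */
    r.mallory_alice = dh_shared(m, A);  /* Mallory saw the real g^a */
    r.mallory_bob = dh_shared(m, B);    /* Mallory saw the real g^b */
    return r;
}
```

Calling dh_mitm with any three exponents gives alice_key equal to
mallory_alice and bob_key equal to mallory_bob, so Mallory can decrypt
and re-encrypt in both directions. Neither side can tell M from the
real value unless it is signed or encrypted with key material they
already associate with the other side, which is the point of the
"ways to avoid" list.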
 * web of trust
    - context: PGP
    - social network over keys. my friend vouches for me. I vouch
      for you. ultimately, if you send a message to my friend, he
      can believe it's you.
 * revocation list
    - in the context of certificates, a revocation list is roughly
      an anti-certificate. A signed statement that this key has been lost.
    - trick is publishing it widely enough.
    - "usually" the certificate expires before things go bad.

Deferred (though much of it already discussed, or in the text.)
   dictionary attack         % (why have a good password)
   replay attack             % (why have a nonce)
   chosen plaintext attack   % (why not to use the same key too often)
   signatures, certificates
   shared key
   kerberos
   trusted third party
   BAN logic: Fresh
   BAN logic: Said
   BAN logic: Believes
   BAN logic: Controls

** lecture 26

BT evaluation plan: two clients each with half, configured to TFT.
   complete and keep 'em happy.
   I intend to prep some test code for you and have this version due friday 5/16.

DNS.

Prehistory
   single hosts.txt file, distributed to every machine.
   Problems:
    1) many many ip addresses. the file would be huge.
    2) some guy has to handle the whole file.
    3) changes probably frequent, would have to either:
        a) propagate frequently.
        b) tolerate stale copies.
    4) that "some guy" would have to know which updates are
       legitimate, and which are bogus attempts to hijack.
    5) there was no hierarchy, which means that lots of people could
       try to name their machines with the same name.
       ( could consider this a separate problem )

Major features.
   Domain Name System.
    - Distributed. nobody knows the whole list. nobody knows every
      name to ip address binding.
    - Caching. Many servers will hold on to answers to queries so
      that later queries for the same name are answered quickly.
       - Soft state. (cache can die and be repopulated)
    - Replication. Every name in DNS is provided by authoritative name
      servers. It is (loosely) mandated that there be at least two
      authoritative name servers per zone.
    - Queries are sent (and answered) using UDP. (though there are
      TCP methods as well.)
    - Hierarchy (root servers at the top/root of the tree of names)

DNS name is a sequence of labels separated by '.'
      www.cs.umd.edu.
What is the difference between "www.cs.umd.edu." and "www.cs.umd.edu" ?
   (we'll get to this in a second)

A single tree. The root is "unnamed". or it is "."
The children of the root are Top Level Domains (TLDs)
   com, org, edu, gov, net, mil   << generic TLD
   tv, cc, uk, us                 << country code top level domains
   info                           << too new to have any reasonable name
   arpa                           << special ( in-addr.arpa domain supports
                                     reverse name lookup, from IP to name)

A zone is a contiguous region of the tree.
   oit might administer a group of name servers, all of which are
   authoritative (hold hard-state, not cached name bindings) for the
   hosts within the following domains :
      umd.edu
      psychology.umd.edu
      chem.umd.edu
      physics.umd.edu
   but NOT:
      cs.umd.edu
      umiacs.umd.edu
   which are delegated by OIT to machines run by Brad and Fritz.

   might get an address for freud.psychology.umd.edu from the same
   server that provides www.umd.edu, while an address for
   www.cs.umd.edu will not come from that server.

There are two types of query:
   Recursive query.
   Iterative query.

A recursive query asks a name server to figure it out, and return
only the final answer.
An iterative query asks a name server for as much direction as it
can give without sending any queries of its own.

Your machine (web client) will send the name to a name server
you've configured (or learned from DHCP) with a recursive query.

This local (recursive) name server will then either
   a) answer the query from its cache, or
   b) ask the nameserver authoritative for that name
      (learning which one is authoritative if necessary by walking
       down the tree)

** lecture 27

BT testing progress report.

RED, ECN, Vegas in the context of all manner of in-network (and end
system) sharing schemes.

aside: IETF RFCs (TCP is RFC 793)
   MAY - totally optional.
         probably a good idea, but optional.
   SHOULD - we'd make it a requirement, but someone might have a
            really good excuse.
   MUST - requirement
   (there's an RFC that defines the terms in the RFCs: RFC 2119)

For performance, and happy sharing in the network, we have so far
TCP "Reno" congestion control:
   aimd, congestion avoidance, slow start, fast retransmission, fast recovery.

Reducing by "half" (ssthresh=cwnd/2, cwnd=ssthresh) was designed
for really one situation: yours was the only connection on the wire,
and now there are two.

Not very good:
 1) potential to lie in this system (though mostly unused)
 2) what happens on very fast, long distance links?
    rtt very large. bandwidth is very large.
    => bandwidth delay product is very^2 large.
    window size that we need in order to keep the pipe full is thus
    very very large.
    need v.v. large window. let's say we just halved it.
    will take very very many very large RTTs to get back.
    so very high rate, long distance TCP is a mess.
 3) two flows using the same bottleneck, one with 100 ms RTT, one
    with 10ms RTT.
     -) can sort of assume that the link has a fixed loss rate.
        (both flows are equally likely (on a per-packet basis) to
         overflow the queue)
    the 100 ms connection will take 10x as long to increase its window.
    Some (not me, but it's philosophy) would call this unfair.
 4) need a queue at the bottleneck that is of size proportional to
    the bandwidth delay product of a typical connection traversing
    the link.
     - when the loss occurs, there's one b.d.p. in the links of the
       network, one b.d.p. in the queue that just overflowed, and so
       the window (cwnd) at the sender is 2x the b.d.p.
       after halving, cwnd is still one b.d.p., meaning that it never
       need go below the size it needs to send at full rate.

Goals to fix this status quo:
 1) reducing the delay (keeping the queue from growing so large)
 2) fairness (giving all the flows a roughly equal share)
 3) signaling congestion (telling the sender to slow down) without
    dropping packets on the floor and causing retransmissions.

TCP Vegas.
 End to end congestion avoidance. Replacement (or add-on) for Reno.
 All congestion control in TCP is up to the sender. (receivers are
 assumed to be dumb.)

 Idea:
   know (or can learn) the base round trip time. (rtt with no queueing)
   know the observed bandwidth (we're sending, we can tell how fast
   we're sending).
   can figure out if our window is too large (or too small).

   Expected Rate = Window Size / Base RTT.
   Actual Rate measured by sending a distinguished packet (store the
     sequence number and time transmitted) and counting how many bytes
     have been sent (since that packet) when we see its ack.
   Actual Rate = Bytes Sent / RTT of distinguished packet.

   Difference = Expected - Actual

   If Difference is small ( less than a magic parameter alpha )
      i.e., we're getting about as much as we expect to get.
      - increase the window size a bit. (linearly)
   If the Difference is large (greater than beta)
      i.e., we're expecting much more. the network is not delivering.
      maybe someone else is consuming the bandwidth.
      - decrease a bit. linearly. not by half.
   If the Difference is between alpha and beta
      - leave the window alone.

 Keeps queues small. certainly can keep them from overflowing.

 thought to be major downside: it doesn't fight Reno.
    Reno will shove Vegas out of a contested queue.
    Reno will get a higher cwnd before a loss.
    (a loss under vegas still has to be reacted to with a cwnd /= 2.)

RED.
 On a router, trying to keep the queue short: manage the average
 queue occupancy, and drop packets (sending the signal of congestion)
 only when the average exceeds a threshold.

 As opposed to drop-tail, this is
   Random - the router can drop any packet it wants.
   Early Detection - before the queue overflows.

 Advantages:
   - drop before we have to drop lots of packets at once.
     dropping one is sufficient to tell the senders to slow down.
     dropping more just forces the sender into terrible recovery
     mode. (timeout)
   - by deciding to signal congestion before queue overflows, can do ECN

 Figure with the graph of drop probability.

ECN - Explicit Congestion Notification.
 - scheme in which we convert the dropped packet into a bit.
   rather than drop a packet as an indication of congestion, the
   router can set a bit in the header, which the receiver will forward
   back to the sender so that the sender behaves as if a packet were
   lost, just without the retransmission.

 Each packet can have the ECT bit set by the sender - ECN Capable Transport.
    "If you mark this packet, I'll pretend it was dropped"
 If the router wants to mark it, it sets the CE bit - Congestion Experienced.
 Both bits are in the IP header. -- allows routers to check and flip bits

 If we're a TCP receiver, and we see a CE bit,
    set ECE - ECN-Echo - in every ack
    until we see a CWR - congestion window reduced - bit from the
    sender, acknowledging the ECE.
 These bits are in the TCP header.

 If we see two adjacent packets both with CE, that gets treated as
 just one "congestion event"... just the same as if two adjacent
 packets (or more generally those in the same window) get lost.
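
aside: a minimal sketch of the ECE / CWR handshake described above,
assuming one boolean per header bit. The struct and function names are
hypothetical, sequence number wraparound is ignored, and the floor of
2 segments on the window is an assumption of the sketch, not something
from the notes. The receiver keeps echoing ECE until it sees CWR; the
sender halves its window at most once per window of data, which is the
"one congestion event" rule.

```c
#include <assert.h>
#include <stdbool.h>

/* Receiver side: remembers whether acks must currently carry ECE. */
struct ecn_receiver {
    bool echoing;
};

/* Called for each arriving data segment.
 * ce:  the CE bit from the IP header.
 * cwr: the CWR bit from the TCP header.
 * Returns true if the ack for this segment should carry ECE. */
bool ecn_receiver_on_segment(struct ecn_receiver *rx, bool ce, bool cwr)
{
    assert(rx);
    if (cwr)
        rx->echoing = false;  /* sender has acknowledged our ECE */
    if (ce)
        rx->echoing = true;   /* a (new) mark: start or keep echoing */
    return rx->echoing;
}

/* Sender side, windows counted in segments. */
struct ecn_sender {
    unsigned cwnd;          /* congestion window */
    unsigned ssthresh;      /* slow start threshold */
    unsigned long snd_nxt;  /* next sequence number to be sent */
    unsigned long recover;  /* ECE on acks below this is the same event */
    bool send_cwr;          /* set CWR on the next data segment */
};

/* Called for each arriving ack. ece is the ECE bit from the TCP header. */
void ecn_sender_on_ack(struct ecn_sender *tx, unsigned long ack_seq, bool ece)
{
    assert(tx);
    if (!ece)
        return;
    if (ack_seq < tx->recover)
        return;  /* already reacted for this window of data */

    /* Behave as if a packet were lost, without retransmitting. */
    tx->ssthresh = tx->cwnd / 2 > 2 ? tx->cwnd / 2 : 2;
    tx->cwnd = tx->ssthresh;
    tx->recover = tx->snd_nxt;  /* ignore ECE until this data is acked */
    tx->send_cwr = true;        /* tell the receiver to stop echoing */
}
```

A receiver that sees a CE mark returns true from
ecn_receiver_on_segment for every later segment until one arrives with
cwr set. A sender with cwnd 20 that gets two ECE acks in a row ends at
cwnd 10, not 5, because the second ack falls below recover.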