Final Project
This project can be done in groups of up to 4.
The goal of this project is for you to carefully build a social network
and to analyze it using the knowledge and skills you have developed in
class. You will be required to produce evidence to support any of your
claims.
Choose a social network. This can be an online network (Facebook, Twitter,
YouTube), a network extracted from other data (like your email network,
discussion boards, interactive game playing), or built to represent a
network that exists offline.
- Define the nodes in the network. Who are they and what do you want to
know about them?
- What are the links in the network? What do they represent?
- Collect the data for the network. Using the definitions you have
chosen for nodes and links, build a representation of the network (an
adjacency list or matrix). You may use tools like NodeXL to get this data,
or you can build the network by hand. However, you must have real data for
all the entities; you cannot just make up connections or hypothesize what
they may be. If you want to use a simulated network, you must have a good
basis for your simulation and use tested techniques. This is not
recommended unless you have a good reason for doing it. If you want to
simulate a network, you must talk to me first.
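As a rough illustration of the two representations mentioned above, here is a minimal Python sketch that turns a list of collected links into an adjacency list and an adjacency matrix. The names and edges are hypothetical placeholders, and the helper functions are made up for this example; your own data must come from a real source.

```python
from collections import defaultdict

# Hypothetical undirected links; replace with the pairs you actually collected.
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"), ("Carol", "Dave")]

def build_adjacency_list(edges):
    """Map each node to the set of its neighbors (undirected network)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

def build_adjacency_matrix(adjacency):
    """Return (sorted node names, 0/1 matrix) for the same network."""
    nodes = sorted(adjacency)
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, neighbors in adjacency.items():
        for v in neighbors:
            matrix[index[u]][index[v]] = 1
    return nodes, matrix

adjacency = build_adjacency_list(edges)
nodes, matrix = build_adjacency_matrix(adjacency)

# Print the adjacency list, then the matrix row for each node.
for name in nodes:
    print(name, "->", sorted(adjacency[name]))
for name, row in zip(nodes, matrix):
    print(name, row)
```

An adjacency list is usually easier to build by hand and to submit; the matrix form is convenient when a tool expects it.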
Once you have the data, you should perform the following tasks.
- Visualize this network. You can create a visualization in Gephi, NodeXL,
ManyEyes, some other software, or you can draw it by hand.
- Identify structural features of the network. Are there interesting
hubs or clusters? Is there a high or low clustering coefficient? Does it
look like a small world network? Describe interesting features.
- Explain how those structural features relate to real factors in the
network. For example, if you were studying the network of Congresspeople
who use Twitter and you see that John McCain has more followers than anyone
else, you can explain this by the fact that he ran for President. All your
explanations must be grounded in fact with supporting evidence from
primary sources; they cannot be intuitive guesses.
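To make the structural features above concrete, here is a minimal sketch of how degree (for spotting hubs) and the clustering coefficient could be computed from an adjacency list in plain Python. The network and function names are hypothetical, and tools like Gephi or NodeXL report these statistics for you; this only shows what the numbers mean.

```python
from itertools import combinations

# Hypothetical undirected adjacency list; substitute your collected network.
adjacency = {
    "Alice": {"Bob", "Carol", "Dave"},
    "Bob": {"Alice", "Carol"},
    "Carol": {"Alice", "Bob"},
    "Dave": {"Alice"},
}

def degrees(adjacency):
    """Number of neighbors for each node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}

def local_clustering(adjacency, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here: no neighbor pairs means 0
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)

def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)

# Candidate hubs: nodes sorted from highest to lowest degree.
hubs = sorted(degrees(adjacency).items(), key=lambda item: item[1], reverse=True)
print(hubs[0])
print(round(average_clustering(adjacency), 3))
```

A number on its own is not an explanation; you still need evidence for why a node is a hub or why a cluster exists.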
Here are some example projects:
- Email Network: You create a social network from your email (or
someone else's email, e.g. the Enron Email Corpus). The nodes
are people and the edges indicate that people were on the same email. This
would add a link between the sender and all the recipients. People CCed on
the same message could also be linked. Edges can be labeled with the
number of messages exchanged. They can also be directed (Alice -> Bob if
Alice emailed Bob, and Bob -> Alice if Bob emailed her back). Who are your
most frequent correspondents? Which nodes are most central? Using your own
knowledge of your strong ties and weak ties, does centrality, degree,
frequency of emails, or other factors relate to the tie strength? Does
your email network have clusters of people you email for different things
(e.g. family, classmates, work people, friends, etc)?
- Discussion board network: Based on posted questions and replies, build
a network of people who interact on a discussion board online. You could
also include topics as nodes in the network and connect people to the
topics they have discussed. Who are the most central people? Do they start
a lot of discussions or mostly reply?
Do they engage in a lot of back and forth or do they disappear? Who
connects to the most topics? Are those people more likely to share
information? Can you find different types of users based on how they look
in the social network?
- Sexual Contacts Network: Admittedly, it's hard to get this kind of
data where people are your nodes and they are linked if they slept
together. However, it's an *extremely* common kind of network studied in
epidemiology to better understand transmission of STDs. Who is most
central in the network you create? Who has a high degree or low degree?
Based on what we learn about spreading of disease, what is the best way to
stop it? Which nodes do you target, when, and how?
- I have a few research projects that could be done as class projects.
These are a good opportunity for students interested in possibly going on
to graduate school, since we will try to generate publishable results.
These look very good on grad school applications. If you're interested in
something like that, please let me know.
- You are also welcome to choose your own topic. If you want to do that,
please email Dr. Golbeck to discuss your ideas ahead of the first
deadline.
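For the email network example above, the directed and weighted edges could be built along these lines. This is only a sketch under assumed inputs: the messages list, its (sender, recipients) format, and the function names are hypothetical, and real data would come from your mailbox export or a corpus such as the Enron emails.

```python
from collections import Counter

# Hypothetical messages as (sender, [recipients]) pairs.
messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
]

def directed_edge_weights(messages):
    """Count messages per (sender, recipient) pair, e.g. Alice -> Bob."""
    weights = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:  # ignore mail sent to yourself
                weights[(sender, recipient)] += 1
    return weights

def undirected_edge_weights(directed):
    """Merge both directions into one count of messages exchanged."""
    merged = Counter()
    for (sender, recipient), count in directed.items():
        merged[frozenset((sender, recipient))] += count
    return merged

directed = directed_edge_weights(messages)
exchanged = undirected_edge_weights(directed)

for (sender, recipient), count in sorted(directed.items()):
    print(f"{sender} -> {recipient}: {count}")
```

The same counting approach would work for replies on a discussion board, with posters in place of senders.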
Timeline
Note: These deadlines are mostly there to ensure you are making progress
at the right rate to successfully complete the project. I will provide
some guidance in the early stages of the project, but I will not be
reviewing your drafts nor making comments about what you need to change to
get an A on the paper. I will not grade your assignments ahead of time. If
you have specific questions, please ask, but do not send your full paper
and just ask me to look it over and comment. I won't do it.
- April 3: Short (1-2 paragraph) description of your network, data
source, and groups chosen. Email this info to Dr. Golbeck with subject
line "INFM289I Project Update". Include an estimate of how big your
network will be. I will review these and let you know if they are big
enough or too big. Groups should have networks that are significantly
larger than one person could work with on their own; each member must
collect the same amount of data that they would if they were working
alone.
- April 10: Data collection should be complete. An adjacency list for
your network along with a 1/2 page description of the collection process
is due, emailed to Dr. Golbeck with the subject line "INFM289I Project
Update".
- April 17: Visualizations and 1/2 - 1 page list of bullet points
describing interesting features of the network emailed to Dr. Golbeck with
the subject line "INFM289I Project Update".
- April 24: 2 page single spaced extended outline of paper due. This
should include descriptions of all the analysis questions, short answers
that you will support with evidence, and data collection methods.
- May 1: Complete draft of papers due.
- May 1 and 3: In class presentations
- May 8: Final Papers due in class
You can work alone or in groups of up to 4. The workload should scale with
the number of group members, i.e. a group of 4 must produce 4X the work of
a person working alone.
Final papers should be 5 pages single spaced for a person working alone
and for groups there should be an additional 4 pages for each additional
member (i.e. a group of 2 needs a 9 page paper, a group of 3 needs a 13
page paper, and a group of 4 needs a 17 page paper). This means that you
should increase the number of analysis questions per person in order to
substantially increase the size of the project for groups. Graphs, charts,
visualizations, tables, etc. all count toward your page total.
Keep this in mind - if you do not feel like you can fill 5 single spaced
pages with analysis of your network, then you have picked something too
simple.
Things I want to see in the final paper:
- What is the network you looked at? If appropriate, why does this
network exist or what is it about?
- How did you collect your data? Was it by hand? Was the network already
available? Why did you make the choices you did?
- Show visualizations. They count toward the page total but don't go
overboard. They shouldn't be HUGE nor should you use a dozen of them just
to take up space.
- Analysis - this is most important. Tell me interesting things you
discovered based on collecting and looking at this network. How do
features like centrality, tie strength, clustering, etc. relate to actions
or roles in the network? Use all the features from class that are
appropriate. If you apply a network principle (e.g. Small Worlds) explain
why that is appropriate (e.g. give the statistics that show a high
clustering coefficient and low average shortest path length).
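As one example of the statistics mentioned in the analysis item above, average shortest path length can be computed with a breadth-first search from every node. The sketch below uses a hypothetical network and made-up function names, and it averages only over pairs that can reach each other, which is an assumption you should state if your network is disconnected.

```python
from collections import deque

# Hypothetical undirected adjacency list; substitute your collected network.
adjacency = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Carol"},
    "Carol": {"Alice", "Bob", "Dave"},
    "Dave": {"Carol"},
}

def bfs_distances(adjacency, source):
    """Shortest hop count from source to every node it can reach."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances

def average_shortest_path_length(adjacency):
    """Mean distance over all ordered pairs of distinct, reachable nodes."""
    total = 0
    pairs = 0
    for source in adjacency:
        for target, distance in bfs_distances(adjacency, source).items():
            if target != source:
                total += distance
                pairs += 1
    return total / pairs if pairs else 0.0

print(round(average_shortest_path_length(adjacency), 3))
```

Whether a value counts as "low" depends on the size of the network, so report it alongside the number of nodes and the clustering coefficient.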
Things I *don't* want:
- A list describing each person in the network and their relationship to
others. E.g. "Bob is the father in this network. Jim, Joe, and Frank are
his sons. Jim is the oldest son. He is a carpenter and likes social
networks..." There's no analysis there and you won't get much credit for
this.
- A list of actions, e.g. "Bob commented 3 times on Frank's posts. Frank
commented 10 times on Jim's posts and 4 times on Joe's posts". A little of
that is ok, but it's also not analysis and it bores me. It also shows you
haven't learned much if this is the best you can do!
Standard things you should be doing anyway: All statements you make must
have evidence to support them. If you are using outside sources, cite them
properly. Everything you submit must be your own work. Use standard 1"
margins with 12 point Times font.
Grading
- 10% for meeting each of the first 5 deadlines above. No credit for
late submissions. This is 50% total.
- 20% in-class presentation
- 30% final paper
Paper Grading
- Length: 2
- Writing quality: 2
- Description of data collection process: 1
- Visualization present: 1
- Analysis: 6
- Note: Depending on your network and the process, I may weight certain
things toward analysis. For example, if you wrote a lengthy computer
program to gather complex data, a description of that code and analysis of
the data online would count toward the analysis score.