A statistical analysis of The Facebook at Salisbury University

I should preface this article by saying that I am an amateur when it comes to statistics, graph-visualization theory and algorithms, and probably everything else: the pains of being an autodidact.

For the past few months I have off-and-on dabbled with a project relating to statistical analysis of the facebook at Salisbury University.

Specifically, I ran a screen scraper that saved, in a plain-text file, every available piece of information about every member registered as a student on Salisbury University's facebook page. Why Salisbury University?

  1. I am an ex-student there (unfortunately), so I still have an email account with which to sign up for the Salisbury facebook.
  2. My current school, Rutgers University, would be far too large for the scope of this study.

My first interest was in using graph-visualization algorithms to see if any patterns emerged based on the connections (edges) between the various users (nodes), the edges of course representing friendship. I ended up using the GEM3D algorithm to visualize this, through a rather shitty program called Tulip.

The original graph visualization, which took in excess of 4 hours on my computer, was an absolute mess. No matter how you render 4245 nodes and 142,208 edges, the results will rarely offer any insights. Thus I decided to take a random sample of 250 people, and make a visualization of their connections.
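A minimal sketch of how such a sample could be drawn, in Python. The original scripts are not shown here, so the function name and the toy data below are made up for illustration; the idea is simply to pick the users at random and then keep only the friendships where both ends landed in the sample.

```python
import random

def sample_subgraph(nodes, edges, k, seed=0):
    """Pick k nodes uniformly at random and keep only the edges between them."""
    rng = random.Random(seed)                 # fixed seed so the sample is repeatable
    chosen = set(rng.sample(sorted(nodes), k))
    kept = [(a, b) for a, b in edges if a in chosen and b in chosen]
    return chosen, kept

# Toy data (invented): six users joined in a ring of friendships.
nodes = {"u1", "u2", "u3", "u4", "u5", "u6"}
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"),
         ("u4", "u5"), ("u5", "u6"), ("u6", "u1")]

chosen, kept = sample_subgraph(nodes, edges, k=3)
print(len(chosen), "users,", len(kept), "friendships between them")
```

Note that a uniform sample of users thins out the edges much faster than the nodes, so the sampled graph will look sparser than the real network.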

The 3D renderings produced by Tulip were, shall we say, subpar, so I wrote a script which translated them into POV-Ray's syntax. Just for kicks I color-coded each node according to the user's birthdate and interpolated the colors along each edge between its two nodes. Here's what resulted:
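As a rough illustration of that kind of translation (not the actual script), the sketch below emits POV-Ray `sphere` and `cylinder` objects from node positions. Everything specific in it is an assumption: the blue-to-red color ramp, the 1980 to 1988 birth-year range, the radii, and the trick of cutting each edge into a few short cylinders so the color can fade from one endpoint to the other.

```python
def lerp(a, b, t):
    """Linear interpolation between two equal-length tuples."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def birth_year_to_color(year, lo=1980, hi=1988):
    """Map a birth year onto a blue (oldest) to red (youngest) ramp. Range is assumed."""
    t = min(max((year - lo) / (hi - lo), 0.0), 1.0)
    return lerp((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), t)

def vec(v):
    """Format a tuple as a POV-Ray vector literal."""
    return "<" + ", ".join(f"{x:.3f}" for x in v) + ">"

def pov_sphere(pos, color, radius=0.2):
    return f"sphere {{ {vec(pos)}, {radius} pigment {{ color rgb {vec(color)} }} }}"

def pov_edge(p0, c0, p1, c1, segments=4, radius=0.05):
    """Approximate a color gradient along an edge with several short cylinders."""
    out = []
    for i in range(segments):
        t0, t1 = i / segments, (i + 1) / segments
        start, end = lerp(p0, p1, t0), lerp(p0, p1, t1)
        color = lerp(c0, c1, (t0 + t1) / 2)   # color at the segment's midpoint
        out.append(f"cylinder {{ {vec(start)}, {vec(end)}, {radius} "
                   f"pigment {{ color rgb {vec(color)} }} }}")
    return out

# Two invented nodes joined by one edge.
a_pos, a_col = (0.0, 0.0, 0.0), birth_year_to_color(1982)
b_pos, b_col = (1.0, 2.0, 0.5), birth_year_to_color(1987)
scene = [pov_sphere(a_pos, a_col), pov_sphere(b_pos, b_col)]
scene += pov_edge(a_pos, a_col, b_pos, b_col)
print("\n".join(scene))
```

A complete scene file would also need a camera and at least one light source, which are left out here.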


[Figure: facebook visualization of 250 random users at Salisbury University]

Clearly, not very useful (other than demonstrating the obvious fact that similar age groups cluster together), but eye candy nonetheless.

I then decided it might be interesting to geocode people's listed hometowns, and then do a scatter plot over a map of the United States. Given that the school is located in southern Maryland, the results are fairly predictable, but interesting nevertheless.
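The plotting half of that can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the lookup table stands in for a real geocoding step (a gazetteer or geocoding service), its coordinates are approximate, and the bounding box for the contiguous United States is a rough guess. Each latitude/longitude pair is mapped linearly onto pixel coordinates (an equirectangular projection), which only lines up with a map image drawn in the same projection.

```python
# Hypothetical stand-in for a real geocoder: hometown -> (latitude, longitude).
HOMETOWNS = {
    "Salisbury, MD": (38.37, -75.60),   # approximate coordinates
    "Baltimore, MD": (39.29, -76.61),
}

# Rough bounding box of the contiguous United States (assumed values).
LAT_MIN, LAT_MAX = 24.0, 50.0
LON_MIN, LON_MAX = -125.0, -66.0

def to_pixel(lat, lon, width=1000, height=500):
    """Equirectangular projection: scale lon/lat linearly onto image pixels."""
    x = (lon - LON_MIN) / (LON_MAX - LON_MIN) * width
    y = (LAT_MAX - lat) / (LAT_MAX - LAT_MIN) * height   # pixel y grows downward
    return x, y

def plot_points(hometowns):
    """Return pixel positions for every hometown the lookup table knows about."""
    points = []
    for town in hometowns:
        if town in HOMETOWNS:            # unrecognized entries are skipped
            points.append(to_pixel(*HOMETOWNS[town]))
    return points

points = plot_points(["Salisbury, MD", "Baltimore, MD", "Nowhere, XX"])
print(points)
```

Free-text hometown fields are messy, so in practice a fair share of entries would probably fail to geocode and drop out of the plot.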


[Figure: scatter plot of hometown locations over a map of the US]

Later, I decided to do something more useful with my data: I ran Pearson's correlations between the variables.[1] If this is all Greek to you, read footnote 2. I've also provided a JavaScript-enabled table of the correlation results. [Warning: the JavaScript in this file is very unoptimized and may crash your browser, or at the very least stall it.] Those who are comfortable with the PDF format, or who wish to print out the results for later viewing, can view the PDF version here (with additional graphs!).
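For anyone who wants to see what goes into one of these numbers, here is a minimal Python sketch of Pearson's correlation between two variables. It assumes that yes/no attributes (being male, being a nursing major) are coded as 1 and 0 and that a missing answer is stored as `None`; those encodings and the toy data are illustrative assumptions, not a description of the real data files. People who left either field blank are dropped from that pair, which is why each pair has its own n.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r over the rows where both values are present. Returns (r, n)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    # Undefined (division by zero) if either variable is constant.
    return sxy / sqrt(sxx * syy), n

# Invented toy data: 1 = yes, 0 = no, None = field left blank.
male    = [1, 1, 0, 0, 1, None, 0, 1]
nursing = [0, 0, 1, 1, 0, 1, None, 0]

r, n = pearson(male, nursing)
print(f"r = {r:.3f}, n = {n}")
```

In this toy sample every man is a non-nursing major and every woman is a nursing major, so the correlation is a perfect negative one; real data is nowhere near that tidy.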

Highest corr. coefficients sorted

Variable 1       Variable 2        Corr.
School year      Birthdate         0.829
Phys. Ed         Health Ed.        0.296
Biology          Chemistry         0.274
Comp. Sci.       Mathematics       0.211
Male             Liberal          -0.205
Male             OtherPoli         0.205
Male             Bus. Admin.       0.189
Accounting       Finance           0.190
Male             Nursing          -0.184
Male             Elementary Ed.   -0.182
Psychology       Social Work       0.173
Communication    Marketing         0.171
History          Secondary Ed.     0.177
Male             Psychology       -0.169
Mathematics      Physics           0.162


Most of the results are pretty obvious, as the summary table above demonstrates, but there are a few surprises in the full correlation data set. For instance: why is there a highly significant correlation between being bisexual and being a respiratory therapy major?

Anyway, I have around 13 MB of data about Salisbury University, and tons of scripts to analyze it, so if anyone has any ideas of interesting things I can do with it other than correlations and graph-visualizations, please post in the comments and I will most likely oblige.


Notes

1. Some might question the use of Pearson's correlations, given the fact that much of the data is non-parametric. However, it's fairly well known that parametric methods are often fine when the sample size is large enough. Additionally, the non-parametric correlation coefficients, such as Spearman's, were in fact larger than the Pearson's correlation coefficients. Thus, I decided to play it safe and stick with the smaller of the two.

2. I'm not going to go into too much detail here, just enough to allow someone without any knowledge of statistics to interpret the correlation data. A positive correlation means that, in general, the higher one variable is, the more likely that the other variable is higher than normal, and vice versa. Conversely, a negative correlation means that, in general, the higher one variable is, the more likely that the other variable is lower than normal, and again vice versa. The correlation coefficient is a number between -1 and 1, these two extremes representing a perfect negative correlation and a perfect positive correlation respectively. Thus, the higher (in terms of absolute value) the correlation coefficient, the stronger the correlation. The significance takes into account the number of people sampled and the correlation coefficient, and estimates the probability that a correlation at least this strong would turn up simply by chance if the two variables were actually unrelated. According to convention, a significance value of less than 0.05 is considered statistically significant, and a significance value of less than 0.01 is considered highly statistically significant. Lastly, the n that you see in the tables is simply the sample size (that is, the number of people who entered valid values for both variables).
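To make the link between the coefficient, n, and the significance concrete, here is a small Python sketch. It uses the standard test statistic t = r * sqrt((n - 2) / (1 - r^2)) and, as a simplifying assumption that is reasonable only for large samples, approximates the two-sided p-value with the normal distribution instead of the exact t distribution. The function name and the sample sizes in the example are made up; this is not necessarily how the values in the tables were produced.

```python
from math import sqrt, erfc

def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r from n paired observations.

    Requires n > 2 and -1 < r < 1. Uses a normal approximation to the
    t distribution, so it is only trustworthy when n is large.
    """
    t = r * sqrt((n - 2) / (1.0 - r * r))
    return erfc(abs(t) / sqrt(2.0))

# Hypothetical sample sizes, chosen only to show the effect of r and n.
for r, n in [(0.205, 2000), (0.02, 1000), (0.02, 100000)]:
    print(f"r = {r}, n = {n}: p = {correlation_p_value(r, n):.3g}")
```

The last line of the example shows why a huge sample can make even a tiny correlation "highly significant": significance says how unlikely the result is under pure chance, not how strong the relationship is.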

Posted by Frankie on 08 Jan 2006 @ 08:39 AM | statistics salisbury facebook graphs


Comments


your buddy ryan says:

Here is something you might enjoy Frankie:

http://www.bestbuy.com/site/olspage.jsp?type=category&id=pcmcat81900050045

Peace Rivers!

Posted by: your buddy ryan at January 27, 2006 11:55 PM



Matt says:

Hi,

I am a Ph.D. student at the University of Arizona and am in the early stages of researching and writing a dissertation that involves tracking some Facebook data. I am also taking an advanced statistics course that requires that we analyze a specific data set. Could you tell me what a 'screenscraper' is and how you used it to collect your data?

Matt

Posted by: Matt at February 6, 2006 02:36 PM



Frankie says:

Certainly, I'll email you the details.

Posted by: Frankie at February 6, 2006 05:20 PM



Pete says:

Oh wow, I definitely recall hearing the suggestive 'reticulated spines' at the start of each simcity 2000 session.
Statistics you say? Are there any mechanics involved?

Posted by: Pete at February 15, 2006 01:08 AM





