Mudit Agrawal

mudit@cs.umd.edu

Exploring factors for ranking among various universities

using Spotfire and HCE

Goal:

The goal is to study patterns in factors such as education quality, faculty reputation, publications, activity measures, program size, and funding in Computer Science departments across universities.

Problem:

Many datasets exist on graduate and undergraduate school rankings. The problem with the majority of them is that they are sorted by total points and assign a fixed weight to each variable. A PhD candidate, however, is usually more focused on some specific aspects of a program. Moreover, the current ranking of a school should also be compared with the factors on which it depends, in order to get the bigger picture.

For any classification or sorting problem, many features contribute to the result, and these features are not necessarily independent of one another. In other words, some features may depend on others, which suggests that dimensionality reduction might be possible. For example, in a character recognition problem, the length and breadth of a character may contribute more to classification than the mean. Also, the area of a character depends on its length and breadth, so it does not contribute significantly as a separate feature.

Hence, a visualization of all these aspects is what any new candidate needs, whether as a faculty member or as a student.
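The dependency described above can be checked numerically before deciding which features to keep. The following is a minimal sketch, assuming made-up character measurements; the `pearson` helper and the sample values are illustrative and are not part of the dataset used in this report.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for five characters (not from the dataset).
lengths = [4, 5, 6, 7, 8]
breadths = [2, 3, 3, 4, 5]
areas = [l * b for l, b in zip(lengths, breadths)]

r_length = pearson(lengths, areas)
r_breadth = pearson(breadths, areas)
print(f"corr(length, area)  = {r_length:.3f}")
print(f"corr(breadth, area) = {r_breadth:.3f}")
```

A correlation close to 1 indicates that area carries little information beyond length and breadth, so it is a candidate for removal.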

Dataset:

Customized rankings of graduate programs based on National Research Council data available at:

http://www.phds.org/rankings/

The top 100 academic institutions for Computer Science were taken.

Data from Science and Engineering Indicators 2004 (SOURCES: National Science Foundation, Division of Science Resources Statistics (NSF/SRS), Survey of Doctorate Recipients; and NSF/SRS, Survey of Graduate Students and Post-doctorates in Science and Engineering, unpublished tabulations.) Top 100 academic institutions in R&D expenditures (in millions of dollars), by source of funds: 2001

Figure 1 below shows a sample .xls file from our datasets. The parameters and their descriptions are as follows:

Score: Score of the university in the customized ranking.
Ed Eff: Program effectiveness in educating research scholars and scientists. Scale of 0 to 5, with 0 denoting "Not Effective" and 5 denoting "Extremely Effective". Source: National Survey of Graduate Faculty
Change: Change in program quality in the last five years. Scale of -1 to 1, with -1 denoting "Poorer than 5 years ago" and 1 denoting "Better than 5 years ago".
Years for PhD: Median time lapse from entering graduate school to receipt of Ph.D., in years. Source: Doctorate Records File
Fac Qual: Scholarly quality of program faculty. Scale of 0 to 5, with 0 denoting "Not sufficient for doctoral education" and 5 denoting "Distinguished". Source: National Survey of Graduate Faculty
% Fac Pub: Percentage of program faculty publishing in the period. Source: Institute for Scientific Information
Pub / Fac: The ratio of the total number of program publications to the number of program faculty. Source: Institute for Scientific Information
Cite / Fac: The ratio of the total number of program citations to the number of program faculty. Source: Institute for Scientific Information
Gini Pub: Gini coefficient for faculty publications. Source: Institute for Scientific Information
Gini Cite: Gini coefficient for faculty citations. Source: Institute for Scientific Information
% Full Prof: Percentage of full professors participating in the program. Source: Institutional Coordinator Response Data
# Fac: Total number of faculty participating in the program. Source: Institutional Coordinator Response Data
# Stu: The number of full and part time graduate students enrolled. Source: Institutional Coordinator Response Data
# of PhDs: The number of Ph.D.s produced by the program for the period academic year. Source: Institutional Coordinator Response Data
% Fac Supp: Percentage of program faculty with research support. Source: Federal Agencies
% RA: The percentage of Ph.D.s supported by research assistantships (as a percentage of Ph.D.s who reported their primary form of support). Source: Doctorate Records File
% TA: The percentage of Ph.D.s supported by teaching assistantships (as a percentage of Ph.D.s who reported their primary form of support). Source: Doctorate Records File
% Fem Stu: The percentage of full and part time female graduate students enrolled. Source: Institutional Coordinator Response Data
% Fem PhDs: The percentage of Ph.D.s awarded to women. Source: Doctorate Records File
% Min PhDs: The percentage of Ph.D.s known to be awarded to underrepresented minorities (only U.S. Citizens or Permanent Residents). Source: Doctorate Records File
% US PhDs: The percentage of Ph.D.s known to be awarded to U.S. Citizens and Permanent Residents. Source: Doctorate Records File
Public/Private
All sources: Funding from all sources
Federal government: Funding from Federal Government
State/local government: Funding from State/Local government
Industry: Funding from Industry
Academic institutions: Funding from Academic Institutions
All other sources: Funding from other sources

Figure 1

Result 1: Qualified Faculty increases Education Efficiency (*Using Spotfire*)

The aim was to find any relation between education efficiency (Ed Eff) and faculty qualification (Fac Qual). In a scatter plot, these two factors are closely related, suggesting that qualified faculty increases education efficiency.

The color coding (red → blue) signifies the score of the universities with the corresponding education efficiency and faculty qualification values. It is clear that as both of these values increase, the rank of the university gets better.
The size of the data-points is directly proportional to the Gini Pub value. Gini Pub is the Gini coefficient for faculty publications, a measure of how unevenly publications are distributed among the faculty. The graph shows that lower-ranked universities have a higher Gini coefficient, i.e. only a small percentage of their faculty accounts for most of the research publications.
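To make the Gini Pub measure concrete, here is a minimal sketch of one standard formula for the Gini coefficient, applied to made-up per-faculty publication counts. The exact computation behind the Gini Pub column is not documented in our dataset, so this is only an illustration of the concept; the function name and sample values are assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means the values are perfectly even; for n values the maximum is (n - 1) / n,
    reached when a single entry holds the entire total.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical publication counts per faculty member for two departments.
even_dept = [5, 5, 5, 5]      # everyone publishes equally
skewed_dept = [0, 0, 0, 20]   # one person publishes everything

print(gini(even_dept))
print(gini(skewed_dept))
```

Both departments have the same total number of publications, but the skewed one gets a much higher coefficient, which is the pattern seen for lower-ranked universities in the plot.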

“Evenly distributed Qualified Faculty increases Education Efficiency which leads to higher scores”

“Lower ranked universities have higher variation among faculty qualifications than higher ranked universities”

Result 2: Faculty to Student ratio is higher in higher-ranked universities (*Using Spotfire*)

The dataset does not directly give any relation between the number of students and the number of faculty. Hence, to see how the two together contribute to the rank of a university, a scatter plot of #Faculty against #Students was drawn. The color coding (red → blue) signifies the score of the universities.

The plot shows that for the same number of students (e.g. the 100-200 range), the score of the universities gets higher (the points become bluer) as the number of faculty increases.

“Higher Faculty to student ratio implies better universities”
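The same relation can be expressed as a derived column. The sketch below is an illustration only: the three rows are made-up values with the same number of students, and `faculty_student_ratio` is a helper named for this example.

```python
# Hypothetical rows: (university, # Fac, # Stu, Score). Not taken from the dataset.
rows = [
    ("Univ A", 20, 150, 55.0),
    ("Univ B", 45, 150, 78.0),
    ("Univ C", 30, 150, 64.0),
]

def faculty_student_ratio(n_faculty, n_students):
    """Faculty members per enrolled graduate student."""
    return n_faculty / n_students

# Sort from the highest ratio to the lowest.
ranked = sorted(rows, key=lambda r: faculty_student_ratio(r[1], r[2]), reverse=True)

for name, fac, stu, score in ranked:
    print(f"{name}: ratio={faculty_student_ratio(fac, stu):.2f} score={score}")
```

If the observation from the plot holds, sorting by this ratio should roughly reproduce the ordering by score within a band of similar student numbers.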

Result 3: Are higher-ranked universities more research-oriented? (*Using Spotfire*)

To answer this question, the number of PhDs in the CS department of each university was plotted against %TA (the percentage of those PhDs who were supported by a TAship). As is clear from the graph, %TA is much smaller for higher-ranked universities (bluer data-points) than for lower-ranked ones.

(The exception shown is Illinois Inst. of Tech., which is lower ranked yet has a low %TA among its PhDs, the reason being its very low #Fac.)

To substantiate this result, it was essential to see whether higher-ranked universities also have a higher %RA. The graph below shows that the same selected top-ranked universities have a much higher %RA (with IIT being an outlier again).

Result 4: Universities vary most in #students, %RAship, %TAship and unevenness in publications (*using HCE*)

Sometimes simple tables reveal information when combined with a few graphs. The mean and standard deviation of each factor (across all universities) were calculated, and the five most varied parameters were found to be:

Number of students
Gini Cite
%TA
Faculty Support
%RA
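A ranking like the one above can be reproduced outside HCE with a few lines of code. The sketch below is an assumption about how such a comparison might be done, not a description of what HCE computes: it ranks columns by the coefficient of variation (standard deviation divided by mean) so that columns on different scales are comparable. The table values are made up.

```python
from statistics import mean, pstdev

# Hypothetical column values for three universities (not from the dataset).
table = {
    "# Stu": [50, 100, 300],
    "% TA": [10, 30, 50],
    "Fac Qual": [3.0, 3.5, 4.0],
}

def coefficient_of_variation(values):
    """Population standard deviation relative to the mean (scale-free spread)."""
    m = mean(values)
    return pstdev(values) / abs(m) if m else float("inf")

# Columns ordered from most varied to least varied.
by_variation = sorted(
    table, key=lambda col: coefficient_of_variation(table[col]), reverse=True
)

for col in by_variation:
    print(f"{col}: mean={mean(table[col]):.2f} cv={coefficient_of_variation(table[col]):.2f}")
```

The coefficient of variation is unreliable for columns whose mean is close to zero (such as Change, which ranges from -1 to 1); for those, a min-max normalized standard deviation would be a safer choice.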

Also shown are two scatter plots: one shows the proportional relationship between faculty qualification and publications per faculty, and the other shows Gini Pub plotted in clustering order. Note again that the universities with a lower Gini Pub are the ones with higher faculty qualification and Pub / Fac (shown as the selected data-items in both plots).

Result 5: Consistency is the key to be at the top (*using HCE*)

A dendrogram view of all parameters (on parallel axes) was shown for all universities. When flooring was applied to look only at the top few universities, a clear variation from the silhouette is visible, showing that the top-ranked universities are consistently good in all measured aspects instead of topping only one or two factors. The selected item is the rank 1 university.

Result 6: Private schools, though fewer in number, have more federal or other funding (on average) than public ones! (*Using Spotfire*)

When federal funds and R&D expenditure were plotted against ranking, a linear relationship appeared. However, it is noteworthy that for a given range of ranks, private schools have more R&D expenditure than public ones.

The pie chart shows that only 29% of the schools are private and that cumulatively they contribute much less R&D expenditure (top-right bar graph), but on average private schools have more R&D expenditure than public ones (bottom-left bar graph).
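The difference between the cumulative and the average view is easy to verify by grouping. The following sketch uses made-up expenditure figures purely to illustrate how a smaller group can have a lower total but a higher mean; the numbers and variable names are assumptions.

```python
from collections import defaultdict

# Hypothetical (control, R&D expenditure in millions of dollars) pairs.
schools = [
    ("public", 100), ("public", 120), ("public", 80),
    ("public", 90), ("public", 110),
    ("private", 150), ("private", 170),
]

totals = defaultdict(float)
counts = defaultdict(int)
for control, spend in schools:
    totals[control] += spend
    counts[control] += 1

# Average expenditure per school within each group.
means = {control: totals[control] / counts[control] for control in totals}

for control in totals:
    print(f"{control}: n={counts[control]} total={totals[control]:.0f} mean={means[control]:.0f}")
```

Here the private group has the smaller total simply because it has fewer members, while its per-school mean is higher, which mirrors the two bar graphs.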

Conclusion:

Various relations between the factors contributing to the score of a university were studied using Spotfire and the HCE visualization tool. The ease with which we can play with the data and visualize it using different colors and sizes of data-points is remarkable. Even a single scatter plot can convey a lot of information. The dynamic linking of graphs (as in Results 4 and 6) gives a clear idea of how each university (or data-item) varies from different perspectives. Like any product, Spotfire and HCE have some room for improvement.

Spotfire:

The right-click options to sort, arrange, etc. apply to all queries, not only to the query that was clicked. This is a bit confusing, as the other parameters are specific to that query/parameter.
To deselect a selected data-item, clicking on an empty space doesn’t work. You have to select an empty area.
The naming convention for newly added/calculated columns is not straightforward.

HCE:

The ceiling and floor limits can only be moved up to the point where they meet each other. This prevents selecting both the high-valued and the low-valued data-items while ignoring the middle ones.

Nevertheless, both tools give the user a lot of flexibility to perform complex tasks with neat visualization. E.g. k-means clustering on the school data, shown below, groups the schools into 9 (user-defined) clusters quite well. Both tools are a great help in the area of information visualization and are a boon to researchers who want to better understand their data.
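For readers who want to reproduce the clustering mentioned above outside HCE, the following is a minimal sketch of plain k-means (Lloyd's algorithm). It is not HCE's implementation: the seeding rule (the first k points), the iteration count, and the sample (Fac Qual, Ed Eff) pairs are all assumptions made for this illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical (Fac Qual, Ed Eff) pairs for six schools (not from the dataset).
schools = [(4.8, 4.6), (2.1, 2.3), (4.7, 4.5), (2.0, 2.2), (4.9, 4.7), (2.2, 2.1)]
labels, centroids = kmeans(schools, k=2)
print(labels)
```

With real data the columns should be normalized first, since a column with a large range such as # Stu would otherwise dominate the distance computation.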