Although we could use the ISLR package, we will download the dataset into our local directory.
download.file("http://www-bcf.usc.edu/~gareth/ISL/College.csv", "College.csv")
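Load our data frame from the file. Since R 4.0, read.csv no longer converts strings to factors by default, so we request that explicitly to match the output shown here.
df <- read.csv("College.csv", stringsAsFactors = TRUE)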
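For comparison, the ISLR route mentioned at the top might look roughly like the sketch below. It assumes the ISLR package is installed, and `college_pkg` is a name made up for this example so that it does not clash with `df`.

```r
# Sketch: load the same data from the ISLR package instead of the CSV file.
# Assumes ISLR is installed; `college_pkg` is a hypothetical name for this example.
if (requireNamespace("ISLR", quietly = TRUE)) {
  college_pkg <- ISLR::College
  # The packaged data frame already uses the college names as row names,
  # so there is no extra name column to clean up.
  print(dim(college_pkg))
  print(head(rownames(college_pkg)))
} else {
  college_pkg <- NULL
  message("ISLR is not installed; try install.packages(\"ISLR\")")
}
```

The rest of this walkthrough continues with the `df` loaded from the CSV file.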
Let’s look at it:
head(df)
## X Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University Yes 1660 1232 721 23
## 2 Adelphi University Yes 2186 1924 512 16
## 3 Adrian College Yes 1428 1097 336 22
## 4 Agnes Scott College Yes 417 349 137 60
## 5 Alaska Pacific University Yes 193 146 55 16
## 6 Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1 52 2885 537 7440 3300 450 2200 70
## 2 29 2683 1227 12280 6450 750 1500 29
## 3 50 1036 99 11250 3750 400 1165 53
## 4 89 510 63 12960 5450 450 875 92
## 5 44 249 869 7560 4120 800 1500 76
## 6 62 678 41 13500 3335 500 675 67
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1 78 18.1 12 7041 60
## 2 30 12.2 16 10527 56
## 3 66 12.9 30 8735 54
## 4 97 7.7 37 19016 59
## 5 72 11.9 2 10922 15
## 6 73 9.4 11 9727 55
summary(df)
## X Private Apps
## Abilene Christian University: 1 No :212 Min. : 81
## Adelphi University : 1 Yes:565 1st Qu.: 776
## Adrian College : 1 Median : 1558
## Agnes Scott College : 1 Mean : 3002
## Alaska Pacific University : 1 3rd Qu.: 3624
## Albertson College : 1 Max. :48094
## (Other) :771
## Accept Enroll Top10perc Top25perc
## Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
## 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1110 Median : 434 Median :23.00 Median : 54.0
## Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
## 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
##
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
## Median : 1707 Median : 353.0 Median : 9990 Median :4200
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
##
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
## 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
## Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
## Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
## Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
##
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :13.60 Median :21.00 Median : 8377 Median : 65.00
## Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00
##
Uh oh. We can see that the college names were read in as a factor column (X) with one level per college. This is not especially helpful. Let’s make these into the names of the rows.
rownames(df) <- df[,1]
head(df)
## X Private Apps
## Abilene Christian University Abilene Christian University Yes 1660
## Adelphi University Adelphi University Yes 2186
## Adrian College Adrian College Yes 1428
## Agnes Scott College Agnes Scott College Yes 417
## Alaska Pacific University Alaska Pacific University Yes 193
## Albertson College Albertson College Yes 587
## Accept Enroll Top10perc Top25perc F.Undergrad
## Abilene Christian University 1232 721 23 52 2885
## Adelphi University 1924 512 16 29 2683
## Adrian College 1097 336 22 50 1036
## Agnes Scott College 349 137 60 89 510
## Alaska Pacific University 146 55 16 44 249
## Albertson College 479 158 38 62 678
## P.Undergrad Outstate Room.Board Books
## Abilene Christian University 537 7440 3300 450
## Adelphi University 1227 12280 6450 750
## Adrian College 99 11250 3750 400
## Agnes Scott College 63 12960 5450 450
## Alaska Pacific University 869 7560 4120 800
## Albertson College 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni
## Abilene Christian University 2200 70 78 18.1 12
## Adelphi University 1500 29 30 12.2 16
## Adrian College 1165 53 66 12.9 30
## Agnes Scott College 875 92 97 7.7 37
## Alaska Pacific University 1500 76 72 11.9 2
## Albertson College 675 67 73 9.4 11
## Expend Grad.Rate
## Abilene Christian University 7041 60
## Adelphi University 10527 56
## Adrian College 8735 54
## Agnes Scott College 19016 59
## Alaska Pacific University 10922 15
## Albertson College 9727 55
summary(df)
## X Private Apps
## Abilene Christian University: 1 No :212 Min. : 81
## Adelphi University : 1 Yes:565 1st Qu.: 776
## Adrian College : 1 Median : 1558
## Agnes Scott College : 1 Mean : 3002
## Alaska Pacific University : 1 3rd Qu.: 3624
## Albertson College : 1 Max. :48094
## (Other) :771
## Accept Enroll Top10perc Top25perc
## Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
## 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1110 Median : 434 Median :23.00 Median : 54.0
## Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
## 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
##
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
## Median : 1707 Median : 353.0 Median : 9990 Median :4200
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
##
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
## 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
## Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
## Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
## Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
##
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :13.60 Median :21.00 Median : 8377 Median : 65.00
## Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00
##
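As an aside, the row names can also be set at read time, which avoids carrying the name column around at all. The sketch below uses a small made-up CSV (passed through the `text` argument) in place of College.csv; the college names and numbers are invented for illustration.

```r
# Toy CSV standing in for College.csv; the values are made up for illustration.
csv_text <- "X,Private,Apps
College A,Yes,100
College B,No,250
College C,Yes,75"

# row.names = 1 tells read.csv to use the first column as the row names
# and to leave it out of the data columns.
toy <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

print(rownames(toy))  # the college names
print(colnames(toy))  # only Private and Apps; no X column
```

With the real file, the equivalent call would presumably be `read.csv("College.csv", row.names = 1)`.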
We are supposed to use the fix command. Let’s see what that is.
?fix
Okay, that was not especially helpful. Let’s run it.
fix(df)
I see. This gives me an incredibly ugly data frame editor.
Okay, let’s get rid of the unnecessary column.
df <- df[,-1]
Note that the “-1” here means that we select all of the columns except the first one. This behavior is different from Python, where -1 typically refers to the last element and -2 to the second-to-last.
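To make the negative-index behavior concrete, here is a minimal sketch on a made-up three-column data frame (`toy` and its columns are invented for this example):

```r
# Toy data frame; the column names and values are made up for illustration.
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# A negative index drops the column at that position.
without_first <- toy[, -1]
print(names(without_first))   # "b" "c"

# Dropping down to a single column returns a plain vector
# unless drop = FALSE is given.
only_c <- toy[, -c(1, 2), drop = FALSE]
print(names(only_c))          # "c"

# To get the last column, as -1 would in Python, use its position explicitly.
last_col <- toy[, ncol(toy)]
print(last_col)               # 7 8 9
```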
Let’s see this in our ugly editor again.
fix(df)
Okay, let’s look at our cleaned data frame:
summary(df)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Good, the unnecessary factor column is gone, and apparently some school has an awful graduation rate (the minimum Grad.Rate is 10).
Which one is it?
which.min(df$Grad.Rate)
## [1] 586
Who is this culprit?
df[586,]
## Private Apps Accept Enroll Top10perc Top25perc
## Texas Southern University No 4345 3245 2604 15 85
## F.Undergrad P.Undergrad Outstate Room.Board
## Texas Southern University 5584 3101 7860 3360
## Books Personal PhD Terminal S.F.Ratio
## Texas Southern University 600 1700 65 75 18.2
## perc.alumni Expend Grad.Rate
## Texas Southern University 21 3605 10
Wow, we should avoid Texas Southern University.
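The two-step lookup above (find the index, then print the row) can be collapsed into one expression on the row names. The sketch below uses a made-up data frame; it also shows that which.min reports only the first minimum, so ties need a separate comparison.

```r
# Toy data; the college names and rates are made up for illustration.
toy <- data.frame(
  Grad.Rate = c(60, 15, 54, 15),
  row.names = c("College A", "College B", "College C", "College D")
)

# which.min returns the position of the first minimum only.
idx <- which.min(toy$Grad.Rate)
lowest <- rownames(toy)[idx]
print(lowest)   # "College B"

# To list every row tied for the minimum, compare against min() instead.
tied <- rownames(toy)[toy$Grad.Rate == min(toy$Grad.Rate)]
print(tied)     # "College B" "College D"
```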
Now, let us look at a pairwise scatterplot of the first ten columns.
pairs(df[,1:10])
That was ugly. Let’s try this again.
library(GGally)
ggpairs(df[,1:10])
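If GGally is not available, the base pairs plot can be made somewhat more readable by plotting only the numeric columns and coloring the points by the factor. This is only a sketch on made-up data; the colors and point sizes are arbitrary choices, not something the original analysis used.

```r
# Toy data; the values are made up for illustration.
toy <- data.frame(
  Private = factor(c("Yes", "No", "Yes", "No", "Yes")),
  Apps    = c(100, 250, 75, 300, 120),
  Accept  = c(80, 200, 60, 210, 90),
  Enroll  = c(40, 90, 30, 100, 50)
)

# Keep only the numeric columns for the scatterplot matrix.
numeric_cols <- sapply(toy, is.numeric)

# One color per factor level; levels sort alphabetically ("No", then "Yes").
point_cols <- c("tomato", "steelblue")[as.integer(toy$Private)]

# Small filled points reduce overplotting in dense panels.
pairs(toy[, numeric_cols], col = point_cols, pch = 19, cex = 0.6)
```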