Specific requirements

In this programming project, complete the following steps to implement one of the unsupervised learning algorithms and use it to group the observations in a data set:

  • Review theoretical background of cluster analysis.
  • Select a data set and load it into R (see the R data repository: http://vincentarelbundock.github.io/Rdatasets/datasets.html).
  • Perform cluster or principal component analysis of the data.
  • Interpret and discuss the results.
  • Submit a *.html file generated by R Markdown.
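
The analysis steps listed above can be sketched end to end on a small example. This is only a minimal sketch: it assumes the built-in USArrests data set (not the data set analyzed in this report) and an arbitrary choice of three clusters.

```r
# Minimal workflow sketch on a built-in data set (an assumption made for
# illustration; the report itself uses RailsTrails.csv).
data(USArrests)

# Standardize so that each variable has mean 0 and standard deviation 1
x <- scale(USArrests)

# Principal component analysis on the standardized data
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# K-means with k = 3 (arbitrary for this sketch); nstart repeats the
# random initialization and keeps the best solution
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)

print(round(var_explained, 3))  # share of variance per component
print(table(km$cluster))        # number of observations per cluster
```

The share of variance per component and the cluster sizes are the usual starting points for the interpretation step.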

Data frame information

Name

Homes in Northampton MA Near Rail Trails.

Description

Sample of homes in Northampton, MA to see whether being close to a bike trail enhances the value of the home.

Format

A data frame with 104 observations on the following 30 variables.

HouseNum

Unique house number

Acre

Lot size for the house (in acres)

AcreGroup

Lot size groups (<= 1/4 acre or > 1/4 acre)

Adj1998

Estimated 1998 price (in thousands of 2014 dollars)

Adj2007

Estimated 2007 price (in thousands of 2014 dollars)

Adj2011

Estimated 2011 price (in thousands of 2014 dollars)

BedGroup

Bedroom groups (1-2 beds, 3 beds, or 4+ beds)

Bedrooms

Number of bedrooms

BikeScore

Bike friendliness (0-100 score, higher scores are better)

Diff2014

Difference in price between 2014 estimate and adjusted 1998 estimate (in thousands of dollars)

Distance

Distance to the nearest entry point to the rail trail network (documented as feet, although the recorded values, roughly 0.04 to 3.98, are consistent with miles)

DistGroup

Distance groups, compared to 1/2 mile (Closer or Farther Away)

GarageSpaces

Number of garage spaces (0-4)

GarageGroup

Any garage spaces? (no or yes)

Latitude

Latitude (for mapping)

Longitude

Longitude (for mapping)

NumFullBaths

Number of full baths (includes shower or bathtub)

NumHalfBaths

Number of half baths (no shower or bathtub)

NumRooms

Number of rooms

PctChange

Percentage change from adjusted 1998 price to 2014 (value of zero means no change)

Price1998

Zillow 10 year estimate from 2008 (in thousands of dollars)

Price2007

Zillow price estimate from 2007 (in thousands of dollars)

Price2011

Zillow price estimate from 2011 (in thousands of dollars)

Price2014

Zillow price estimate from 2014 (in thousands of dollars)

SFGroup

SquareFeet group (<= 1500 sf or > 1500 sf)

SquareFeet

Square footage of interior finished space (in thousands of sf)

StreetName

Street name

StreetNum

House number on street

WalkScore

Walk friendliness (0-100 score, higher scores are better)

Zip

Location (1060 = Northampton or 1062 = Florence)

Detail

This dataset comprises 104 homes in Northampton, MA that were sold in 2007. The authors measured the shortest distance from each home to a rail trail along streets and pathways with Google Maps and recorded the Zillow.com estimate of each home’s price in 1998 and 2011. Additional attributes such as square footage, number of bedrooms and number of bathrooms are available from a realty database from 2007. We divide the houses into two groups based on distance to the trail (DistGroup).

Source

From July 2015 JSE Datasets and Stories: “Rail Trails and Property Values: Is There an Association?”, Ella Hartenian, Smith College and Nicholas J. Horton, Amherst College. http://www.amstat.org/publications/jse/v23n2/horton.pdf

Include the knitr package for integration of R code into Markdown

knitr::opts_chunk$set(echo = TRUE)

All the libraries used in this code

library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(factoextra)
library(ggfortify)
library(stats)
library(cluster)
library(psych)
library(devtools)
library(ggbiplot)
library(ggalt)
library(ggforce)
library(ggplot2)

Import data

input_data <- read_csv("RailsTrails.csv")
## Warning: Missing column names filled in: 'X1' [1]

Numeric/character field separator

# Drop the first column: the unnamed row index that read_csv filled in as 'X1'
input_data1 <- input_data[, -1]
num.names <- input_data1 %>% select_if(is.numeric) %>% colnames()
ch.names <- input_data1 %>% select_if(is.character) %>% colnames()
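
The same split can be done in base R without dplyr. The following is a minimal sketch; the data frame `toy` and its values are hypothetical and only stand in for the real data.

```r
# Toy data frame (hypothetical values, not taken from RailsTrails.csv)
toy <- data.frame(
  Acre      = c(0.28, 0.29, 0.36),
  DistGroup = c("Farther Away", "Farther Away", "Closer"),
  Bedrooms  = c(3, 3, 4),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
is_num <- vapply(toy, is.numeric, logical(1))

num_cols <- names(toy)[is_num]   # names of the numeric columns
chr_cols <- names(toy)[!is_num]  # names of the remaining columns

print(num_cols)
print(chr_cols)
```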

Descriptive statistics before data processing

Dimension of data frame

dim(input_data1)
## [1] 104  30

Structure of data frame

str(input_data1)
## tibble [104 x 30] (S3: tbl_df/tbl/data.frame)
##  $ HouseNum    : num [1:104] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Acre        : num [1:104] 0.28 0.29 0.36 0.26 0.31 0.31 0.08 0.11 0.31 0.27 ...
##  $ AcreGroup   : chr [1:104] "> 1/4 acre" "> 1/4 acre" "> 1/4 acre" "> 1/4 acre" ...
##  $ Adj1998     : num [1:104] 148 135 257 232 272 ...
##  $ Adj2007     : num [1:104] 234 261 401 305 299 ...
##  $ Adj2011     : num [1:104] 192 207 348 257 237 ...
##  $ BedGroup    : chr [1:104] "3 beds" "3 beds" "3 beds" "3 beds" ...
##  $ Bedrooms    : num [1:104] 3 3 3 3 4 3 3 4 5 3 ...
##  $ BikeScore   : num [1:104] 35 44 66 61 53 36 97 95 38 30 ...
##  $ Diff2014    : num [1:104] 62.4 69 82.1 44.6 -102.7 ...
##  $ Distance    : num [1:104] 2.4 1.97 0.0434 0.5547 0.5966 ...
##  $ DistGroup   : chr [1:104] "Farther Away" "Farther Away" "Closer" "Farther Away" ...
##  $ GarageSpaces: num [1:104] 2 1 2 1 0 1 0 0 0 0 ...
##  $ GarageGroup : chr [1:104] "yes" "yes" "yes" "yes" ...
##  $ Latitude    : num [1:104] 42.3 42.3 42.3 42.3 42.3 ...
##  $ Longitude   : num [1:104] -72.7 -72.7 -72.7 -72.7 -72.7 ...
##  $ NumFullBaths: num [1:104] 1 1 2 1 1 1 1 1 2 1 ...
##  $ NumHalfBaths: num [1:104] 0 0 1 1 0 1 0 1 0 0 ...
##  $ NumRooms    : num [1:104] 5 5 7 6 6 6 6 9 7 5 ...
##  $ PctChange   : num [1:104] 42 51 32 19.2 -37.8 ...
##  $ Price1998   : num [1:104] 101.5 92.5 175.5 158.5 186 ...
##  $ Price2007   : num [1:104] 204 228 349 266 260 ...
##  $ Price2011   : num [1:104] 181 195 328 243 223 ...
##  $ Price2014   : num [1:104] 211 204 339 276 169 ...
##  $ SFGroup     : chr [1:104] "<= 1500 sf" "<= 1500 sf" "> 1500 sf" "> 1500 sf" ...
##  $ SquareFeet  : num [1:104] 0.966 0.96 1.725 1.727 1.576 ...
##  $ StreetName  : chr [1:104] "Acrebrook Drive" "Autumn Dr" "Bridge Road" "Bridge Road" ...
##  $ StreetNum   : num [1:104] 406 57 31 200 395 ...
##  $ WalkScore   : num [1:104] 9 5 46 40 32 12 82 88 15 9 ...
##  $ Zip         : num [1:104] 1062 1062 1062 1060 1062 ...

Summary statistics of data frame

summary(input_data1)
##     HouseNum           Acre         AcreGroup            Adj1998      
##  Min.   :  1.00   Min.   :0.0500   Length:104         Min.   : 60.66  
##  1st Qu.: 26.75   1st Qu.:0.1675   Class :character   1st Qu.:167.00  
##  Median : 52.50   Median :0.2500   Mode  :character   Median :200.62  
##  Mean   : 52.50   Mean   :0.2574                      Mean   :208.57  
##  3rd Qu.: 78.25   3rd Qu.:0.3300                      3rd Qu.:228.39  
##  Max.   :104.00   Max.   :0.5600                      Max.   :470.67  
##     Adj2007         Adj2011        BedGroup            Bedrooms   
##  Min.   :162.6   Min.   :141.7   Length:104         Min.   :1.00  
##  1st Qu.:260.6   1st Qu.:215.8   Class :character   1st Qu.:3.00  
##  Median :303.6   Median :258.9   Mode  :character   Median :3.00  
##  Mean   :327.6   Mean   :284.5                      Mean   :3.25  
##  3rd Qu.:349.5   3rd Qu.:325.1                      3rd Qu.:4.00  
##  Max.   :798.6   Max.   :698.5                      Max.   :6.00  
##    BikeScore        Diff2014          Distance        DistGroup        
##  Min.   :18.00   Min.   :-199.87   Min.   :0.03883   Length:104        
##  1st Qu.:36.00   1st Qu.:  44.33   1st Qu.:0.32879   Class :character  
##  Median :54.50   Median :  71.39   Median :0.76042   Mode  :character  
##  Mean   :57.28   Mean   :  84.53   Mean   :1.11432                     
##  3rd Qu.:77.25   3rd Qu.: 106.87   3rd Qu.:1.89579                     
##  Max.   :97.00   Max.   : 497.82   Max.   :3.97678                     
##   GarageSpaces    GarageGroup           Latitude       Longitude     
##  Min.   :0.0000   Length:104         Min.   :42.30   Min.   :-72.73  
##  1st Qu.:0.0000   Class :character   1st Qu.:42.32   1st Qu.:-72.68  
##  Median :1.0000   Mode  :character   Median :42.32   Median :-72.66  
##  Mean   :0.7596                      Mean   :42.33   Mean   :-72.66  
##  3rd Qu.:1.0000                      3rd Qu.:42.33   3rd Qu.:-72.64  
##  Max.   :4.0000                      Max.   :42.35   Max.   :-72.61  
##   NumFullBaths    NumHalfBaths       NumRooms        PctChange     
##  Min.   :1.000   Min.   :0.0000   Min.   : 4.000   Min.   :-46.75  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 5.000   1st Qu.: 26.54  
##  Median :1.000   Median :0.0000   Median : 6.500   Median : 37.61  
##  Mean   :1.452   Mean   :0.2212   Mean   : 6.615   Mean   : 42.20  
##  3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.: 7.250   3rd Qu.: 51.17  
##  Max.   :4.000   Max.   :1.0000   Max.   :14.000   Max.   :130.49  
##    Price1998       Price2007       Price2011       Price2014    
##  Min.   : 41.5   Min.   :141.5   Min.   :133.8   Min.   :132.1  
##  1st Qu.:114.2   1st Qu.:226.8   1st Qu.:203.8   1st Qu.:212.9  
##  Median :137.2   Median :264.2   Median :244.4   Median :272.9  
##  Mean   :142.7   Mean   :285.1   Mean   :268.6   Mean   :293.1  
##  3rd Qu.:156.2   3rd Qu.:304.1   3rd Qu.:306.9   3rd Qu.:334.2  
##  Max.   :322.0   Max.   :695.0   Max.   :659.5   Max.   :879.3  
##    SFGroup            SquareFeet     StreetName          StreetNum     
##  Length:104         Min.   :0.524   Length:104         Min.   :   1.0  
##  Class :character   1st Qu.:1.206   Class :character   1st Qu.:  27.0  
##  Mode  :character   Median :1.516   Mode  :character   Median :  63.5  
##                     Mean   :1.566                      Mean   : 137.8  
##                     3rd Qu.:1.832                      3rd Qu.: 155.0  
##                     Max.   :4.030                      Max.   :1086.0  
##    WalkScore          Zip      
##  Min.   : 2.00   Min.   :1060  
##  1st Qu.:14.75   1st Qu.:1060  
##  Median :36.00   Median :1062  
##  Mean   :38.88   Mean   :1061  
##  3rd Qu.:60.75   3rd Qu.:1062  
##  Max.   :94.00   Max.   :1062

Glimpse of data frame

glimpse(input_data1)
## Rows: 104
## Columns: 30
## $ HouseNum     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Acre         <dbl> 0.28, 0.29, 0.36, 0.26, 0.31, 0.31, 0.08, 0.11, 0.31, ...
## $ AcreGroup    <chr> "> 1/4 acre", "> 1/4 acre", "> 1/4 acre", "> 1/4 acre"...
## $ Adj1998      <dbl> 148.3625, 135.2072, 256.5283, 231.6795, 271.8762, 192....
## $ Adj2007      <dbl> 233.8418, 261.4203, 401.0359, 305.0861, 298.7660, 275....
## $ Adj2011      <dbl> 191.8211, 206.9677, 347.9472, 257.4915, 236.6253, 243....
## $ BedGroup     <chr> "3 beds", "3 beds", "3 beds", "3 beds", "4+ beds", "3 ...
## $ Bedrooms     <dbl> 3, 3, 3, 3, 4, 3, 3, 4, 5, 3, 3, 3, 4, 3, 4, 3, 5, 3, ...
## $ BikeScore    <dbl> 35, 44, 66, 61, 53, 36, 97, 95, 38, 30, 46, 38, 79, 85...
## $ Diff2014     <dbl> 62.36645, 68.96375, 82.13365, 44.57055, -102.70320, 18...
## $ Distance     <dbl> 2.40000000, 1.97000000, 0.04337121, 0.55473485, 0.5965...
## $ DistGroup    <chr> "Farther Away", "Farther Away", "Closer", "Farther Awa...
## $ GarageSpaces <dbl> 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, ...
## $ GarageGroup  <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "no", "...
## $ Latitude     <dbl> 42.31533, 42.29856, 42.34379, 42.34446, 42.34253, 42.3...
## $ Longitude    <dbl> -72.69397, -72.67474, -72.68023, -72.67221, -72.66437,...
## $ NumFullBaths <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, ...
## $ NumHalfBaths <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ NumRooms     <dbl> 5, 5, 7, 6, 6, 6, 6, 9, 7, 5, 5, 7, 8, 6, 9, 6, 7, 5, ...
## $ PctChange    <dbl> 42.036518, 51.005956, 32.017377, 19.238025, -37.775723...
## $ Price1998    <dbl> 101.5, 92.5, 175.5, 158.5, 186.0, 132.0, 117.0, 146.0,...
## $ Price2007    <dbl> 203.5, 227.5, 349.0, 265.5, 260.0, 240.0, 264.5, 331.5...
## $ Price2011    <dbl> 181.1, 195.4, 328.5, 243.1, 223.4, 229.7, 281.3, 357.6...
## $ Price2014    <dbl> 210.729, 204.171, 338.662, 276.250, 169.173, 211.487, ...
## $ SFGroup      <chr> "<= 1500 sf", "<= 1500 sf", "> 1500 sf", "> 1500 sf", ...
## $ SquareFeet   <dbl> 0.966, 0.960, 1.725, 1.727, 1.576, 1.320, 1.202, 2.136...
## $ StreetName   <chr> "Acrebrook Drive", "Autumn Dr", "Bridge Road", "Bridge...
## $ StreetNum    <dbl> 406, 57, 31, 200, 395, 23, 18, 23, 497, 1086, 14, 21, ...
## $ WalkScore    <dbl> 9, 5, 46, 40, 32, 12, 82, 88, 15, 9, 35, 20, 68, 65, 6...
## $ Zip          <dbl> 1062, 1062, 1062, 1060, 1062, 1062, 1060, 1060, 1062, ...

Head of data frame

head(input_data1)
## # A tibble: 6 x 30
##   HouseNum  Acre AcreGroup Adj1998 Adj2007 Adj2011 BedGroup Bedrooms BikeScore
##      <dbl> <dbl> <chr>       <dbl>   <dbl>   <dbl> <chr>       <dbl>     <dbl>
## 1        1 0.28  > 1/4 ac~    148.    234.    192. 3 beds          3        35
## 2        2 0.290 > 1/4 ac~    135.    261.    207. 3 beds          3        44
## 3        3 0.36  > 1/4 ac~    257.    401.    348. 3 beds          3        66
## 4        4 0.26  > 1/4 ac~    232.    305.    257. 3 beds          3        61
## 5        5 0.31  > 1/4 ac~    272.    299.    237. 4+ beds         4        53
## 6        6 0.31  > 1/4 ac~    193.    276.    243. 3 beds          3        36
## # ... with 21 more variables: Diff2014 <dbl>, Distance <dbl>, DistGroup <chr>,
## #   GarageSpaces <dbl>, GarageGroup <chr>, Latitude <dbl>, Longitude <dbl>,
## #   NumFullBaths <dbl>, NumHalfBaths <dbl>, NumRooms <dbl>, PctChange <dbl>,
## #   Price1998 <dbl>, Price2007 <dbl>, Price2011 <dbl>, Price2014 <dbl>,
## #   SFGroup <chr>, SquareFeet <dbl>, StreetName <chr>, StreetNum <dbl>,
## #   WalkScore <dbl>, Zip <dbl>

Tail of data frame

tail(input_data1)
## # A tibble: 6 x 30
##   HouseNum  Acre AcreGroup Adj1998 Adj2007 Adj2011 BedGroup Bedrooms BikeScore
##      <dbl> <dbl> <chr>       <dbl>   <dbl>   <dbl> <chr>       <dbl>     <dbl>
## 1       99  0.19 <= 1/4 a~    428.    514.    406. 3 beds          3        90
## 2      100  0.13 <= 1/4 a~    219.    324.    336. 3 beds          3        73
## 3      101  0.46 > 1/4 ac~    324.    517.    506. 4+ beds         4        60
## 4      102  0.4  > 1/4 ac~    222.    329.    272. 4+ beds         4        78
## 5      103  0.2  <= 1/4 a~    219.    330.    265. 4+ beds         4        80
## 6      104  0.31 > 1/4 ac~    175.    273.    237. 1-2 beds        2        47
## # ... with 21 more variables: Diff2014 <dbl>, Distance <dbl>, DistGroup <chr>,
## #   GarageSpaces <dbl>, GarageGroup <chr>, Latitude <dbl>, Longitude <dbl>,
## #   NumFullBaths <dbl>, NumHalfBaths <dbl>, NumRooms <dbl>, PctChange <dbl>,
## #   Price1998 <dbl>, Price2007 <dbl>, Price2011 <dbl>, Price2014 <dbl>,
## #   SFGroup <chr>, SquareFeet <dbl>, StreetName <chr>, StreetNum <dbl>,
## #   WalkScore <dbl>, Zip <dbl>

Variable data types

sapply(input_data1,mode)
##     HouseNum         Acre    AcreGroup      Adj1998      Adj2007      Adj2011 
##    "numeric"    "numeric"  "character"    "numeric"    "numeric"    "numeric" 
##     BedGroup     Bedrooms    BikeScore     Diff2014     Distance    DistGroup 
##  "character"    "numeric"    "numeric"    "numeric"    "numeric"  "character" 
## GarageSpaces  GarageGroup     Latitude    Longitude NumFullBaths NumHalfBaths 
##    "numeric"  "character"    "numeric"    "numeric"    "numeric"    "numeric" 
##     NumRooms    PctChange    Price1998    Price2007    Price2011    Price2014 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##      SFGroup   SquareFeet   StreetName    StreetNum    WalkScore          Zip 
##  "character"    "numeric"  "character"    "numeric"    "numeric"    "numeric"

Mean of each variable

lapply(input_data1[,num.names],mean)
## $HouseNum
## [1] 52.5
## 
## $Acre
## [1] 0.2574038
## 
## $Adj1998
## [1] 208.5663
## 
## $Adj2007
## [1] 327.5543
## 
## $Adj2011
## [1] 284.5266
## 
## $Bedrooms
## [1] 3.25
## 
## $BikeScore
## [1] 57.27885
## 
## $Diff2014
## [1] 84.52769
## 
## $Distance
## [1] 1.114324
## 
## $GarageSpaces
## [1] 0.7596154
## 
## $Latitude
## [1] 42.32622
## 
## $Longitude
## [1] -72.6623
## 
## $NumFullBaths
## [1] 1.451923
## 
## $NumHalfBaths
## [1] 0.2211538
## 
## $NumRooms
## [1] 6.615385
## 
## $PctChange
## [1] 42.2015
## 
## $Price1998
## [1] 142.6875
## 
## $Price2007
## [1] 285.0529
## 
## $Price2011
## [1] 268.624
## 
## $Price2014
## [1] 293.094
## 
## $SquareFeet
## [1] 1.566404
## 
## $StreetNum
## [1] 137.8365
## 
## $WalkScore
## [1] 38.875
## 
## $Zip
## [1] 1061.173

Median of each variable

lapply(input_data1[,num.names],median)
## $HouseNum
## [1] 52.5
## 
## $Acre
## [1] 0.25
## 
## $Adj1998
## [1] 200.6183
## 
## $Adj2007
## [1] 303.6497
## 
## $Adj2011
## [1] 258.8685
## 
## $Bedrooms
## [1] 3
## 
## $BikeScore
## [1] 54.5
## 
## $Diff2014
## [1] 71.38875
## 
## $Distance
## [1] 0.7604167
## 
## $GarageSpaces
## [1] 1
## 
## $Latitude
## [1] 42.32428
## 
## $Longitude
## [1] -72.66413
## 
## $NumFullBaths
## [1] 1
## 
## $NumHalfBaths
## [1] 0
## 
## $NumRooms
## [1] 6.5
## 
## $PctChange
## [1] 37.60517
## 
## $Price1998
## [1] 137.25
## 
## $Price2007
## [1] 264.25
## 
## $Price2011
## [1] 244.4
## 
## $Price2014
## [1] 272.92
## 
## $SquareFeet
## [1] 1.5155
## 
## $StreetNum
## [1] 63.5
## 
## $WalkScore
## [1] 36
## 
## $Zip
## [1] 1062

Minimum value of each variable

lapply(input_data1[,num.names],min)
## $HouseNum
## [1] 1
## 
## $Acre
## [1] 0.05
## 
## $Adj1998
## [1] 60.66055
## 
## $Adj2007
## [1] 162.5976
## 
## $Adj2011
## [1] 141.721
## 
## $Bedrooms
## [1] 1
## 
## $BikeScore
## [1] 18
## 
## $Diff2014
## [1] -199.8663
## 
## $Distance
## [1] 0.03882576
## 
## $GarageSpaces
## [1] 0
## 
## $Latitude
## [1] 42.29856
## 
## $Longitude
## [1] -72.7288
## 
## $NumFullBaths
## [1] 1
## 
## $NumHalfBaths
## [1] 0
## 
## $NumRooms
## [1] 4
## 
## $PctChange
## [1] -46.74717
## 
## $Price1998
## [1] 41.5
## 
## $Price2007
## [1] 141.5
## 
## $Price2011
## [1] 133.8
## 
## $Price2014
## [1] 132.135
## 
## $SquareFeet
## [1] 0.524
## 
## $StreetNum
## [1] 1
## 
## $WalkScore
## [1] 2
## 
## $Zip
## [1] 1060

Maximum value of each variable

lapply(input_data1[,num.names],max)
## $HouseNum
## [1] 104
## 
## $Acre
## [1] 0.56
## 
## $Adj1998
## [1] 470.6674
## 
## $Adj2007
## [1] 798.6245
## 
## $Adj2011
## [1] 698.5424
## 
## $Bedrooms
## [1] 6
## 
## $BikeScore
## [1] 97
## 
## $Diff2014
## [1] 497.8243
## 
## $Distance
## [1] 3.97678
## 
## $GarageSpaces
## [1] 4
## 
## $Latitude
## [1] 42.35441
## 
## $Longitude
## [1] -72.61442
## 
## $NumFullBaths
## [1] 4
## 
## $NumHalfBaths
## [1] 1
## 
## $NumRooms
## [1] 14
## 
## $PctChange
## [1] 130.49
## 
## $Price1998
## [1] 322
## 
## $Price2007
## [1] 695
## 
## $Price2011
## [1] 659.5
## 
## $Price2014
## [1] 879.328
## 
## $SquareFeet
## [1] 4.03
## 
## $StreetNum
## [1] 1086
## 
## $WalkScore
## [1] 94
## 
## $Zip
## [1] 1062

Range of each variable

lapply(input_data1[,num.names],range)
## $HouseNum
## [1]   1 104
## 
## $Acre
## [1] 0.05 0.56
## 
## $Adj1998
## [1]  60.66055 470.66740
## 
## $Adj2007
## [1] 162.5976 798.6245
## 
## $Adj2011
## [1] 141.7210 698.5424
## 
## $Bedrooms
## [1] 1 6
## 
## $BikeScore
## [1] 18 97
## 
## $Diff2014
## [1] -199.8663  497.8243
## 
## $Distance
## [1] 0.03882576 3.97678030
## 
## $GarageSpaces
## [1] 0 4
## 
## $Latitude
## [1] 42.29856 42.35441
## 
## $Longitude
## [1] -72.72880 -72.61442
## 
## $NumFullBaths
## [1] 1 4
## 
## $NumHalfBaths
## [1] 0 1
## 
## $NumRooms
## [1]  4 14
## 
## $PctChange
## [1] -46.74717 130.49003
## 
## $Price1998
## [1]  41.5 322.0
## 
## $Price2007
## [1] 141.5 695.0
## 
## $Price2011
## [1] 133.8 659.5
## 
## $Price2014
## [1] 132.135 879.328
## 
## $SquareFeet
## [1] 0.524 4.030
## 
## $StreetNum
## [1]    1 1086
## 
## $WalkScore
## [1]  2 94
## 
## $Zip
## [1] 1060 1062

Variance of each variable

lapply(input_data1[,num.names],var)
## $HouseNum
## [1] 910
## 
## $Acre
## [1] 0.01478057
## 
## $Adj1998
## [1] 4417.333
## 
## $Adj2007
## [1] 11021.41
## 
## $Adj2011
## [1] 8702.328
## 
## $Bedrooms
## [1] 0.8300971
## 
## $BikeScore
## [1] 514.106
## 
## $Diff2014
## [1] 5862.958
## 
## $Distance
## [1] 0.883221
## 
## $GarageSpaces
## [1] 0.7474795
## 
## $Latitude
## [1] 0.0001684309
## 
## $Longitude
## [1] 0.00055954
## 
## $NumFullBaths
## [1] 0.3860157
## 
## $NumHalfBaths
## [1] 0.1739171
## 
## $NumRooms
## [1] 2.782674
## 
## $PctChange
## [1] 912.0346
## 
## $Price1998
## [1] 2067.491
## 
## $Price2007
## [1] 8346.83
## 
## $Price2011
## [1] 7756.745
## 
## $Price2014
## [1] 12296.57
## 
## $SquareFeet
## [1] 0.3119305
## 
## $StreetNum
## [1] 40786.72
## 
## $WalkScore
## [1] 688.8483
## 
## $Zip
## [1] 0.9794623

Standard deviation of each variable

lapply(input_data1[,num.names],sd)
## $HouseNum
## [1] 30.16621
## 
## $Acre
## [1] 0.1215754
## 
## $Adj1998
## [1] 66.46302
## 
## $Adj2007
## [1] 104.9829
## 
## $Adj2011
## [1] 93.28627
## 
## $Bedrooms
## [1] 0.9110966
## 
## $BikeScore
## [1] 22.6739
## 
## $Diff2014
## [1] 76.56996
## 
## $Distance
## [1] 0.9397984
## 
## $GarageSpaces
## [1] 0.8645689
## 
## $Latitude
## [1] 0.01297809
## 
## $Longitude
## [1] 0.0236546
## 
## $NumFullBaths
## [1] 0.6213016
## 
## $NumHalfBaths
## [1] 0.4170337
## 
## $NumRooms
## [1] 1.668135
## 
## $PctChange
## [1] 30.19991
## 
## $Price1998
## [1] 45.46967
## 
## $Price2007
## [1] 91.36099
## 
## $Price2011
## [1] 88.07238
## 
## $Price2014
## [1] 110.8899
## 
## $SquareFeet
## [1] 0.5585074
## 
## $StreetNum
## [1] 201.9572
## 
## $WalkScore
## [1] 26.24592
## 
## $Zip
## [1] 0.9896779

Median absolute deviation of each variable

lapply(input_data1[,num.names],mad)
## $HouseNum
## [1] 38.5476
## 
## $Acre
## [1] 0.118608
## 
## $Adj1998
## [1] 48.21834
## 
## $Adj2007
## [1] 67.2944
## 
## $Adj2011
## [1] 68.54665
## 
## $Bedrooms
## [1] 0.7413
## 
## $BikeScore
## [1] 30.3933
## 
## $Diff2014
## [1] 45.16967
## 
## $Distance
## [1] 0.8901216
## 
## $GarageSpaces
## [1] 1.4826
## 
## $Latitude
## [1] 0.01319069
## 
## $Longitude
## [1] 0.03200859
## 
## $NumFullBaths
## [1] 0
## 
## $NumHalfBaths
## [1] 0
## 
## $NumRooms
## [1] 2.2239
## 
## $PctChange
## [1] 17.86248
## 
## $Price1998
## [1] 32.98785
## 
## $Price2007
## [1] 58.5627
## 
## $Price2011
## [1] 64.71549
## 
## $Price2014
## [1] 88.94785
## 
## $SquareFeet
## [1] 0.4655364
## 
## $StreetNum
## [1] 66.717
## 
## $WalkScore
## [1] 34.0998
## 
## $Zip
## [1] 0
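
The separate lapply calls above can also be collected into a single table with one row per variable. The sketch below uses a hypothetical data frame `toy`. It also shows why several variables above have a median absolute deviation of exactly 0 or 1.4826: by default, mad() multiplies the median of the absolute deviations by the constant 1.4826.

```r
# Toy numeric data (hypothetical values, for illustration only)
toy <- data.frame(
  Bedrooms     = c(2, 3, 3, 4, 5),
  NumFullBaths = c(1, 1, 1, 1, 2)
)

# One row per variable, one column per statistic
stats_table <- t(sapply(toy, function(x) c(
  mean   = mean(x),
  median = median(x),
  min    = min(x),
  max    = max(x),
  var    = var(x),
  sd     = sd(x),
  mad    = mad(x)  # median(|x - median(x)|) * 1.4826 by default
)))

print(round(stats_table, 4))
```

In this toy example, more than half of the NumFullBaths values equal the median, so its MAD is 0 even though its standard deviation is not.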

Data frame plots before data processing

Observation

The variables are on very different scales. For example, StreetNum has a variance of about 40,787, while Latitude has a variance of about 0.00017.

Box plots

boxplot(input_data1[,num.names])

Histograms

input_data1[,num.names] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Data Processing

Sorting data set

Explanation

Sorting is useful for examining the data values. In a sorted data set, missing or corrupted values are easier to spot.

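# Sort by the first column (HouseNum); `$` extracts a plain vector for order(),
# whereas input_data1[,1] on a tibble returns a one-column tibble
input_data1 <- input_data1[order(input_data1$HouseNum), ]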

Removing data records with missing values

Explanation

Missing values can cause errors or inaccuracies when calculating data limits, central tendency, dispersion, correlation, multicollinearity, p-values, z-scores, variance inflation factors, etc. Cluster analysis also relies on Euclidean distances, which cannot be computed for records with missing values, so those records are removed. Replacing the missing values with 0 instead would distort the distances and therefore the cluster analysis result.

# input_data1 <- input_data1%>% mutate_all(funs(replace_na(.,0)))
input_data1 <- na.omit(input_data1)
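
The effect of the two options can be seen on a small example. This is a minimal sketch with a hypothetical data frame `toy`; it counts missing values per column and compares dropping incomplete rows with replacing NA by 0.

```r
# Toy data with one missing value (hypothetical values)
toy <- data.frame(
  Price    = c(200, 250, NA, 300),
  Bedrooms = c(3, 3, 4, 4)
)

# Count missing values per column before deciding how to handle them
na_counts <- colSums(is.na(toy))

# Option 1: drop incomplete rows (the approach used in this report)
toy_complete <- na.omit(toy)

# Option 2: replace NA with 0 (shown only to illustrate the distortion)
toy_zero <- toy
toy_zero$Price[is.na(toy_zero$Price)] <- 0

print(na_counts)
print(mean(toy_complete$Price))  # mean of the observed prices
print(mean(toy_zero$Price))      # pulled down by the artificial zero
```

Running colSums(is.na(...)) on the real data first shows whether na.omit() removes any rows at all.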

Standardization of numeric variables

Explanation

Standardization is necessary for distance-based cluster analysis such as K-Means. Without it, variables measured on large scales dominate the Euclidean distances between data points.

# m <- apply(input_data1[,num.names],2,mean)
# s <- apply(input_data1[,num.names],2,sd)
# input_data1_std <- as.data.frame(scale(input_data1[,num.names],m,s))

input_data1_std <- as.data.frame(lapply(input_data1, function(x) if(is.numeric(x)){
  (x-mean(x))/sd(x)
} else x))
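
The following minimal sketch illustrates the z-score transformation on a hypothetical two-column data frame `toy`. It checks that the manual formula agrees with the built-in scale() function and compares Euclidean distances before and after standardization.

```r
# Toy data: price in thousands of dollars, lot size in acres (hypothetical)
toy <- data.frame(
  Price = c(200, 210, 400),
  Acre  = c(0.10, 0.50, 0.12)
)

# Manual z-score, following the same formula as the code above
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Built-in equivalent: scale() subtracts the mean and divides by sd
z_scale <- as.data.frame(scale(toy))

# Euclidean distance matrices before and after standardization
d_raw <- as.matrix(dist(toy))
d_std <- as.matrix(dist(z_manual))

print(max(abs(as.matrix(z_manual) - as.matrix(z_scale))))  # numerically zero
print(round(d_raw, 2))  # dominated by the Price column
print(round(d_std, 2))  # both variables contribute
```

In the raw data, house 1 looks far closer to house 2 than to house 3, because only the price difference matters. After standardization, the large lot-size difference between houses 1 and 2 carries as much weight as the price difference between houses 1 and 3.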

Data frame plots after data processing

Observation

After standardization, every numeric variable has mean 0 and standard deviation 1, so all values are on a comparable scale.

Box plots

boxplot(input_data1_std[,num.names])

Histograms

input_data1_std[,num.names] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Descriptive statistics after data processing

Dimension of data frame

dim(input_data1_std)
## [1] 104  30

Structure of data frame

str(input_data1_std)
## 'data.frame':    104 obs. of  30 variables:
##  $ HouseNum    : num  -1.71 -1.67 -1.64 -1.61 -1.57 ...
##  $ Acre        : num  0.1859 0.2681 0.8439 0.0214 0.4326 ...
##  $ AcreGroup   : Factor w/ 2 levels "<= 1/4 acre",..: 2 2 2 2 2 2 1 1 2 2 ...
##  $ Adj1998     : num  -0.906 -1.104 0.722 0.348 0.953 ...
##  $ Adj2007     : num  -0.893 -0.63 0.7 -0.214 -0.274 ...
##  $ Adj2011     : num  -0.994 -0.831 0.68 -0.29 -0.513 ...
##  $ BedGroup    : Factor w/ 3 levels "1-2 beds","3 beds",..: 2 2 2 2 3 2 2 3 3 2 ...
##  $ Bedrooms    : num  -0.274 -0.274 -0.274 -0.274 0.823 ...
##  $ BikeScore   : num  -0.983 -0.586 0.385 0.164 -0.189 ...
##  $ Diff2014    : num  -0.2894 -0.2033 -0.0313 -0.5218 -2.4452 ...
##  $ Distance    : num  1.368 0.91 -1.14 -0.595 -0.551 ...
##  $ DistGroup   : Factor w/ 2 levels "Closer","Farther Away": 2 2 1 2 2 2 1 1 2 2 ...
##  $ GarageSpaces: num  1.435 0.278 1.435 0.278 -0.879 ...
##  $ GarageGroup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 1 1 1 ...
##  $ Latitude    : num  -0.839 -2.132 1.353 1.405 1.256 ...
##  $ Longitude   : num  -1.3389 -0.526 -0.7582 -0.4188 -0.0875 ...
##  $ NumFullBaths: num  -0.727 -0.727 0.882 -0.727 -0.727 ...
##  $ NumHalfBaths: num  -0.53 -0.53 1.87 1.87 -0.53 ...
##  $ NumRooms    : num  -0.968 -0.968 0.231 -0.369 -0.369 ...
##  $ PctChange   : num  -0.00546 0.29154 -0.33722 -0.76038 -2.64826 ...
##  $ Price1998   : num  -0.906 -1.104 0.722 0.348 0.953 ...
##  $ Price2007   : num  -0.893 -0.63 0.7 -0.214 -0.274 ...
##  $ Price2011   : num  -0.994 -0.831 0.68 -0.29 -0.513 ...
##  $ Price2014   : num  -0.743 -0.802 0.411 -0.152 -1.118 ...
##  $ SFGroup     : Factor w/ 2 levels "<= 1500 sf","> 1500 sf": 1 1 2 2 2 1 1 2 2 1 ...
##  $ SquareFeet  : num  -1.075 -1.0858 0.284 0.2875 0.0172 ...
##  $ StreetName  : Factor w/ 73 levels "Acrebrook Drive",..: 1 2 3 3 3 4 5 6 7 7 ...
##  $ StreetNum   : num  1.328 -0.4 -0.529 0.308 1.273 ...
##  $ WalkScore   : num  -1.1383 -1.2907 0.2715 0.0429 -0.2619 ...
##  $ Zip         : num  0.836 0.836 0.836 -1.185 0.836 ...

Summary statistics of data frame

summary(input_data1_std)
##     HouseNum            Acre               AcreGroup     Adj1998       
##  Min.   :-1.7072   Min.   :-1.7060   <= 1/4 acre:54   Min.   :-2.2254  
##  1st Qu.:-0.8536   1st Qu.:-0.7395   > 1/4 acre :50   1st Qu.:-0.6254  
##  Median : 0.0000   Median :-0.0609                    Median :-0.1196  
##  Mean   : 0.0000   Mean   : 0.0000                    Mean   : 0.0000  
##  3rd Qu.: 0.8536   3rd Qu.: 0.5971                    3rd Qu.: 0.2983  
##  Max.   : 1.7072   Max.   : 2.4890                    Max.   : 3.9436  
##                                                                        
##     Adj2007           Adj2011            BedGroup     Bedrooms      
##  Min.   :-1.5713   Min.   :-1.5308   1-2 beds:16   Min.   :-2.4696  
##  1st Qu.:-0.6382   1st Qu.:-0.7366   3 beds  :52   1st Qu.:-0.2744  
##  Median :-0.2277   Median :-0.2750   4+ beds :36   Median :-0.2744  
##  Mean   : 0.0000   Mean   : 0.0000                 Mean   : 0.0000  
##  3rd Qu.: 0.2088   3rd Qu.: 0.4349                 3rd Qu.: 0.8232  
##  Max.   : 4.4871   Max.   : 4.4381                 Max.   : 3.0183  
##                                                                     
##    BikeScore          Diff2014          Distance              DistGroup 
##  Min.   :-1.7323   Min.   :-3.7142   Min.   :-1.1444   Closer      :40  
##  1st Qu.:-0.9385   1st Qu.:-0.5249   1st Qu.:-0.8359   Farther Away:64  
##  Median :-0.1226   Median :-0.1716   Median :-0.3766                    
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000                    
##  3rd Qu.: 0.8808   3rd Qu.: 0.2918   3rd Qu.: 0.8315                    
##  Max.   : 1.7518   Max.   : 5.3976   Max.   : 3.0458                    
##                                                                         
##   GarageSpaces     GarageGroup    Latitude         Longitude      
##  Min.   :-0.8786   no :51      Min.   :-2.1317   Min.   :-2.8112  
##  1st Qu.:-0.8786   yes:53      1st Qu.:-0.6669   1st Qu.:-0.7140  
##  Median : 0.2780               Median :-0.1498   Median :-0.0775  
##  Mean   : 0.0000               Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2780               3rd Qu.: 0.5816   3rd Qu.: 0.8384  
##  Max.   : 3.7480               Max.   : 2.1718   Max.   : 2.0240  
##                                                                   
##   NumFullBaths      NumHalfBaths        NumRooms          PctChange      
##  Min.   :-0.7274   Min.   :-0.5303   Min.   :-1.56785   Min.   :-2.9453  
##  1st Qu.:-0.7274   1st Qu.:-0.5303   1st Qu.:-0.96838   1st Qu.:-0.5185  
##  Median :-0.7274   Median :-0.5303   Median :-0.06917   Median :-0.1522  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.8821   3rd Qu.:-0.5303   3rd Qu.: 0.38043   3rd Qu.: 0.2969  
##  Max.   : 4.1012   Max.   : 1.8676   Max.   : 4.42687   Max.   : 2.9235  
##                                                                          
##    Price1998         Price2007         Price2011         Price2014      
##  Min.   :-2.2254   Min.   :-1.5713   Min.   :-1.5308   Min.   :-1.4515  
##  1st Qu.:-0.6254   1st Qu.:-0.6382   1st Qu.:-0.7366   1st Qu.:-0.7228  
##  Median :-0.1196   Median :-0.2277   Median :-0.2750   Median :-0.1819  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2983   3rd Qu.: 0.2088   3rd Qu.: 0.4349   3rd Qu.: 0.3704  
##  Max.   : 3.9436   Max.   : 4.4871   Max.   : 4.4381   Max.   : 5.2866  
##                                                                         
##        SFGroup     SquareFeet                    StreetName   StreetNum       
##  <= 1500 sf:51   Min.   :-1.86641   Laurel Park       : 8   Min.   :-0.67755  
##  > 1500 sf :53   1st Qu.:-0.64440   Ryan Road         : 6   1st Qu.:-0.54881  
##                  Median :-0.09114   Bridge Road       : 3   Median :-0.36808  
##                  Mean   : 0.00000   Longview Drive    : 3   Mean   : 0.00000  
##                  3rd Qu.: 0.47510   North Maple Street: 3   3rd Qu.: 0.08499  
##                  Max.   : 4.41104   Burts Pit Rd      : 2   Max.   : 4.69487  
##                                     (Other)           :79                     
##    WalkScore            Zip         
##  Min.   :-1.4050   Min.   :-1.1853  
##  1st Qu.:-0.9192   1st Qu.:-1.1853  
##  Median :-0.1095   Median : 0.8355  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8335   3rd Qu.: 0.8355  
##  Max.   : 2.1003   Max.   : 0.8355  
## 

Glimpse of data frame

glimpse(input_data1_std)
## Rows: 104
## Columns: 30
## $ HouseNum     <dbl> -1.7072084, -1.6740587, -1.6409090, -1.6077593, -1.574...
## $ Acre         <dbl> 0.18586126, 0.26811476, 0.84388923, 0.02135427, 0.4326...
## $ AcreGroup    <fct> > 1/4 acre, > 1/4 acre, > 1/4 acre, > 1/4 acre, > 1/4 ...
## $ Adj1998      <dbl> -0.90582353, -1.10375765, 0.72163483, 0.34775926, 0.95...
## $ Adj2007      <dbl> -0.89264454, -0.62995035, 0.69993898, -0.21401788, -0....
## $ Adj2011      <dbl> -0.99377393, -0.83140748, 0.67984945, -0.28980751, -0....
## $ BedGroup     <fct> 3 beds, 3 beds, 3 beds, 3 beds, 4+ beds, 3 beds, 3 bed...
## $ Bedrooms     <dbl> -0.2743946, -0.2743946, -0.2743946, -0.2743946, 0.8231...
## $ BikeScore    <dbl> -0.9825765, -0.5856444, 0.3846340, 0.1641161, -0.18871...
## $ Diff2014     <dbl> -0.28942475, -0.20326433, -0.03126606, -0.52183837, -2...
## $ Distance     <dbl> 1.3680343, 0.9104893, -1.1395555, -0.5954349, -0.55089...
## $ DistGroup    <fct> Farther Away, Farther Away, Closer, Farther Away, Fart...
## $ GarageSpaces <dbl> 1.4346856, 0.2780398, 1.4346856, 0.2780398, -0.8786059...
## $ GarageGroup  <fct> yes, yes, yes, yes, no, yes, no, no, no, no, no, yes, ...
## $ Latitude     <dbl> -0.839007667, -2.131724363, 1.353455876, 1.405389538, ...
## $ Longitude    <dbl> -1.33892696, -0.52601946, -0.75815192, -0.41880986, -0...
## $ NumFullBaths <dbl> -0.7273812, -0.7273812, 0.8821431, -0.7273812, -0.7273...
## $ NumHalfBaths <dbl> -0.5303021, -0.5303021, 1.8675857, 1.8675857, -0.53030...
## $ NumRooms     <dbl> -0.9683778, -0.9683778, 0.2305661, -0.3689058, -0.3689...
## $ PctChange    <dbl> -0.005463079, 0.291539045, -0.337223680, -0.760382299,...
## $ Price1998    <dbl> -0.90582353, -1.10375765, 0.72163483, 0.34775926, 0.95...
## $ Price2007    <dbl> -0.89264454, -0.62995035, 0.69993898, -0.21401788, -0....
## $ Price2011    <dbl> -0.99377393, -0.83140748, 0.67984945, -0.28980751, -0....
## $ Price2014    <dbl> -0.74276372, -0.80190345, 0.41092996, -0.15189847, -1....
## $ SFGroup      <fct> <= 1500 sf, <= 1500 sf, > 1500 sf, > 1500 sf, > 1500 s...
## $ SquareFeet   <dbl> -1.07501499, -1.08575791, 0.28396428, 0.28754525, 0.01...
## $ StreetName   <fct> Acrebrook Drive, Autumn Dr, Bridge Road, Bridge Road, ...
## $ StreetNum    <dbl> 1.327823067, -0.400265643, -0.529005777, 0.307805089, ...
## $ WalkScore    <dbl> -1.13827217, -1.29067681, 0.27147077, 0.04286381, -0.2...
## $ Zip          <dbl> 0.8355477, 0.8355477, 0.8355477, -1.1853119, 0.8355477...

Head of data frame

head(input_data1_std)
##    HouseNum       Acre  AcreGroup    Adj1998    Adj2007    Adj2011 BedGroup
## 1 -1.707208 0.18586126 > 1/4 acre -0.9058235 -0.8926445 -0.9937739   3 beds
## 2 -1.674059 0.26811476 > 1/4 acre -1.1037577 -0.6299503 -0.8314075   3 beds
## 3 -1.640909 0.84388923 > 1/4 acre  0.7216348  0.6999390  0.6798494   3 beds
## 4 -1.607759 0.02135427 > 1/4 acre  0.3477593 -0.2140179 -0.2898075   3 beds
## 5 -1.574610 0.43262175 > 1/4 acre  0.9525580 -0.2742186 -0.5134872  4+ beds
## 6 -1.541460 0.43262175 > 1/4 acre -0.2350468 -0.4931305 -0.4419551   3 beds
##     Bedrooms  BikeScore    Diff2014   Distance    DistGroup GarageSpaces
## 1 -0.2743946 -0.9825765 -0.28942475  1.3680343 Farther Away    1.4346856
## 2 -0.2743946 -0.5856444 -0.20326433  0.9104893 Farther Away    0.2780398
## 3 -0.2743946  0.3846340 -0.03126606 -1.1395555       Closer    1.4346856
## 4 -0.2743946  0.1641161 -0.52183837 -0.5954349 Farther Away    0.2780398
## 5  0.8231838 -0.1887124 -2.44522656 -0.5508976 Farther Away   -0.8786059
## 6 -0.2743946 -0.9384729 -0.86176216  0.8147241 Farther Away    0.2780398
##   GarageGroup   Latitude   Longitude NumFullBaths NumHalfBaths   NumRooms
## 1         yes -0.8390077 -1.33892696   -0.7273812   -0.5303021 -0.9683778
## 2         yes -2.1317244 -0.52601946   -0.7273812   -0.5303021 -0.9683778
## 3         yes  1.3534559 -0.75815192    0.8821431    1.8675857  0.2305661
## 4         yes  1.4053895 -0.41880986   -0.7273812    1.8675857 -0.3689058
## 5          no  1.2563692 -0.08754234   -0.7273812   -0.5303021 -0.3689058
## 6         yes -0.5767966 -1.21176353   -0.7273812    1.8675857 -0.3689058
##      PctChange  Price1998  Price2007  Price2011  Price2014    SFGroup
## 1 -0.005463079 -0.9058235 -0.8926445 -0.9937739 -0.7427637 <= 1500 sf
## 2  0.291539045 -1.1037577 -0.6299503 -0.8314075 -0.8019034 <= 1500 sf
## 3 -0.337223680  0.7216348  0.6999390  0.6798494  0.4109300  > 1500 sf
## 4 -0.760382299  0.3477593 -0.2140179 -0.2898075 -0.1518985  > 1500 sf
## 5 -2.648260307  0.9525580 -0.2742186 -0.5134872 -1.1175137  > 1500 sf
## 6 -1.079180970 -0.2350468 -0.4931305 -0.4419551 -0.7359281 <= 1500 sf
##    SquareFeet      StreetName  StreetNum   WalkScore        Zip
## 1 -1.07501499 Acrebrook Drive  1.3278231 -1.13827217  0.8355477
## 2 -1.08575791       Autumn Dr -0.4002656 -1.29067681  0.8355477
## 3  0.28396428     Bridge Road -0.5290058  0.27147077  0.8355477
## 4  0.28754525     Bridge Road  0.3078051  0.04286381 -1.1853119
## 5  0.01718178     Bridge Road  1.2733561 -0.26194548  0.8355477
## 6 -0.44118276 Brierwood Drive -0.5686181 -1.02396869  0.8355477

Tail of data frame

tail(input_data1_std)
##     HouseNum       Acre   AcreGroup    Adj1998     Adj2007    Adj2011 BedGroup
## 99  1.541460 -0.5544202 <= 1/4 acre  3.2947784  1.77260692  1.3066066   3 beds
## 100 1.574610 -1.0479412 <= 1/4 acre  0.1498251 -0.03341563  0.5492750   3 beds
## 101 1.607759  1.6664242  > 1/4 acre  1.7332981  1.79997090  2.3784523  4+ beds
## 102 1.640909  1.1729032  > 1/4 acre  0.2048068  0.01583953 -0.1331182  4+ beds
## 103 1.674059 -0.4721667 <= 1/4 acre  0.1498251  0.02678512 -0.2125983  4+ beds
## 104 1.707208  0.4326217  > 1/4 acre -0.4989589 -0.52049444 -0.5089454 1-2 beds
##       Bedrooms  BikeScore    Diff2014    Distance    DistGroup GarageSpaces
## 99  -0.2743946  1.4431195 -0.57743197 -0.46766734 Farther Away   -0.8786059
## 100 -0.2743946  0.6933589 -0.01452059 -0.17424528 Farther Away   -0.8786059
## 101  0.8231838  0.1200126  1.65300811 -0.37254700 Farther Away    1.4346856
## 102  0.8231838  0.9138767  0.32824766 -1.05249620       Closer   -0.8786059
## 103  0.8231838  1.0020839  0.23185542 -0.04688075 Farther Away   -0.8786059
## 104 -1.3719730 -0.4533337 -1.08958781 -0.41607664 Farther Away   -0.8786059
##     GarageGroup    Latitude  Longitude NumFullBaths NumHalfBaths   NumRooms
## 99           no -0.24084592  1.4542721    0.8821431    1.8675857  1.4295100
## 100          no -0.90488791  1.7001439    0.8821431   -0.5303021  0.8300381
## 101         yes -0.26234368  0.5264609    0.8821431    1.8675857  1.4295100
## 102          no  1.05556932 -0.4909734    0.8821431   -0.5303021  0.8300381
## 103          no -0.64491138  1.6585030   -0.7273812   -0.5303021  0.8300381
## 104          no  0.05904402  0.1423073   -0.7273812   -0.5303021 -0.9683778
##      PctChange  Price1998   Price2007  Price2011   Price2014    SFGroup
## 99  -1.0851829  3.2947784  1.77260692  1.3066066  1.57604036  > 1500 sf
## 100 -0.1334153  0.1498251 -0.03341563  0.5492750  0.07977272  > 1500 sf
## 101  0.7615691  1.7332981  1.79997090  2.3784523  2.18027922  > 1500 sf
## 102  0.2369527  0.2048068  0.01583953 -0.1331182  0.34940949  > 1500 sf
## 103  0.1524436  0.1498251  0.02678512 -0.2125983  0.24989639  > 1500 sf
## 104 -1.3766769 -0.4989589 -0.52049444 -0.5089454 -1.05142116 <= 1500 sf
##     SquareFeet      StreetName  StreetNum   WalkScore        Zip
## 99   0.9589777    Union Street -0.4200718  1.90982067 -1.1853119
## 100  0.5238895   Valley Street -0.4794903  0.91919050 -1.1853119
## 101  1.7217249   Vernon Street -0.4745388  0.23336961 -1.1853119
## 102  0.7047286   Warren Street -0.5983274  0.99539282  0.8355477
## 103  0.6707094 Williams Street -0.4943450  1.56691023 -1.1853119
## 104 -0.6614126     Winslow Ave -0.6775521 -0.03333852  0.8355477

Variable data types

sapply(input_data1_std,mode)
##     HouseNum         Acre    AcreGroup      Adj1998      Adj2007      Adj2011 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##     BedGroup     Bedrooms    BikeScore     Diff2014     Distance    DistGroup 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
## GarageSpaces  GarageGroup     Latitude    Longitude NumFullBaths NumHalfBaths 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##     NumRooms    PctChange    Price1998    Price2007    Price2011    Price2014 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##      SFGroup   SquareFeet   StreetName    StreetNum    WalkScore          Zip 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"

Mean of each variable

lapply(input_data1_std[,num.names],mean)
## $HouseNum
## [1] 0
## 
## $Acre
## [1] 9.22656e-17
## 
## $Adj1998
## [1] 1.210022e-16
## 
## $Adj2007
## [1] 2.232273e-16
## 
## $Adj2011
## [1] 9.771841e-17
## 
## $Bedrooms
## [1] 1.810305e-17
## 
## $BikeScore
## [1] 2.091572e-17
## 
## $Diff2014
## [1] 4.066279e-17
## 
## $Distance
## [1] -1.13321e-16
## 
## $GarageSpaces
## [1] 7.512896e-17
## 
## $Latitude
## [1] -1.052803e-13
## 
## $Longitude
## [1] -2.715005e-13
## 
## $NumFullBaths
## [1] 9.598525e-17
## 
## $NumHalfBaths
## [1] -1.70814e-17
## 
## $NumRooms
## [1] 2.433575e-16
## 
## $PctChange
## [1] -1.196469e-16
## 
## $Price1998
## [1] -4.682138e-18
## 
## $Price2007
## [1] -2.813041e-16
## 
## $Price2011
## [1] -1.470246e-16
## 
## $Price2014
## [1] 9.333938e-17
## 
## $SquareFeet
## [1] 1.158512e-16
## 
## $StreetNum
## [1] 4.123617e-17
## 
## $WalkScore
## [1] 2.891917e-17
## 
## $Zip
## [1] 1.761201e-14

Median of each variable

lapply(input_data1_std[,num.names],median)
## $HouseNum
## [1] 0
## 
## $Acre
## [1] -0.06089922
## 
## $Adj1998
## [1] -0.1195852
## 
## $Adj2007
## [1] -0.2276999
## 
## $Adj2011
## [1] -0.2750469
## 
## $Bedrooms
## [1] -0.2743946
## 
## $BikeScore
## [1] -0.122557
## 
## $Diff2014
## [1] -0.1715939
## 
## $Distance
## [1] -0.3765775
## 
## $GarageSpaces
## [1] 0.2780398
## 
## $Latitude
## [1] -0.1498079
## 
## $Longitude
## [1] -0.07750201
## 
## $NumFullBaths
## [1] -0.7273812
## 
## $NumHalfBaths
## [1] -0.5303021
## 
## $NumRooms
## [1] -0.06916984
## 
## $PctChange
## [1] -0.1521968
## 
## $Price1998
## [1] -0.1195852
## 
## $Price2007
## [1] -0.2276999
## 
## $Price2011
## [1] -0.2750469
## 
## $Price2014
## [1] -0.1819283
## 
## $SquareFeet
## [1] -0.09114265
## 
## $StreetNum
## [1] -0.3680806
## 
## $WalkScore
## [1] -0.1095408
## 
## $Zip
## [1] 0.8355477

Minimum value of each variable

lapply(input_data1_std[,num.names],min)
## $HouseNum
## [1] -1.707208
## 
## $Acre
## [1] -1.705969
## 
## $Adj1998
## [1] -2.225384
## 
## $Adj2007
## [1] -1.571271
## 
## $Adj2011
## [1] -1.530832
## 
## $Bedrooms
## [1] -2.469551
## 
## $BikeScore
## [1] -1.732337
## 
## $Diff2014
## [1] -3.714171
## 
## $Distance
## [1] -1.144392
## 
## $GarageSpaces
## [1] -0.8786059
## 
## $Latitude
## [1] -2.131724
## 
## $Longitude
## [1] -2.811241
## 
## $NumFullBaths
## [1] -0.7273812
## 
## $NumHalfBaths
## [1] -0.5303021
## 
## $NumRooms
## [1] -1.56785
## 
## $PctChange
## [1] -2.945329
## 
## $Price1998
## [1] -2.225384
## 
## $Price2007
## [1] -1.571271
## 
## $Price2011
## [1] -1.530832
## 
## $Price2014
## [1] -1.451521
## 
## $SquareFeet
## [1] -1.86641
## 
## $StreetNum
## [1] -0.6775521
## 
## $WalkScore
## [1] -1.40498
## 
## $Zip
## [1] -1.185312

Maximum value of each variable

lapply(input_data1_std[,num.names],max)
## $HouseNum
## [1] 1.707208
## 
## $Acre
## [1] 2.488959
## 
## $Adj1998
## [1] 3.943563
## 
## $Adj2007
## [1] 4.487114
## 
## $Adj2011
## [1] 4.438122
## 
## $Bedrooms
## [1] 3.018341
## 
## $BikeScore
## [1] 1.751844
## 
## $Diff2014
## [1] 5.397634
## 
## $Distance
## [1] 3.04582
## 
## $GarageSpaces
## [1] 3.747977
## 
## $Latitude
## [1] 2.171758
## 
## $Longitude
## [1] 2.023971
## 
## $NumFullBaths
## [1] 4.101192
## 
## $NumHalfBaths
## [1] 1.867586
## 
## $NumRooms
## [1] 4.42687
## 
## $PctChange
## [1] 2.92347
## 
## $Price1998
## [1] 3.943563
## 
## $Price2007
## [1] 4.487114
## 
## $Price2011
## [1] 4.438122
## 
## $Price2014
## [1] 5.28663
## 
## $SquareFeet
## [1] 4.411036
## 
## $StreetNum
## [1] 4.694873
## 
## $WalkScore
## [1] 2.100326
## 
## $Zip
## [1] 0.8355477

Range of each variable

lapply(input_data1_std[,num.names],range)
## $HouseNum
## [1] -1.707208  1.707208
## 
## $Acre
## [1] -1.705969  2.488959
## 
## $Adj1998
## [1] -2.225384  3.943563
## 
## $Adj2007
## [1] -1.571271  4.487114
## 
## $Adj2011
## [1] -1.530832  4.438122
## 
## $Bedrooms
## [1] -2.469551  3.018341
## 
## $BikeScore
## [1] -1.732337  1.751844
## 
## $Diff2014
## [1] -3.714171  5.397634
## 
## $Distance
## [1] -1.144392  3.045820
## 
## $GarageSpaces
## [1] -0.8786059  3.7479771
## 
## $Latitude
## [1] -2.131724  2.171758
## 
## $Longitude
## [1] -2.811241  2.023971
## 
## $NumFullBaths
## [1] -0.7273812  4.1011916
## 
## $NumHalfBaths
## [1] -0.5303021  1.8675857
## 
## $NumRooms
## [1] -1.56785  4.42687
## 
## $PctChange
## [1] -2.945329  2.923470
## 
## $Price1998
## [1] -2.225384  3.943563
## 
## $Price2007
## [1] -1.571271  4.487114
## 
## $Price2011
## [1] -1.530832  4.438122
## 
## $Price2014
## [1] -1.451521  5.286630
## 
## $SquareFeet
## [1] -1.866410  4.411036
## 
## $StreetNum
## [1] -0.6775521  4.6948727
## 
## $WalkScore
## [1] -1.404980  2.100326
## 
## $Zip
## [1] -1.1853119  0.8355477

Variance of each variable

lapply(input_data1_std[,num.names],var)
## $HouseNum
## [1] 1
## 
## $Acre
## [1] 1
## 
## $Adj1998
## [1] 1
## 
## $Adj2007
## [1] 1
## 
## $Adj2011
## [1] 1
## 
## $Bedrooms
## [1] 1
## 
## $BikeScore
## [1] 1
## 
## $Diff2014
## [1] 1
## 
## $Distance
## [1] 1
## 
## $GarageSpaces
## [1] 1
## 
## $Latitude
## [1] 1
## 
## $Longitude
## [1] 1
## 
## $NumFullBaths
## [1] 1
## 
## $NumHalfBaths
## [1] 1
## 
## $NumRooms
## [1] 1
## 
## $PctChange
## [1] 1
## 
## $Price1998
## [1] 1
## 
## $Price2007
## [1] 1
## 
## $Price2011
## [1] 1
## 
## $Price2014
## [1] 1
## 
## $SquareFeet
## [1] 1
## 
## $StreetNum
## [1] 1
## 
## $WalkScore
## [1] 1
## 
## $Zip
## [1] 1

Standard deviation of each variable

lapply(input_data1_std[,num.names],sd)
## $HouseNum
## [1] 1
## 
## $Acre
## [1] 1
## 
## $Adj1998
## [1] 1
## 
## $Adj2007
## [1] 1
## 
## $Adj2011
## [1] 1
## 
## $Bedrooms
## [1] 1
## 
## $BikeScore
## [1] 1
## 
## $Diff2014
## [1] 1
## 
## $Distance
## [1] 1
## 
## $GarageSpaces
## [1] 1
## 
## $Latitude
## [1] 1
## 
## $Longitude
## [1] 1
## 
## $NumFullBaths
## [1] 1
## 
## $NumHalfBaths
## [1] 1
## 
## $NumRooms
## [1] 1
## 
## $PctChange
## [1] 1
## 
## $Price1998
## [1] 1
## 
## $Price2007
## [1] 1
## 
## $Price2011
## [1] 1
## 
## $Price2014
## [1] 1
## 
## $SquareFeet
## [1] 1
## 
## $StreetNum
## [1] 1
## 
## $WalkScore
## [1] 1
## 
## $Zip
## [1] 1

Median absolute deviation of each variable

lapply(input_data1_std[,num.names],mad)
## $HouseNum
## [1] 1.27784
## 
## $Acre
## [1] 0.9755923
## 
## $Adj1998
## [1] 0.7254912
## 
## $Adj2007
## [1] 0.6410034
## 
## $Adj2011
## [1] 0.7347989
## 
## $Bedrooms
## [1] 0.8136349
## 
## $BikeScore
## [1] 1.340453
## 
## $Diff2014
## [1] 0.5899137
## 
## $Distance
## [1] 0.947141
## 
## $GarageSpaces
## [1] 1.714843
## 
## $Latitude
## [1] 1.016381
## 
## $Longitude
## [1] 1.353166
## 
## $NumFullBaths
## [1] 0
## 
## $NumHalfBaths
## [1] 0
## 
## $NumRooms
## [1] 1.333166
## 
## $PctChange
## [1] 0.5914746
## 
## $Price1998
## [1] 0.7254912
## 
## $Price2007
## [1] 0.6410034
## 
## $Price2011
## [1] 0.7347989
## 
## $Price2014
## [1] 0.8021274
## 
## $SquareFeet
## [1] 0.8335366
## 
## $StreetNum
## [1] 0.3303521
## 
## $WalkScore
## [1] 1.299242
## 
## $Zip
## [1] 0

Principal Component Analysis

Explanation

Principal component analysis (PCA) is a technique for reducing the number of dimensions of a data set while retaining as much of its variation as possible. When cluster analysis is performed on a data set with many data fields, it helps to combine each group of highly correlated data fields into a new data field, called a principal component. This not only makes cluster analysis on large data sets more tractable, but also helps to exclude data fields that contribute little to the variation in the data.
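
As a minimal sketch of the idea, the following uses the built-in mtcars data as a stand-in (an assumption; it is not the housing data analysed in this report) and prcomp() from base R. The choice of k = 2 retained components is purely illustrative.

```r
# Minimal sketch on a built-in data set (mtcars), not the housing data.
X <- scale(mtcars)                          # standardize: mean 0, sd 1 per column
pca <- prcomp(X)                            # PCA on the standardized columns
var_share <- pca$sdev^2 / sum(pca$sdev^2)   # share of variance per component

k <- 2                                      # illustrative number of components to keep
reduced <- pca$x[, 1:k]                     # observations described by k new fields

dim(reduced)                                # rows = observations, columns = k
round(sum(var_share[1:k]), 3)               # variance retained by the k components
```

The reduced matrix can then be passed to a clustering routine in place of the original columns.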

Interpretation

Of the 24 numeric variables, 19 have an absolute correlation coefficient larger than 0.5 with at least one other variable; the exceptions are HouseNum, GarageSpaces, Latitude, NumHalfBaths, and StreetNum. The first principal component captures about 44.0% of the variance in the data and the second about 13.8%, so together the first two capture about 57.8%. To capture about 99% of the variance, a minimum of 16 principal components is required, which is a reduction from the original 24 variables.

Check principal component analysis eligibility

Explanation

This step checks whether it is necessary to carry out principal component analysis.

Observation

Of the 24 numeric variables, 19 have an absolute correlation coefficient larger than 0.5 with at least one other variable (the exceptions are HouseNum, GarageSpaces, Latitude, NumHalfBaths, and StreetNum), so principal component analysis is worthwhile. With a cut-off of 0.5, the correlation table shows that many variables are highly correlated, and these should be combined to form principal components.
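
The eligibility check described above can be sketched as a small function. The helper name count_correlated and the use of the built-in mtcars data are assumptions made for illustration; the same function could be applied to input_data1_std[,num.names].

```r
# Hypothetical helper: for each variable, find its strongest absolute
# correlation with any OTHER variable and compare it with a cut-off.
count_correlated <- function(df, cutoff = 0.5) {
  r <- abs(cor(df))
  diag(r) <- 0                       # ignore each variable's correlation with itself
  max_r <- apply(r, 1, max)          # strongest correlation per variable
  list(n_above = sum(max_r > cutoff),            # variables above the cut-off
       below   = names(max_r)[max_r <= cutoff])  # variables at or below it
}

res <- count_correlated(mtcars)      # stand-in data, not the housing data
res$n_above
res$below
```

Listing the variables that fall below the cut-off makes it easy to see which fields are unlikely to be absorbed into a shared principal component.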

Cumulative proportion variance

Explanation

This is the running total of the proportion of variance captured by the principal components, taken in order.

Observation

As the calculations, the Cumulative Proportion Variance plot, and the Scree plot below show, each successive principal component captures less variance than the one before it. The first 16 principal components are required to reach a Cumulative Proportion Variance of 99%. Therefore, any principal component beyond the 16th can be excluded from the cluster analysis.
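
The number of components needed for a given threshold can be read off the cumulative proportions directly. In this report's summary output the cumulative proportion is 0.9864 at Comp.15 and 0.9904 at Comp.16, which is where the figure of 16 comes from. The sketch below shows the same calculation on the built-in mtcars data (a stand-in, assumed for illustration); the variable names are illustrative.

```r
# Sketch: smallest number of components whose cumulative variance share
# reaches a threshold (here 99%), using mtcars as stand-in data.
pca <- prcomp(scale(mtcars))
prop <- pca$sdev^2 / sum(pca$sdev^2)     # proportion of variance per component
cum_prop <- cumsum(prop)                 # cumulative proportion

threshold <- 0.99
n_needed <- which(cum_prop >= threshold)[1]   # first component reaching the threshold

round(cum_prop, 4)
n_needed
```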

Orthogonality of principal components

Explanation

This step validates the resulting principal component analysis by confirming that the principal component scores are uncorrelated with each other.

Observation

Both the calculation and the plot of the correlations between the principal component scores below show that the principal components are uncorrelated with each other. Nearly all of the off-diagonal correlation coefficients are far smaller than the cut-off of 0.5; those printed in the table are numerically zero (on the order of 1e-16).
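
A compact way to run this check is to look only at the off-diagonal entries of the correlation matrix of the scores. The sketch below uses the built-in mtcars data as a stand-in (an assumption). Components with essentially zero variance, such as Comp.21 to Comp.24 in this report, consist of rounding noise and may give unstable correlations, so it can be reasonable to leave them out of such a check.

```r
# Sketch: confirm that principal component scores are uncorrelated.
pca <- prcomp(scale(mtcars))              # stand-in data, not the housing data
score_cor <- cor(pca$x)                   # correlations between component scores
off_diag <- score_cor[upper.tri(score_cor)]   # each pair of components once

max_abs <- max(abs(off_diag))             # largest off-diagonal correlation
max_abs
mean(abs(off_diag) < 0.5)                 # share of pairs below the 0.5 cut-off
```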

Biplot examination

Explanation

This is a plot of the data on the first two principal components. The plot reveals three properties: the degree of correlation between variables, the direction of that correlation, and the group of data points that contribute to the correlation between two particular variables.

Observation

A careful look at the biplot shows that the variables Adj1998, Adj2007, Adj2011, Price1998, Price2007, Price2011, Price2014, SquareFeet, HouseNum, Bedrooms, and NumRooms point in similar directions and are therefore highly correlated with each other. The direction of variable StreetNum is perpendicular to all these variables, which means that StreetNum has very weak correlations with them.
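
The link between arrow direction and correlation can be made concrete. When the loadings are scaled by the component standard deviations, the cosine of the angle between two variable arrows approximates the correlation between those variables, and the approximation improves as the first two components capture more of the variance. The sketch below uses the built-in mtcars data as a stand-in (an assumption); cos_angle is a hypothetical helper, and the default scaling used by biplot() may differ from the scaling shown here.

```r
# Sketch: angle between variable arrows in the PC1-PC2 plane vs. correlation.
pca <- prcomp(scale(mtcars))                               # stand-in data
arrows2d <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])    # loadings scaled by sdev

# Hypothetical helper: cosine of the angle between two 2-D vectors.
cos_angle <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

approx_r <- cos_angle(arrows2d["disp", ], arrows2d["cyl", ])  # from the biplot geometry
actual_r <- cor(mtcars$disp, mtcars$cyl)                      # from the data

c(approx = approx_r, actual = actual_r)
```

A cosine near 1 corresponds to arrows pointing the same way, a cosine near 0 to perpendicular arrows, and a cosine near -1 to arrows pointing in opposite directions.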

Correlation plot

pairs.panels(input_data1_std[,num.names],gap=0,bg=c("green","red","yellow","blue","pink","purple"),pch=21)

Correlation table

oldw <- getOption("warn")
options(warn = -1)
cor(input_data1_std[,num.names])
##                  HouseNum        Acre     Adj1998     Adj2007     Adj2011
## HouseNum      1.000000000  0.07336878  0.19248873  0.18032495  0.18416442
## Acre          0.073368784  1.00000000  0.10107537  0.00551488 -0.03391574
## Adj1998       0.192488733  0.10107537  1.00000000  0.86717492  0.78736808
## Adj2007       0.180324953  0.00551488  0.86717492  1.00000000  0.93829759
## Adj2011       0.184164416 -0.03391574  0.78736808  0.93829759  1.00000000
## Bedrooms      0.002119478  0.05675348  0.59939534  0.54576063  0.56610895
## BikeScore     0.192127736 -0.26596371  0.46062119  0.52293803  0.52643817
## Diff2014      0.154133178 -0.18395645  0.19809934  0.54403362  0.70485942
## Distance     -0.124392026  0.17498754 -0.44455151 -0.48722784 -0.47000388
## GarageSpaces  0.213116923  0.17966297  0.28603566  0.35415565  0.33126795
## Latitude     -0.023278745 -0.24923633 -0.11263415 -0.06399535 -0.09638797
## Longitude     0.111582123 -0.47870730  0.26347048  0.39607791  0.39718795
## NumFullBaths  0.197103468  0.02853680  0.50697400  0.55930563  0.57901110
## NumHalfBaths  0.023538066  0.13207283  0.29091189  0.29311349  0.22887223
## NumRooms      0.183288217  0.10130555  0.73785572  0.76309302  0.73376688
## PctChange     0.027700561 -0.33540669 -0.17558573  0.19012993  0.35527952
## Price1998     0.192488733  0.10107537  1.00000000  0.86717492  0.78736808
## Price2007     0.180324953  0.00551488  0.86717492  1.00000000  0.93829759
## Price2011     0.184164416 -0.03391574  0.78736808  0.93829759  1.00000000
## Price2014     0.221799722 -0.06644213  0.73614877  0.89540777  0.95862560
## SquareFeet    0.191231258  0.05255086  0.80756328  0.86463690  0.82095325
## StreetNum     0.040891296  0.24189606 -0.02145054 -0.14495988 -0.19798526
## WalkScore     0.188996331 -0.23627500  0.45541129  0.50095050  0.51908938
## Zip          -0.073169610  0.53390839 -0.20773850 -0.37403992 -0.37594128
##                  Bedrooms   BikeScore    Diff2014    Distance GarageSpaces
## HouseNum      0.002119478  0.19212774  0.15413318 -0.12439203  0.213116923
## Acre          0.056753476 -0.26596371 -0.18395645  0.17498754  0.179662972
## Adj1998       0.599395342  0.46062119  0.19809934 -0.44455151  0.286035661
## Adj2007       0.545760627  0.52293803  0.54403362 -0.48722784  0.354155654
## Adj2011       0.566108952  0.52643817  0.70485942 -0.47000388  0.331267947
## Bedrooms      1.000000000  0.23580847  0.29051333 -0.22846934  0.237262680
## BikeScore     0.235808466  1.00000000  0.33876338 -0.83567831  0.177785516
## Diff2014      0.290513326  0.33876338  1.00000000 -0.26475678  0.251829614
## Distance     -0.228469345 -0.83567831 -0.26475678  1.00000000 -0.194933559
## GarageSpaces  0.237262680  0.17778552  0.25182961 -0.19493356  1.000000000
## Latitude     -0.228257754  0.07905115  0.02485013 -0.32113258 -0.053388067
## Longitude     0.062023009  0.63782184  0.32329321 -0.56926915 -0.120406744
## NumFullBaths  0.484522900  0.21701946  0.35314835 -0.19122431  0.186130181
## NumHalfBaths -0.019164101  0.11970539  0.05902457 -0.17158509  0.175803871
## NumRooms      0.741011901  0.45412031  0.40171804 -0.37041186  0.258398031
## PctChange     0.019829923  0.11974807  0.83393325 -0.04082699  0.095239855
## Price1998     0.599395342  0.46062119  0.19809934 -0.44455151  0.286035661
## Price2007     0.545760627  0.52293803  0.54403362 -0.48722784  0.354155654
## Price2011     0.566108952  0.52643817  0.70485942 -0.47000388  0.331267947
## Price2014     0.559854474  0.50999562  0.80923709 -0.44926222  0.345327841
## SquareFeet    0.703035755  0.45082496  0.46002301 -0.42342170  0.394147584
## StreetNum    -0.082720994 -0.22639775 -0.26571409  0.34578751  0.009781441
## WalkScore     0.199452323  0.92889448  0.35060372 -0.76072319  0.109478592
## Zip          -0.026918105 -0.48371758 -0.32438208  0.30701353  0.128523113
##                 Latitude   Longitude NumFullBaths NumHalfBaths    NumRooms
## HouseNum     -0.02327875  0.11158212   0.19710347  0.023538066  0.18328822
## Acre         -0.24923633 -0.47870730   0.02853680  0.132072830  0.10130555
## Adj1998      -0.11263415  0.26347048   0.50697400  0.290911889  0.73785572
## Adj2007      -0.06399535  0.39607791   0.55930563  0.293113487  0.76309302
## Adj2011      -0.09638797  0.39718795   0.57901110  0.228872226  0.73376688
## Bedrooms     -0.22825775  0.06202301   0.48452290 -0.019164101  0.74101190
## BikeScore     0.07905115  0.63782184   0.21701946  0.119705394  0.45412031
## Diff2014      0.02485013  0.32329321   0.35314835  0.059024571  0.40171804
## Distance     -0.32113258 -0.56926915  -0.19122431 -0.171585094 -0.37041186
## GarageSpaces -0.05338807 -0.12040674   0.18613018  0.175803871  0.25839803
## Latitude      1.00000000  0.23539200  -0.05362645  0.025661444 -0.16304912
## Longitude     0.23539200  1.00000000   0.13460933 -0.024741303  0.23273064
## NumFullBaths -0.05362645  0.13460933   1.00000000 -0.164653898  0.48783671
## NumHalfBaths  0.02566144 -0.02474130  -0.16465390  1.000000000  0.19323671
## NumRooms     -0.16304912  0.23273064   0.48783671  0.193236715  1.00000000
## PctChange     0.14492862  0.31729342   0.16777199 -0.049495866  0.09750167
## Price1998    -0.11263415  0.26347048   0.50697400  0.290911889  0.73785572
## Price2007    -0.06399535  0.39607791   0.55930563  0.293113487  0.76309302
## Price2011    -0.09638797  0.39718795   0.57901110  0.228872226  0.73376688
## Price2014    -0.05034933  0.38114908   0.54771054  0.215117766  0.71962945
## SquareFeet   -0.12599754  0.28037440   0.56366439  0.272513762  0.85579218
## StreetNum    -0.31431361 -0.35401400  -0.13566300 -0.004523405 -0.13295553
## WalkScore     0.09457122  0.59410185   0.16901516  0.115200773  0.46191186
## Zip          -0.11963301 -0.82468131  -0.08107265 -0.023070904 -0.19451938
##                PctChange   Price1998   Price2007   Price2011   Price2014
## HouseNum      0.02770056  0.19248873  0.18032495  0.18416442  0.22179972
## Acre         -0.33540669  0.10107537  0.00551488 -0.03391574 -0.06644213
## Adj1998      -0.17558573  1.00000000  0.86717492  0.78736808  0.73614877
## Adj2007       0.19012993  0.86717492  1.00000000  0.93829759  0.89540777
## Adj2011       0.35527952  0.78736808  0.93829759  1.00000000  0.95862560
## Bedrooms      0.01982992  0.59939534  0.54576063  0.56610895  0.55985447
## BikeScore     0.11974807  0.46062119  0.52293803  0.52643817  0.50999562
## Diff2014      0.83393325  0.19809934  0.54403362  0.70485942  0.80923709
## Distance     -0.04082699 -0.44455151 -0.48722784 -0.47000388 -0.44926222
## GarageSpaces  0.09523986  0.28603566  0.35415565  0.33126795  0.34532784
## Latitude      0.14492862 -0.11263415 -0.06399535 -0.09638797 -0.05034933
## Longitude     0.31729342  0.26347048  0.39607791  0.39718795  0.38114908
## NumFullBaths  0.16777199  0.50697400  0.55930563  0.57901110  0.54771054
## NumHalfBaths -0.04949587  0.29091189  0.29311349  0.22887223  0.21511777
## NumRooms      0.09750167  0.73785572  0.76309302  0.73376688  0.71962945
## PctChange     1.00000000 -0.17558573  0.19012993  0.35527952  0.47059529
## Price1998    -0.17558573  1.00000000  0.86717492  0.78736808  0.73614877
## Price2007     0.19012993  0.86717492  1.00000000  0.93829759  0.89540777
## Price2011     0.35527952  0.78736808  0.93829759  1.00000000  0.95862560
## Price2014     0.47059529  0.73614877  0.89540777  0.95862560  1.00000000
## SquareFeet    0.14466382  0.80756328  0.86463690  0.82095325  0.80166923
## StreetNum    -0.25836727 -0.02145054 -0.14495988 -0.19798526 -0.19633330
## WalkScore     0.11671121  0.45541129  0.50095050  0.51908938  0.51504881
## Zip          -0.34790929 -0.20773850 -0.37403992 -0.37594128 -0.34849741
##               SquareFeet    StreetNum   WalkScore         Zip
## HouseNum      0.19123126  0.040891296  0.18899633 -0.07316961
## Acre          0.05255086  0.241896055 -0.23627500  0.53390839
## Adj1998       0.80756328 -0.021450537  0.45541129 -0.20773850
## Adj2007       0.86463690 -0.144959878  0.50095050 -0.37403992
## Adj2011       0.82095325 -0.197985256  0.51908938 -0.37594128
## Bedrooms      0.70303576 -0.082720994  0.19945232 -0.02691810
## BikeScore     0.45082496 -0.226397748  0.92889448 -0.48371758
## Diff2014      0.46002301 -0.265714093  0.35060372 -0.32438208
## Distance     -0.42342170  0.345787506 -0.76072319  0.30701353
## GarageSpaces  0.39414758  0.009781441  0.10947859  0.12852311
## Latitude     -0.12599754 -0.314313614  0.09457122 -0.11963301
## Longitude     0.28037440 -0.354014003  0.59410185 -0.82468131
## NumFullBaths  0.56366439 -0.135663003  0.16901516 -0.08107265
## NumHalfBaths  0.27251376 -0.004523405  0.11520077 -0.02307090
## NumRooms      0.85579218 -0.132955525  0.46191186 -0.19451938
## PctChange     0.14466382 -0.258367268  0.11671121 -0.34790929
## Price1998     0.80756328 -0.021450537  0.45541129 -0.20773850
## Price2007     0.86463690 -0.144959878  0.50095050 -0.37403992
## Price2011     0.82095325 -0.197985256  0.51908938 -0.37594128
## Price2014     0.80166923 -0.196333299  0.51504881 -0.34849741
## SquareFeet    1.00000000 -0.114888448  0.42862202 -0.25225094
## StreetNum    -0.11488845  1.000000000 -0.27798184  0.21168545
## WalkScore     0.42862202 -0.277981844  1.00000000 -0.41890531
## Zip          -0.25225094  0.211685446 -0.41890531  1.00000000
options(warn = oldw)

Principal component object creation

oldw <- getOption("warn")
options(warn = -1)
# The data have already been standardized, so no further scaling is needed.
# Note: center and scale. are prcomp() arguments; princomp() does not use them.
pcaobj <- princomp(input_data1_std[,num.names], center=TRUE, scale.=TRUE)
options(warn = oldw)
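
One detail is worth knowing when reading the standard deviations reported by princomp(): it divides by n when estimating the covariance, whereas prcomp() divides by n - 1. The sketch below, which uses the built-in mtcars data as a stand-in (an assumption), illustrates that the two functions report standard deviations differing by the factor sqrt((n - 1) / n), while the proportions of variance agree.

```r
# Sketch: princomp() vs. prcomp() on the same standardized data.
X <- scale(mtcars)                     # stand-in data, not the housing data
n <- nrow(X)

pc_princomp <- princomp(X)             # covariance with divisor n
pc_prcomp   <- prcomp(X)               # covariance with divisor n - 1

ratio <- unname(pc_princomp$sdev) / pc_prcomp$sdev
range(ratio)                           # constant across components
sqrt((n - 1) / n)                      # the expected factor

# Proportions of variance are unaffected by the divisor.
prop_princomp <- unname(pc_princomp$sdev^2 / sum(pc_princomp$sdev^2))
prop_prcomp   <- pc_prcomp$sdev^2 / sum(pc_prcomp$sdev^2)
all.equal(prop_princomp, prop_prcomp)
```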

Attributes of principal component object

oldw <- getOption("warn")
options(warn = -1)
attributes(pcaobj)
## $names
## [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"    
## 
## $class
## [1] "princomp"
options(warn = oldw)

Principal component object statistics

oldw <- getOption("warn")
options(warn = -1)
print(pcaobj)
## Call:
## princomp(x = input_data1_std[, num.names], center = TRUE, scale. = TRUE)
## 
## Standard deviations:
##       Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
## 3.232928e+00 1.811809e+00 1.422716e+00 1.153309e+00 1.067574e+00 1.054650e+00 
##       Comp.7       Comp.8       Comp.9      Comp.10      Comp.11      Comp.12 
## 9.459141e-01 8.327146e-01 7.980660e-01 7.335635e-01 6.737290e-01 5.967567e-01 
##      Comp.13      Comp.14      Comp.15      Comp.16      Comp.17      Comp.18 
## 4.917841e-01 3.990183e-01 3.630865e-01 3.077534e-01 2.825305e-01 2.523302e-01 
##      Comp.19      Comp.20      Comp.21      Comp.22      Comp.23      Comp.24 
## 2.271520e-01 1.805307e-01 7.760413e-09 0.000000e+00 0.000000e+00 0.000000e+00 
## 
##  24  variables and  104 observations.
summary(pcaobj)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     3.2329279 1.8118087 1.42271572 1.15330884 1.06757401
## Proportion of Variance 0.4397207 0.1381050 0.08515715 0.05595979 0.04794914
## Cumulative Proportion  0.4397207 0.5778257 0.66298290 0.71894269 0.76689184
##                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     1.0546497 0.94591409 0.83271459 0.79806602 0.73356348
## Proportion of Variance 0.0467952 0.03764335 0.02917274 0.02679554 0.02263916
## Cumulative Proportion  0.8136870 0.85133039 0.88050313 0.90729867 0.92993782
##                           Comp.11    Comp.12    Comp.13     Comp.14     Comp.15
## Standard deviation     0.67372901 0.59675672 0.49178414 0.399018333 0.363086463
## Proportion of Variance 0.01909657 0.01498233 0.01017499 0.006698392 0.005546321
## Cumulative Proportion  0.94903439 0.96401673 0.97419172 0.980890110 0.986436431
##                            Comp.16    Comp.17     Comp.18     Comp.19
## Standard deviation     0.307753441 0.28253054 0.252330179 0.227152046
## Proportion of Variance 0.003984655 0.00335827 0.002678695 0.002170792
## Cumulative Proportion  0.990421086 0.99377936 0.996458051 0.998628843
##                            Comp.20      Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation     0.180530725 7.760413e-09       0       0       0
## Proportion of Variance 0.001371157 2.533696e-18       0       0       0
## Cumulative Proportion  1.000000000 1.000000e+00       1       1       1
head(pcaobj$scores)
##          Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6
## [1,] -3.6677181  1.4532443 -1.1679153  0.1884363 -0.4415176 -0.1453377
## [2,] -3.2233429  0.6249662 -1.2500736 -0.4668716 -0.3732941 -0.4139261
## [3,]  1.8075818  0.9888193  0.8141262  2.1073792  0.5476904 -2.3057125
## [4,] -0.2600085 -0.3387117  1.6001479  0.9244271 -1.1776550 -2.0028545
## [5,] -1.5245964  1.6160933  3.4805473 -1.4802967  0.4779060 -1.5290237
## [6,] -2.2875418  1.7060820  0.4564077  0.7699515 -0.9030274 -2.0447949
##           Comp.7      Comp.8     Comp.9    Comp.10     Comp.11      Comp.12
## [1,] -1.35406455 -1.79859111 -0.1967463 -0.6210496 -0.07516876  0.064074609
## [2,] -1.90222604  0.01626572  0.3920651 -1.5241403  0.21964807  0.007796853
## [3,] -0.07777814 -0.71096017 -0.9949215 -0.6051153  0.49784449 -0.834957242
## [4,]  0.15789439 -1.01320348 -0.2139452  0.2249570 -0.36933090 -0.616562556
## [5,]  0.30477027 -1.15629432 -1.0738907  0.6309555 -0.63205422  0.617776376
## [6,] -0.50365025  0.18179976  0.7681598 -0.6749029  0.60219646 -0.295296090
##          Comp.13     Comp.14     Comp.15    Comp.16     Comp.17     Comp.18
## [1,] -0.20125072 -0.29307706 -0.09472695 -0.1997637 -0.01744789  0.05698116
## [2,]  0.42135876  0.41714658 -0.20085743 -0.3385085  0.04616254  0.46220683
## [3,] -0.19812970 -0.28887292  0.14002850 -0.3036917 -0.14621640 -0.17689040
## [4,] -0.08800209 -0.66700110  0.58083675  0.5331028 -0.67741049 -0.33959915
## [5,]  0.45069334 -0.01822462 -0.11446789  0.2431517  0.27335244 -0.13261024
## [6,] -0.01044616 -0.21174287  0.18066580 -0.1458738  0.05415001 -0.26353362
##           Comp.19      Comp.20       Comp.21       Comp.22       Comp.23
## [1,] -0.247447091 -0.001243464  4.769372e-16  7.766572e-16 -8.700296e-16
## [2,] -0.176166271 -0.290130988 -6.781830e-17  2.119216e-15  6.601222e-16
## [3,]  0.071808333  0.046755621 -6.098944e-16 -5.236594e-16 -3.096167e-16
## [4,] -0.243240887  0.120640998 -1.589266e-15  1.448638e-15 -1.794964e-15
## [5,]  0.388183841 -0.144746795  5.473973e-16 -1.439164e-15  2.753225e-15
## [6,]  0.007483009 -0.143857001 -9.857269e-16 -1.587027e-15  2.139928e-15
##            Comp.24
## [1,]  4.685124e-16
## [2,]  3.255400e-16
## [3,] -3.576545e-16
## [4,] -1.889961e-16
## [5,] -2.350924e-15
## [6,] -2.195283e-15
options(warn = oldw)
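
One common way to read the importance table above is to take the smallest number of components whose cumulative proportion of variance reaches a chosen threshold. The sketch below is illustrative only: `n_components_for` is a hypothetical helper, the 0.90 threshold is an assumed choice, and the built-in `USArrests` data stands in for the homes data so that the block runs on its own. In this report one would pass `pcaobj$sdev` instead.

```r
# Hypothetical helper (not part of the original analysis): smallest number of
# components whose cumulative share of variance reaches `target`.
n_components_for <- function(sdev, target = 0.90) {
  prop <- sdev^2 / sum(sdev^2)   # proportion of variance per component
  unname(which(cumsum(prop) >= target)[1])
}

# Stand-in data so the example is self-contained (assumption, not the homes data).
fit <- princomp(USArrests, cor = TRUE)
k <- n_components_for(fit$sdev, target = 0.90)
k
```

Applied to the summary above, a 0.90 threshold is first met where the cumulative proportion reads 0.907. The trailing components with a standard deviation of zero add nothing to the cumulative sum, so they never change the count.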

Orthogonality of principal components

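# Silence warnings for this chunk; the previous setting is kept in oldw
oldw <- getOption("warn")
options(warn = -1)
# Correlation matrix of the component scores: off-diagonal entries close to
# zero indicate uncorrelated (orthogonal) components
cor(pcaobj$scores)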
##                Comp.1        Comp.2        Comp.3        Comp.4        Comp.5
## Comp.1   1.000000e+00 -1.580051e-16 -1.395031e-16 -5.467266e-16 -1.883876e-16
## Comp.2  -1.580051e-16  1.000000e+00  4.055168e-17 -8.896307e-17  2.912284e-17
## Comp.3  -1.395031e-16  4.055168e-17  1.000000e+00 -1.267168e-16  1.746643e-16
## Comp.4  -5.467266e-16 -8.896307e-17 -1.267168e-16  1.000000e+00  7.730438e-18
## Comp.5  -1.883876e-16  2.912284e-17  1.746643e-16  7.730438e-18  1.000000e+00
## Comp.6  -1.222468e-15 -1.867001e-16  4.154121e-18 -2.312300e-16  1.723675e-15
## Comp.7  -3.250962e-16  3.957709e-16 -2.174745e-16 -2.010321e-16 -8.022707e-17
## Comp.8   6.664927e-16  1.215623e-16  2.786075e-16  4.949045e-16 -2.817185e-16
## Comp.9   6.188892e-16 -1.211573e-16  3.482079e-16  1.586384e-16 -2.162732e-16
## Comp.10  9.731154e-16 -1.679738e-16  4.083560e-16 -9.161361e-17  2.613200e-17
## Comp.11  7.211421e-16 -3.169926e-16  3.025853e-16 -6.582751e-16 -7.429708e-16
## Comp.12  9.446501e-16 -3.264599e-16  5.540094e-16 -6.454917e-16 -1.378628e-15
## Comp.13  1.837625e-15 -1.373976e-16 -2.710350e-16 -6.470112e-16 -3.823640e-16
## Comp.14  5.450171e-16 -2.449674e-16 -2.143550e-16 -3.840538e-16 -9.873178e-16
## Comp.15  3.908771e-16 -5.586375e-16 -3.517766e-16 -9.793049e-16  1.840103e-16
## Comp.16  1.696264e-16  1.905954e-16 -5.593340e-17  5.924718e-16  6.446849e-16
## Comp.17  8.384544e-16 -9.237489e-16 -4.606962e-16 -6.349307e-16  1.568606e-16
## Comp.18  2.002688e-15  6.857871e-16  2.835739e-16 -1.930175e-15  5.884643e-16
## Comp.19  1.508817e-15 -2.032157e-16 -8.743513e-16 -7.644958e-16 -9.528696e-16
## Comp.20  1.579654e-16  1.081969e-15 -1.624796e-15  3.518967e-16 -1.323764e-15
## Comp.21 -4.156225e-02  1.112226e-01  9.702701e-02  5.025419e-02  4.301070e-02
## Comp.22  1.193163e-01  1.317005e-01  1.823238e-01 -3.518933e-02  6.258250e-02
## Comp.23 -1.004774e-01  2.542828e-02  3.361096e-03  1.407656e-02 -9.923853e-03
## Comp.24  1.417533e-01  3.072999e-02 -1.563205e-02 -1.465899e-01  2.313722e-02
##                Comp.6        Comp.7        Comp.8        Comp.9       Comp.10
## Comp.1  -1.222468e-15 -3.250962e-16  6.664927e-16  6.188892e-16  9.731154e-16
## Comp.2  -1.867001e-16  3.957709e-16  1.215623e-16 -1.211573e-16 -1.679738e-16
## Comp.3   4.154121e-18 -2.174745e-16  2.786075e-16  3.482079e-16  4.083560e-16
## Comp.4  -2.312300e-16 -2.010321e-16  4.949045e-16  1.586384e-16 -9.161361e-17
## Comp.5   1.723675e-15 -8.022707e-17 -2.817185e-16 -2.162732e-16  2.613200e-17
## Comp.6   1.000000e+00 -1.743199e-16  3.490881e-16 -1.112087e-16  7.931291e-17
## Comp.7  -1.743199e-16  1.000000e+00 -8.740894e-16  3.373954e-16  6.178948e-17
## Comp.8   3.490881e-16 -8.740894e-16  1.000000e+00 -3.418759e-16 -1.974562e-16
## Comp.9  -1.112087e-16  3.373954e-16 -3.418759e-16  1.000000e+00  4.638567e-16
## Comp.10  7.931291e-17  6.178948e-17 -1.974562e-16  4.638567e-16  1.000000e+00
## Comp.11 -3.196747e-16  7.823402e-16 -2.032360e-16  1.261417e-15  2.618556e-16
## Comp.12 -4.304184e-16  8.624448e-16 -1.202871e-15  9.967707e-16  1.470154e-15
## Comp.13 -4.163868e-16  6.415369e-16 -3.351177e-16  9.358732e-16 -1.757242e-16
## Comp.14 -7.778673e-19 -2.407173e-16  9.877165e-16  1.258548e-15  1.324940e-15
## Comp.15  2.031605e-16 -4.950581e-16  2.445775e-15 -3.286640e-16 -2.757026e-17
## Comp.16  8.680878e-17 -1.024863e-15  8.003062e-16 -1.599559e-15 -7.890523e-16
## Comp.17  1.210716e-16  4.689345e-16 -2.631077e-16  1.651898e-16  4.875531e-16
## Comp.18 -7.222821e-16  7.888098e-16 -3.210576e-16  1.224203e-15  1.022465e-15
## Comp.19  2.906465e-16  6.800372e-16 -1.693469e-16 -5.163458e-16  1.584544e-16
## Comp.20  3.517859e-16 -1.025324e-15  1.893397e-15 -9.421336e-16  1.803581e-15
## Comp.21  1.019073e-01 -2.279398e-02 -2.077435e-01 -4.393539e-03 -7.335382e-02
## Comp.22 -3.921590e-02  3.805255e-02  3.843789e-02 -4.437388e-02  2.353334e-02
## Comp.23  3.509504e-02 -6.770753e-02  1.779734e-02 -1.496132e-02  2.548912e-02
## Comp.24  1.201719e-01  1.861835e-02  8.710037e-02 -1.363320e-01 -6.943967e-02
##               Comp.11       Comp.12       Comp.13       Comp.14       Comp.15
## Comp.1   7.211421e-16  9.446501e-16  1.837625e-15  5.450171e-16  3.908771e-16
## Comp.2  -3.169926e-16 -3.264599e-16 -1.373976e-16 -2.449674e-16 -5.586375e-16
## Comp.3   3.025853e-16  5.540094e-16 -2.710350e-16 -2.143550e-16 -3.517766e-16
## Comp.4  -6.582751e-16 -6.454917e-16 -6.470112e-16 -3.840538e-16 -9.793049e-16
## Comp.5  -7.429708e-16 -1.378628e-15 -3.823640e-16 -9.873178e-16  1.840103e-16
## Comp.6  -3.196747e-16 -4.304184e-16 -4.163868e-16 -7.778673e-19  2.031605e-16
## Comp.7   7.823402e-16  8.624448e-16  6.415369e-16 -2.407173e-16 -4.950581e-16
## Comp.8  -2.032360e-16 -1.202871e-15 -3.351177e-16  9.877165e-16  2.445775e-15
## Comp.9   1.261417e-15  9.967707e-16  9.358732e-16  1.258548e-15 -3.286640e-16
## Comp.10  2.618556e-16  1.470154e-15 -1.757242e-16  1.324940e-15 -2.757026e-17
## Comp.11  1.000000e+00  1.225829e-15 -4.719202e-16  2.634925e-16  7.437129e-16
## Comp.12  1.225829e-15  1.000000e+00 -4.355110e-16 -1.299380e-16 -3.893957e-16
## Comp.13 -4.719202e-16 -4.355110e-16  1.000000e+00 -1.366225e-15  4.838017e-16
## Comp.14  2.634925e-16 -1.299380e-16 -1.366225e-15  1.000000e+00  1.761264e-15
## Comp.15  7.437129e-16 -3.893957e-16  4.838017e-16  1.761264e-15  1.000000e+00
## Comp.16 -4.630255e-16 -8.494333e-16  1.852018e-15  6.638944e-16  8.541588e-16
## Comp.17 -1.118369e-15 -2.383811e-15 -9.664191e-17 -1.946162e-15  4.013444e-15
## Comp.18  4.686807e-16  4.818353e-16  9.795217e-16 -3.904184e-15 -2.583601e-15
## Comp.19 -1.185900e-15 -2.211004e-16 -9.818855e-16 -1.122310e-15  2.238004e-15
## Comp.20  1.972055e-16  5.032361e-16 -2.380021e-15 -2.099046e-15 -8.457771e-15
## Comp.21  3.966345e-02 -1.900281e-02 -3.736642e-03 -6.396278e-02  9.896368e-02
## Comp.22 -4.660411e-02  1.578913e-02 -1.055270e-01 -1.586519e-01 -1.172079e-01
## Comp.23  6.627937e-02  2.277461e-02  3.589016e-02  3.784641e-02  1.698165e-01
## Comp.24 -1.663969e-01 -3.761654e-01 -2.938639e-01 -5.471513e-02  1.291964e-01
##               Comp.16       Comp.17       Comp.18       Comp.19       Comp.20
## Comp.1   1.696264e-16  8.384544e-16  2.002688e-15  1.508817e-15  1.579654e-16
## Comp.2   1.905954e-16 -9.237489e-16  6.857871e-16 -2.032157e-16  1.081969e-15
## Comp.3  -5.593340e-17 -4.606962e-16  2.835739e-16 -8.743513e-16 -1.624796e-15
## Comp.4   5.924718e-16 -6.349307e-16 -1.930175e-15 -7.644958e-16  3.518967e-16
## Comp.5   6.446849e-16  1.568606e-16  5.884643e-16 -9.528696e-16 -1.323764e-15
## Comp.6   8.680878e-17  1.210716e-16 -7.222821e-16  2.906465e-16  3.517859e-16
## Comp.7  -1.024863e-15  4.689345e-16  7.888098e-16  6.800372e-16 -1.025324e-15
## Comp.8   8.003062e-16 -2.631077e-16 -3.210576e-16 -1.693469e-16  1.893397e-15
## Comp.9  -1.599559e-15  1.651898e-16  1.224203e-15 -5.163458e-16 -9.421336e-16
## Comp.10 -7.890523e-16  4.875531e-16  1.022465e-15  1.584544e-16  1.803581e-15
## Comp.11 -4.630255e-16 -1.118369e-15  4.686807e-16 -1.185900e-15  1.972055e-16
## Comp.12 -8.494333e-16 -2.383811e-15  4.818353e-16 -2.211004e-16  5.032361e-16
## Comp.13  1.852018e-15 -9.664191e-17  9.795217e-16 -9.818855e-16 -2.380021e-15
## Comp.14  6.638944e-16 -1.946162e-15 -3.904184e-15 -1.122310e-15 -2.099046e-15
## Comp.15  8.541588e-16  4.013444e-15 -2.583601e-15  2.238004e-15 -8.457771e-15
## Comp.16  1.000000e+00  7.059548e-15  2.328300e-15  1.191216e-15 -1.944070e-15
## Comp.17  7.059548e-15  1.000000e+00 -4.616353e-15  9.500629e-16  3.824335e-16
## Comp.18  2.328300e-15 -4.616353e-15  1.000000e+00  1.244276e-15 -9.280984e-15
## Comp.19  1.191216e-15  9.500629e-16  1.244276e-15  1.000000e+00  4.483573e-15
## Comp.20 -1.944070e-15  3.824335e-16 -9.280984e-15  4.483573e-15  1.000000e+00
## Comp.21  2.025343e-01  5.108472e-01  6.007043e-01  1.884813e-01  4.278068e-01
## Comp.22  1.252883e-01 -3.063308e-01  7.083177e-01 -4.863623e-01 -1.252973e-01
## Comp.23  2.739237e-02  3.806890e-01 -3.805075e-01  3.490994e-01 -7.226192e-01
## Comp.24  1.232140e-01  2.695339e-01  5.086775e-01 -3.813036e-01  3.710316e-01
##              Comp.21     Comp.22      Comp.23     Comp.24
## Comp.1  -0.041562254  0.11931627 -0.100477416  0.14175331
## Comp.2   0.111222641  0.13170054  0.025428284  0.03072999
## Comp.3   0.097027006  0.18232381  0.003361096 -0.01563205
## Comp.4   0.050254193 -0.03518933  0.014076564 -0.14658995
## Comp.5   0.043010703  0.06258250 -0.009923853  0.02313722
## Comp.6   0.101907284 -0.03921590  0.035095036  0.12017193
## Comp.7  -0.022793975  0.03805255 -0.067707533  0.01861835
## Comp.8  -0.207743484  0.03843789  0.017797340  0.08710037
## Comp.9  -0.004393539 -0.04437388 -0.014961324 -0.13633203
## Comp.10 -0.073353821  0.02353334  0.025489124 -0.06943967
## Comp.11  0.039663455 -0.04660411  0.066279369 -0.16639690
## Comp.12 -0.019002812  0.01578913  0.022774612 -0.37616542
## Comp.13 -0.003736642 -0.10552699  0.035890161 -0.29386386
## Comp.14 -0.063962785 -0.15865185  0.037846406 -0.05471513
## Comp.15  0.098963682 -0.11720790  0.169816487  0.12919642
## Comp.16  0.202534279  0.12528828  0.027392372  0.12321398
## Comp.17  0.510847244 -0.30633084  0.380689048  0.26953394
## Comp.18  0.600704251  0.70831770 -0.380507538  0.50867748
## Comp.19  0.188481268 -0.48636227  0.349099428 -0.38130358
## Comp.20  0.427806836 -0.12529729 -0.722619204  0.37103162
## Comp.21  1.000000000  0.15910117 -0.243195596  0.55928336
## Comp.22  0.159101174  1.00000000 -0.504555717  0.48854829
## Comp.23 -0.243195596 -0.50455572  1.000000000 -0.51009099
## Comp.24  0.559283357  0.48854829 -0.510090990  1.00000000
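
In the matrix above, the off-diagonal entries among Comp.1 to Comp.20 are at most about 1e-14 in magnitude, that is, zero to machine precision. The entries involving Comp.21 to Comp.24 are not. Those four components have standard deviations of essentially zero in the summary, so their scores are rounding noise and their correlations carry no information. A minimal sketch of a check that skips such components follows. `max_offdiag_cor` and its tolerance are hypothetical, and the data are a stand-in: the built-in `USArrests` data plus one synthetic noise column that mimics a degenerate component.

```r
# Hypothetical helper: largest absolute off-diagonal correlation among the
# components whose standard deviation is not numerically zero.
max_offdiag_cor <- function(scores, sdev, tol = 1e-6) {
  keep <- sdev > tol * max(sdev)   # drop degenerate components
  r <- cor(scores[, keep, drop = FALSE])
  max(abs(r[upper.tri(r)]))
}

# Stand-in data: a full-rank PCA plus one synthetic zero-variance component.
set.seed(1)
fit <- princomp(USArrests, cor = TRUE)
scores <- cbind(fit$scores, Comp.5 = rnorm(nrow(fit$scores)) * 1e-16)
sdev <- c(fit$sdev, Comp.5 = 0)

r_all <- cor(scores)
max(abs(r_all[upper.tri(r_all)]))   # inflated by the noise column
max_offdiag_cor(scores, sdev)       # effectively zero
```

For this report the equivalent call would be `max_offdiag_cor(pcaobj$scores, pcaobj$sdev)`, which leaves out Comp.21 to Comp.24 under the assumed tolerance.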
pairs.panels(pcaobj$scores, gap = 0, bg = c("green", "red", "yellow", "blue", "pink", "purple"), pch = 21)