Monday, December 18, 2017

Assignment 6

Goals: The goals of this assignment are to gain knowledge of regression analysis in SPSS, to manipulate data in Excel, and to map standardized residuals in ArcGIS.

Part 1:

Introduction:
Crime and poverty are always topics of interest for policy makers. A particular town was included in a study on crime rates and poverty. The town's news station stated that, according to a particular data source, crime increases as the number of kids who get free lunches increases. This is a strong claim, and it deserves a closer look at the data. To see if the claim is correct, SPSS will be used to run a regression and observe whether there actually is a relationship between the number of kids who receive free lunch and crime rates.

Methods:
To determine whether there is a relationship between the independent variable (free lunches) and the dependent variable (crime rates), the data provided in an Excel document was used to create a scatter plot with an OLS trend line (Figure 1). After that, the same data was loaded into SPSS and a linear regression analysis was run to show the linear relationship between the two variables (Figure 2). This analysis is important because it reports the regression coefficient (b), which describes how responsive the dependent variable is to a change in the independent variable. Put simply, its sign gives the direction of the relationship between the two variables. Additionally, the analysis provides the coefficient of determination (r²), which represents how much of the variation in the dependent variable is explained by the independent variable.
Figure 1: Scatter Plot Between Independent and Dependent Variables 



Figure 2: Linear Regression Analysis
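
The regression coefficient (b) and the coefficient of determination (r²) described in the Methods can also be worked out by hand. The sketch below is only an illustration of the formulas, not the SPSS procedure itself, and the data values and the function name `ols_fit` are made up for the example.

```python
# Hypothetical data, NOT the assignment's values: percent of kids receiving
# free lunch (x) and crime rate (y) for a handful of made-up areas.
free_lunch_pct = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0]
crime_rate = [40.0, 52.0, 50.0, 70.0, 66.0, 85.0]


def ols_fit(x, y):
    """Return (slope b, intercept a, r squared) for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r^2 = 1 - SSE / SST: the share of the variation in y explained by x.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - sse / sst
    return slope, intercept, r_squared


slope, intercept, r_squared = ols_fit(free_lunch_pct, crime_rate)
print(f"b = {slope:.3f}, a = {intercept:.3f}, r^2 = {r_squared:.3f}")
```

The sign of b gives the direction of the relationship, while r² (always between 0 and 1) describes how well the line fits the points.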

Results:
By looking at the scatter plot and trend line, it is clear there is a somewhat positive relationship between the percentage of kids getting free lunch and crime rates. This relationship can be confirmed by looking at the regression coefficient (0.173), which shows the positive slope. Because 0.173 is much closer to 0 than it is to 1, the relationship between the two variables is weak. It is also important to pay attention to the significance level: 0.005 (two-tailed test). In terms of the hypothesis test for this analysis, the null hypothesis would be rejected, meaning there is a relationship between the independent and dependent variables. If a new area of town turned out to have 30% of its kids receiving free lunch, the predicted crime rate would be 72.38. This result comes from the trend line equation y = 1.6852x + 21.819 with x = 30. Although these variables show an existing relationship, it would not be wise to base assumptions on these results because the relationship is so weak.
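
The prediction for a new area can be checked with a few lines of code. This is a small sketch that only plugs a value into the trend line equation quoted above; the function name `predict_crime_rate` is made up for the example.

```python
# Slope and intercept are the trend line values quoted in the Results.
SLOPE = 1.6852
INTERCEPT = 21.819


def predict_crime_rate(free_lunch_pct):
    """Predicted crime rate for a given percent of kids receiving free lunch."""
    return SLOPE * free_lunch_pct + INTERCEPT


predicted = predict_crime_rate(30)
print(f"Predicted crime rate at 30% free lunch: {predicted:.2f}")
```

The arithmetic is 1.6852 × 30 + 21.819 = 72.375, which rounds to the 72.38 reported above.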

Conclusion:
According to the results from the scatter plot and the linear regression analysis, the local news station was correct in saying that there was a positive relationship between the percentage of kids that receive free lunch and crime rates. However, the relationship between the two variables is very weak, which means that there might be other factors that affect each of these variables more than they affect each other. To tie these variables together without knowing the cause would be misleading. Correlation does not imply causation. These variables should be examined individually before the news station shares this information with the town.

Part 2:

Introduction: 
The city of Portland, Oregon is worried about whether the responses to 911 calls are sufficient. For better understanding, different factors will be looked at to help determine what areas most of the calls are coming from. This information will be useful to a company that wants to construct a new hospital so they can know where to place the ER and how big to build it. By analyzing which areas most of the 911 calls are coming from, the most accurate area of placement can be determined for the proposed hospital. 

Methods:
To start, data on the number of 911 calls per census tract, the number of people with no high school degree, the number of people unemployed, and the number of people who were foreign born were used to create three scatter plots in Excel (Figures 3, 5, and 7). The number of calls is always the dependent variable, and the other three variables are the independent variables. Next, using SPSS, with calls as the dependent variable again and the other three as independent variables, a linear regression analysis was performed for each of the three combinations of data (Figures 4, 6, and 8). After that, ArcGIS was used to create a choropleth map showing the number of 911 calls per census tract (Figure 9). An additional standardized residual map was created for the variable with the largest r² value (Figure 10). This map shows, in standard deviations, how far each tract falls above or below the regression line. All of this information will aid in testing the hypothesis. The null hypothesis: there is no linear relationship between the number of 911 calls and the number of people with no high school degree, the number of people unemployed, and the foreign born population. The alternative hypothesis: there is a linear relationship between the number of 911 calls and the number of people with no high school degree, the number of people unemployed, and the foreign born population.
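
To make the idea behind the standardized residual map concrete, here is a rough sketch of the calculation. It assumes a standardized residual is the raw residual divided by the standard error of the estimate, sqrt(SSE / (n - 2)); ArcGIS may scale its residuals slightly differently. The tract values and the function name `standardized_residuals` are made up for the example.

```python
import math

# Hypothetical tract-level data, NOT the Portland values: people with no high
# school degree (x) and number of 911 calls (y) per census tract.
no_hs_degree = [50.0, 120.0, 200.0, 310.0, 400.0, 520.0, 610.0]
calls = [10.0, 25.0, 30.0, 70.0, 60.0, 95.0, 100.0]


def standardized_residuals(x, y):
    """Fit y = a + b*x by OLS and return each residual divided by the
    standard error of the estimate, sqrt(SSE / (n - 2))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [r / std_error for r in residuals]


z_scores = standardized_residuals(no_hs_degree, calls)
for tract, z in enumerate(z_scores, start=1):
    side = "above" if z > 0 else "below"
    print(f"tract {tract}: {z:+.2f} ({side} the regression line)")
```

A tract with a large positive value had more calls than the regression predicted, and a large negative value means fewer calls than predicted.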

Results:
Figure 3: Low Education and 911 Calls Graph

Figure 4: Low Education and 911 Calls Analysis
Looking at Figures 3 and 4, the trend line has a positive slope. The regression coefficient of 0.567 also shows a positive slope and a fair strength between the variables. The significance level is 0.000, meaning that there is a significant linear relationship between 911 calls and the number of people with no high school degree. This means that we can reject the null hypothesis. The equation y = 0.166x + 3.931 tells us that each additional person without a high school degree is associated with an increase of 0.166 in the number of 911 calls.


Figure 5: Unemployment and 911 Calls Graph

Figure 6: Unemployment and 911 Calls Analysis
Looking at Figures 5 and 6, the trend line has a positive slope. The regression coefficient of 0.543 also shows a positive slope and a fair strength between the variables. The significance level is once again 0.000, meaning there is a significant linear relationship between 911 calls and the number of people who are unemployed. This means that we can reject the null hypothesis. The equation y = 0.507x + 1.106 tells us that each additional unemployed person is associated with an increase of 0.507 in the number of 911 calls.

Figure 7: Foreign Born Population and 911 Calls Graph
Figure 8: Foreign Born Populations and 911 Calls Analysis
Looking at Figures 7 and 8, the trend line has a positive slope. The regression coefficient of 0.552 also shows a positive slope and a fair strength between the variables. The significance level is once again 0.000, meaning there is a significant linear relationship between 911 calls and the number of people who were foreign born. This means that we can reject the null hypothesis. The equation y = 0.080x + 3.043 tells us that each additional foreign born person is associated with an increase of 0.080 in the number of 911 calls.

Figure 9: Choropleth Map
Figure 10: Standardized Residual Map

Figure 9 displays the choropleth map of the number of 911 calls per census tract in Portland, OR. The darkest blue tracts represent the highest numbers of 911 calls, ranging down to the lightest blue, which represents very few calls. It looks like the most calls are coming from the census tracts located in the upper middle of Portland. The independent variable that had the highest r² value turned out to be the number of people with no high school degree. Figure 10 maps the standardized residuals from the regression of 911 calls on the number of people with no high school degree per census tract. The standardized residuals show which census tracts fell above or below the regression line, with red meaning above and blue meaning below. In other words, these areas have more or fewer calls than the regression line predicted. When comparing Figure 9 and Figure 10, the two dark red census tracts on Figure 10 are also shown as dark blue on Figure 9. This makes sense considering both maps show a significant number of 911 calls coming from those census tracts.

Conclusion: 
On Figure 10, two census tracts are highlighted in between two other census tracts that were over 2.5 standard deviations. The two highlighted tracts were chosen because they are located directly between the two census tracts with high 911 calls. This would be a good area for a potential hospital. Although it would be located in an area that doesn't have the highest calls, it would be directly between two areas that do have high calls, so people in need of an ER on either side would be about equally close. Although the independent variable that had the highest r² value turned out to be the number of people with no high school degree, the other two variables had very similar r² values and should also be considered when examining locations for potential hospitals.


Monday, December 4, 2017

Assignment 5

Introduction: 
The purpose of this assignment was to learn how to analyze correlation and spatial autocorrelation while using Excel, SPSS and GeoDa. This assignment is separated into two sections and each part uses different data and requires critical thinking skills to understand the correlations and explain the patterns.

Part 1: Correlation
Question 1:
For part one of this assignment, the data provided on distance (ft) and sound level (dB) was used to create a scatter plot with a trend line in Excel (Figure 1).

Figure 1: Scatter Plot
Next, SPSS was used to run a bivariate correlation to find the Pearson Correlation (Figure 2).

Figure 2: Pearson Correlation
Now, with all of the necessary information, the hypotheses can be stated. The null hypothesis is that there is no correlation between distance (ft) and sound level (dB). The alternative hypothesis is that there is a correlation between distance (ft) and sound level (dB). The Pearson Correlation of -0.896 shows that there is a negative correlation between the variables. It is a strong negative relationship between sound level and distance because the Pearson Correlation is so close to -1. Also, the significance level ended up being 0.000, which is smaller than 0.005, so we can reject the null hypothesis because there is clearly a correlation between distance and sound level.
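
For reference, the Pearson Correlation can also be calculated by hand. The sketch below only illustrates the formula, not the SPSS output; the readings and the function name `pearson_r` are made up for the example.

```python
import math

# Hypothetical readings, NOT the assignment's data: sound level falls as
# distance from the source grows.
distance_ft = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
sound_db = [95.0, 90.0, 82.0, 80.0, 71.0, 68.0]


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the product
    of their individual spreads, so the result lies between -1 and 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    co_dev = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    dev_x = sum((xi - mean_x) ** 2 for xi in x)
    dev_y = sum((yi - mean_y) ** 2 for yi in y)
    return co_dev / math.sqrt(dev_x * dev_y)


r = pearson_r(distance_ft, sound_db)
print(f"Pearson r = {r:.3f}")
```

A value near -1 means the points fall close to a downward-sloping line, which is the pattern described above.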

Question 2: 
After that, a correlation matrix was created in SPSS using data from census tracts and population in Detroit, Michigan (Figure 3). This data holds information about different races in Detroit as well as other variables like Finance, Bachelor's Degrees, Median Household Income, etc. The correlation matrix shows relationships between each of these variables.

Figure 3: Correlation Matrix
By looking at the correlation matrix we can observe a few patterns. First of all, there is a positive relationship between the White population and having a bachelor's degree. The significance level for the two-tailed test is 0.000, marked by two asterisks indicating that this correlation is significant at the 0.01 level. Black and Asian populations also showed a correlation with having a bachelor's degree, just not as strong as for the White population. The Hispanic population is the only one that doesn't have a correlation. Also, there is a correlation between the Black and Asian populations and the manufacturing industry. The significance is at the 0.01 level for the Black population and the 0.05 level for the Asian population. The correlations for the White and Hispanic populations are not significant for the manufacturing industry. Another notable pattern is between the retail industry and the White, Black, and Asian populations. White and Asian both have positive correlations while the Black population has a negative correlation. It appears that once again the Hispanic population doesn't have a significant correlation with this category; this seems to be the trend among the other variables as well. A few observations I could make from the results of this correlation matrix are that the White and Asian populations seem to have a positive correlation with almost all of the variables listed. Black populations, on the other hand, seem to have negative correlations with each of the variables.

Part 2: Spatial Autocorrelation

Introduction:
For the second part of this assignment, I have been asked by the Texas Election Commission (TEC) to analyze the patterns in the presidential elections of 1980 and 2012. The TEC has provided data from those election years in hopes that observations can be made about different patterns, so they can inform the governor of Texas whether or not the voting patterns have changed over the 32 years. The data provided by the TEC is voter turnout and the percent Democratic vote for both election years. The information I will need to retrieve myself is the percent Hispanic population for 2010 as well as the Texas state shapefile from the US Census.

Methods:
To start, the percent Hispanic population for 2010 in Texas was downloaded from the US Census website. The data included a lot of other unneeded variables, so the columns in the Excel file were reduced to just the percent Hispanic. Next, the Texas shapefile was downloaded from the US Census website as well. After all of the data components were collected, I opened ArcMap, added the Texas shapefile, and joined the percent Hispanic data and the data provided by the TEC through the Geo_ID field. After all the tables were joined, I exported the new combined data as a shapefile so I could then open it in GeoDa and start to analyze the relationships between percent Hispanic population, voter turnout, and the percent Democratic vote to determine whether there is spatial autocorrelation in each of them. Once my new Texas shapefile was loaded into GeoDa, I created a spatial weights file with rook contiguity. Then, scatter plots representing the Moran's I were created for voter turnout and the percent Democratic vote for the years 1980 and 2012 as well as the percent Hispanic population for 2010 (5 scatter plots total). LISA cluster maps were also created for each of the five variables.
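
To give a sense of what the Moran's I statistic measures, here is a minimal sketch on a small grid. It is not how GeoDa works internally; it assumes rook contiguity means sharing an edge and that the weights are row-standardized (each location's neighbors are averaged). The grid values and the function names `rook_neighbors` and `morans_i` are made up for the example.

```python
# Hypothetical 4x4 grid of values standing in for county polygons; high
# values are clustered on the left, low values on the right.
grid = [
    [9.0, 8.0, 2.0, 1.0],
    [8.0, 9.0, 1.0, 2.0],
    [9.0, 7.0, 2.0, 1.0],
    [8.0, 9.0, 1.0, 3.0],
]


def rook_neighbors(row, col, n_rows, n_cols):
    """Cells sharing an edge (not just a corner) with the cell at (row, col)."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < n_rows and 0 <= c < n_cols]


def morans_i(values):
    """Global Moran's I using row-standardized rook contiguity weights."""
    n_rows, n_cols = len(values), len(values[0])
    n = n_rows * n_cols
    mean = sum(sum(row) for row in values) / n
    numerator = 0.0
    denominator = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            dev = values[r][c] - mean
            neighbors = rook_neighbors(r, c, n_rows, n_cols)
            # Spatial lag: average deviation of the neighbors (each weight is
            # 1 / number of neighbors, so every row of weights sums to 1).
            lag = sum(values[nr][nc] - mean for nr, nc in neighbors) / len(neighbors)
            numerator += dev * lag
            denominator += dev ** 2
    return numerator / denominator


print(f"Moran's I = {morans_i(grid):.3f}")
```

Clustered values like these give a positive Moran's I, while a checkerboard pattern of alternating high and low values gives a value of -1 under these weights.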

Results:
***Before displaying the map results, some background information about what the LISA cluster maps are showing is needed first. The red shown on the maps represents areas of High-High (+,+), which means that a specific area of high value is surrounded by other areas of high value. The pink shown on the maps represents areas of High-Low (+,-) which means that an area of high value is surrounded by low values and is considered an outlier. The blue shown on the maps represents areas of Low-Low (-,-), which means that an area of low value is surrounded by other areas of low value. Lastly, the light blue shown on the maps represents areas of Low-High (-,+), which means that an area of low value is surrounded by areas of high value and is also considered an outlier. Any white value indicates no significance. This information is important to know when analyzing LISA maps.
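
The four categories can be summed up in a short sketch. This only shows the labeling logic and deliberately leaves out the significance test that decides which areas are left white on the map; the county values and the function name `lisa_quadrant` are made up for the example.

```python
def lisa_quadrant(value, neighbor_mean, overall_mean):
    """Label a location by comparing its own value and the mean of its
    neighbors with the overall mean. Significance testing is left out."""
    high_self = value > overall_mean
    high_neighbors = neighbor_mean > overall_mean
    if high_self and high_neighbors:
        return "High-High"
    if high_self and not high_neighbors:
        return "High-Low"
    if not high_self and not high_neighbors:
        return "Low-Low"
    return "Low-High"


# Hypothetical counties: (name, own value, mean of neighboring counties).
overall_mean = 50.0
counties = [
    ("A", 70.0, 65.0),
    ("B", 72.0, 30.0),
    ("C", 20.0, 25.0),
    ("D", 28.0, 68.0),
]
labels = {name: lisa_quadrant(v, nb, overall_mean) for name, v, nb in counties}
for name, label in labels.items():
    print(name, label)
```

Here county A would be red (High-High), B pink (High-Low), C blue (Low-Low), and D light blue (Low-High).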

1980 Voter Turnout: The scatter plot (Figure 4) and the LISA map (Figure 5) help represent the data visually. The scatter plot shows a slight positive correlation with a Moran's I of 0.468. The LISA map shows high voter turnout in northern Texas as well as a small area in the center. Low voter turnout areas are in southern and eastern Texas.
Figure 4: Scatter Plot of Voter Turnout in 1980
Figure 5: LISA Map of Voter Turnout in 1980
2012 Voter Turnout: The scatter plot (Figure 6) for voter turnout in 2012 also has a very slight positive correlation with a Moran's I of 0.336. The LISA map (Figure 7) once again shows a low voter turnout for southern Texas. However, the year 2012 didn't have as many areas with high voter turnout.
Figure 6: Scatter Plot of Voter Turnout in 2012
Figure 7: LISA Map of Voter Turnout in 2012

1980 Percent Democratic Vote: The scatter plot (Figure 8) for the percent Democratic vote in 1980 has a stronger positive correlation than the others so far, with a Moran's I of 0.575. The LISA map (Figure 9) shows a low percent of Democratic voters in the northern part of the Texas panhandle. There is a high percentage of Democratic voters in the southern tip of Texas as well as some of the eastern side.


Figure 8: Scatter Plot of Percent Democratic Voters 1980
Figure 9: LISA Map of Percent Democratic Voters 1980
2012 Percent Democratic Vote: The scatter plot (Figure 10) for the percent Democratic vote in 2012 shows another strong positive correlation, with a Moran's I of 0.696. The LISA map (Figure 11) shows some interesting differences between the percent Democratic vote in 2012 compared to 1980. It looks like there were more counties with a low percent of Democratic voters this time around. Also, high areas of Democratic voters were now located in counties on the western side of the state instead of the eastern side.
Figure 10: Scatter Plot of Percent Democratic Voters 2012

Figure 11: LISA Map of Percent Democratic Voters 2012
Percent Hispanic Population: The scatter plot (Figure 12) for the percent Hispanic population also shows a high positive correlation, with a Moran's I of 0.779. The LISA map (Figure 13) for the percent Hispanic population shows counties with low percents in the northeastern part of the state and counties with high percents in the southwestern part of the state. This LISA map also showed the fewest outliers of all five.


Figure 12: Scatter Plot of Percent Hispanic Populations 
Figure 13: LISA Map of Percent Hispanic Populations
Conclusion:
Scatter plots and LISA maps are great ways to represent and visualize different sets of data so it is easier to find autocorrelations. A few patterns can be seen in these maps. First of all, counties with high percents of Democratic voters tend to be counties that also have high percents of Hispanic populations. This makes sense due to the fact that most Hispanic populations do tend to vote Democratic. An interesting difference that relates back to the study question is that there were fewer counties with high voter turnout in 2012 compared to 1980. This could have been influenced by who was running for president at the time. Another interesting observation was the shift in location of the high percent Democratic vote from 1980 to 2012. There was a high percent Democratic vote on the eastern side of Texas in the 1980 election, but in the 2012 election those same counties became insignificant and counties across the state on the western side became high percent. The cause of that transition would be of interest to look into further. Another notable observation is from Figure 13. This LISA map shows the fewest outliers, but those counties are located in interesting areas where there must be a significant reason for that outcome. Overall, the TEC could inform the governor of Texas that there has been a decrease in voter turnout since 1980 and that counties that have high percents of Hispanic populations will most likely have high percents of Democratic votes.