Biology LibreTexts

R Practice: Using Linear Models to Determine the Relationship Between Biodiversity and Socioeconomic Status (Part II: Visualizing Linear Models)


     

    Technical Learning Objective: In this module, students will gain a better understanding of how to create and interpret linear regression plots using R.

     

    The Biodiversity Implications of Socioeconomic Inequality - Visualization

    In the previous module, we used a linear model to test for a relationship between the Gini Ratio of Income Inequality, which ranges from 0 (perfect equality) to 1 (complete inequality), and the percentage of declining resident bird species in a state. In this follow-up module, we will learn how to effectively visualize the results of such models.

    Data from: Mikkelson, G. M., Gonzalez, A., & Peterson, G. D. (2007). Economic inequality predicts biodiversity loss. PLoS ONE, 2(5), e444.

     

    Note: This is the second part of a two-part exercise. Please see Part I for guidance on interpreting linear model outputs.

     

    Loading Our Packages and Data 

    First, let's start by loading our packages! For this module, we will be using the tidyverse.

    library(tidyverse) # Load the tidyverse packages, which we will use for data wrangling and visualization
    
    bird <- read.csv(url("https://bio.libretexts.org/@api/deki/files/66201/GINI_BIRDS.csv?origin=mt-web"), sep = ',', header = TRUE)
    # read.csv loads our data file; assigning it to "bird" stores it in R's "brain" so that we can access it later
    
    head(bird) # head shows the first few rows of the dataset, so that we can check whether it loaded okay

    We can also get a quick feel for our data using the summary command. This is a quick way, for example, to get an idea of the range of each of our data columns and to see whether there are any extreme outliers.

    summary(bird) # Shows summary statistics (e.g., minimum, median, mean, maximum) for each numeric column in the dataset "bird"
    

     

    The Components of a ggplot object

    After we've loaded in our data and gotten a quick feel for it, we can begin to visualize our data. A ggplot object is made up of three components:

    • the data - the dataframe being plotted, which can be piped into the figure or defined within the ggplot command 
    • the aesthetics (aes) - the aesthetics of the geometric object, such as color and shape, as well as the columns to be used for the x and y axes
    • the geometries (geom) - the type of plot (e.g. point, boxplot, histogram)
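
The three components can be seen side by side in a minimal sketch. The data frame `toy` and its columns `x_var` and `y_var` are hypothetical stand-ins, not part of the bird dataset. The sketch passes the data inside the ggplot command and also pipes it in; it uses R's built-in pipe `|>` (R 4.1 or later) so that it depends only on ggplot2, and the tidyverse pipe `%>%` behaves the same way here.

```r
library(ggplot2)

# Hypothetical toy data (not the bird dataset)
toy <- data.frame(x_var = c(1, 2, 3, 4), y_var = c(2.1, 3.9, 6.2, 7.8))

# Option 1: the data are defined within the ggplot command
p_inside <- ggplot(data = toy,                    # the data
                   aes(x = x_var, y = y_var)) +   # the aesthetics
  geom_point()                                    # the geometry

# Option 2: the data are piped into the ggplot command
p_piped <- toy |>
  ggplot(aes(x = x_var, y = y_var)) +
  geom_point()
```

Both objects describe the same plot; which style to use is a matter of preference.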

     

    The type of plot you use (the geom) will depend on what type of data you are working with. Here, we are looking for the relationship between two continuous variables, so we will be using a scatterplot. We know from our previous analysis (Part I: Exploring Linear Model Outputs) that the linear relationship between these two variables is borderline insignificant. Generally, you would only include a linear model line on a regression plot if the relationship is significant. We will add a linear model line here, however, for learning purposes.

     

    The first step is to pipe our data into a plot. We do so by using the formula:

    Name of Data %>%

    ggplot(aes(x = independent variable, y = dependent variable))

     

    In the case of our experiment, we are testing whether the % declining species depends on the Gini Ratio. We will therefore place the % declining species on the y axis and the Gini Ratio on our x-axis. 

     

    IMPORTANT: After each line of code, include a brief description of what the code does to help you keep track of it all. This is called commenting. Any text that follows a # on a line will be treated as a comment by R.

    Your code should look something like this:

    bird %>%
      ggplot(aes(x = GiniRatio_1969, y = DecResBird_1966)) #plotting two continuous variables, Gini Ratio and % declining resident bird species

    You now have data added to your plot and your aesthetics set, but the plot is blank because you have not yet chosen a geom. Next, we need to add data points and a regression line. To add data points, we will use the function geom_point(). To add a regression line, we will use the function geom_smooth(method = "lm"). We set method = "lm" because we are fitting a linear model. The grey band surrounding the line is the 95% confidence interval around the fitted line, which geom_smooth draws by default. There are several other model types that can be added with geom_smooth (e.g., a generalized additive model), but these models are beyond the goals of this module.

     

    IMPORTANT: Before each addition to the ggplot command, you have to use a "+" rather than a pipe (%>%).

    bird %>%
      ggplot(aes(x = GiniRatio_1969, y = DecResBird_1966)) + #plotting two continuous variables, Gini Ratio and % declining resident bird species
      geom_point() + # Creates a scatterplot 
      geom_smooth( method = "lm" ) # Adds a fitted line for the linear relationship between the two variables 
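
To see what geom_smooth(method = "lm") is doing, it can help to compare it with a direct lm() fit. The sketch below uses simulated data (the data frame `toy` and its columns are hypothetical, not the bird dataset) and assumes the default formula y ~ x. The line drawn by geom_smooth should match the predictions from lm(), and the grey band is, by default, a 95% confidence interval that can be switched off with se = FALSE.

```r
library(ggplot2)

# Hypothetical simulated data standing in for the bird dataset
set.seed(1)
toy <- data.frame(x_var = runif(50, 0.3, 0.5))
toy$y_var <- 10 + 40 * toy$x_var + rnorm(50, sd = 3)

# Fit the linear model directly
fit <- lm(y_var ~ x_var, data = toy)
coef(fit)  # intercept and slope of the fitted line

# Scatterplot with the fitted line and its confidence band
p <- ggplot(toy, aes(x = x_var, y = y_var)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)  # level = 0.95 is the default

# Same plot, but se = FALSE draws the line without the grey band
p_no_band <- ggplot(toy, aes(x = x_var, y = y_var)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```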
    
     

    Awesome, the regression plot now has all the elements it needs!

     

    Effective Data Visualization

    There is still a long way to go to make our plot look nicer, however. Here are just a few ways to improve our visualization: 

    • We can change the labels of the x and y axes to be more descriptive using xlab("Text") and ylab("Text"). Make sure to include units for your variables where relevant!
    • We can customize the text size for the axes. We do this within our plot "theme":

    theme(axis.title.x = element_text(size = number), axis.title.y = element_text(size = number))

    • We can change the color or size of our points by setting the color and size within geom_point. 

     

    Note: Effective data visualization is a complex skill to learn! There are a lot of online resources to help you on this journey. 

     

    After you've run this code, try out some of your own changes by changing the point color (here, blue) and the size of the axis titles.

    bird %>%
      ggplot(aes(x = GiniRatio_1969, y = DecResBird_1966)) + # Plots two continuous variables, Gini Ratio and % declining resident bird species
      geom_point(size = 2, color = "blue") + # Adds points (creating a scatterplot) and makes them larger (size = 2) and blue (color = "blue")
      geom_smooth(method = "lm") + # Adds the best-fit line for the linear model between the two variables
      ylab("Declining Resident Bird Species, 1966-2005 (%)") + # Sets the y-axis label
      xlab("Gini Ratio of Family Income Inequality 1969") + # Sets the x-axis label
      theme(axis.title.x = element_text(size = 14), # Changes x-axis title font size to 14
            axis.title.y = element_text(size = 14)) # Changes y-axis title font size to 14

     

    Finally, let's add a theme to the regression plot. There are several preset themes available for you to use, but for this regression let's use a simple black and white theme.

    To do so, use the function: theme_bw()

     

    Here are some other themes. Go ahead and try them out by changing and rerunning the code! 
    theme_gray()
    theme_classic()
    theme_minimal()

     

    Here is the code again, including the theme.

    bird %>%
      ggplot(aes(x = GiniRatio_1969, y = DecResBird_1966)) + # Plots two continuous variables, Gini Ratio and % declining resident bird species
      geom_point(size = 2, color = "blue") + # Adds points (creating a scatterplot) and makes them larger (size = 2) and blue (color = "blue")
      geom_smooth(method = "lm") + # Adds the best-fit line for the linear model between the two variables
      ylab("Declining Resident Bird Species, 1966-2005 (%)") + # Sets the y-axis label
      xlab("Gini Ratio of Family Income Inequality 1969") + # Sets the x-axis label
      theme_bw() + # Applies the black and white theme; this goes before theme(), because a complete theme resets earlier theme() settings
      theme(axis.title.x = element_text(size = 14), # Changes x-axis title font size to 14
            axis.title.y = element_text(size = 14)) # Changes y-axis title font size to 14
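
One detail worth knowing when combining a preset theme with theme(): a complete theme such as theme_bw() replaces any theme settings added before it, so custom settings generally need to come after it. The sketch below illustrates this with hypothetical toy data (`toy`, `x_var`, and `y_var` are not part of the bird dataset) and an exaggerated title size of 20 so the difference is easy to spot.

```r
library(ggplot2)

# Hypothetical toy data (not the bird dataset)
toy <- data.frame(x_var = 1:5, y_var = c(2, 4, 5, 4, 6))
base_plot <- ggplot(toy, aes(x = x_var, y = y_var)) + geom_point()

# theme() first, complete theme second: the size = 20 setting is discarded
p_lost <- base_plot +
  theme(axis.title.x = element_text(size = 20)) +
  theme_bw()

# Complete theme first, theme() second: the size = 20 setting is kept
p_kept <- base_plot +
  theme_bw() +
  theme(axis.title.x = element_text(size = 20))

# calc_element() resolves the final x-axis title settings for each plot
size_lost <- calc_element("axis.title.x", p_lost$theme)$size  # theme_bw() default, not 20
size_kept <- calc_element("axis.title.x", p_kept$theme)$size  # the custom size
```

Printing p_lost and p_kept side by side shows the same points, with the larger x-axis title only in p_kept.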

     

     

    Visualizing Model Fit

    Looking at the graph, what can you say about the fit of this linear model?

     

    Answer

    The data points are widely scattered around the line of best fit, indicating that there is a weak correlation between the Gini Ratio and % declining resident bird species. There seem to be some outliers that might be skewing this relationship (e.g., the state that has 10% declining resident bird species). These data are unlikely to fit the assumptions of a linear model without additional statistical consideration.
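
A visual impression of fit can be backed up with a couple of quick checks. The following is a minimal sketch on simulated data (the data frame `toy` and its columns are hypothetical, not the bird dataset): it fits a model with lm(), extracts R-squared, and plots residuals against fitted values. Points scattered evenly around the dashed zero line, with no funnel shape or isolated extreme values, are more consistent with the assumptions of a linear model.

```r
library(ggplot2)

# Hypothetical simulated data with a weak linear relationship
set.seed(42)
toy <- data.frame(x_var = runif(48, 0.30, 0.45))
toy$y_var <- 20 + 30 * toy$x_var + rnorm(48, sd = 6)

# Fit the linear model and extract R-squared
fit <- lm(y_var ~ x_var, data = toy)
r_squared <- summary(fit)$r.squared  # proportion of variance explained by the model

# Collect fitted values and residuals for plotting
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

# Residuals vs. fitted values, with a reference line at zero
p_resid <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
```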

     

    Image: A male Kirtland's Warbler, a grey and yellow bird, in the foreground of a jack pine forest in Michigan, USA. Photo by snowmanradio, licensed under CC BY 2.0. Kirtland's Warblers are a near-threatened species that depends on jack pine habitat.

     

    References:

    Holland, T. G., Peterson, G. D., & Gonzalez, A. (2009). A cross-national analysis of how economic inequality predicts biodiversity loss. Conservation Biology, 23(5), 1304–1313. https://doi.org/10.1111/j.1523-1739.2009.01207.x

    Hope, D., Gries, C., Zhu, W., Fagan, W. F., Redman, C. L., Grimm, N. B., Nelson, A. L., Martin, C., & Kinzig, A. (2003). Socioeconomics drive urban plant diversity. Proceedings of the National Academy of Sciences, 100(15), 8788–8792. https://doi.org/10.1073/pnas.1537557100

    Kuras, E. R., Warren, P. S., Zinda, J. A., Aronson, M. F. J., Cilliers, S., Goddard, M. A., Nilon, C. H., & Winkler, R. (2020). Urban socioeconomic inequality and biodiversity often converge, but not always: A global meta-analysis. Landscape and Urban Planning, 198. https://doi.org/10.1016/j.landurbplan.2020.103799

    Mikkelson, G. M., Gonzalez, A., & Peterson, G. D. (2007). Economic inequality predicts biodiversity loss. PLoS ONE, 2(5), e444.