Biology LibreTexts

R Practice: Using Pattern Matching and Metadata to Connect with Local Ecology (Part I: Locally Endangered Fishes)


     

    Technical learning objective: In this module, you will learn how to use a variety of common tidyverse functions (filter, group_by, summarize) and specialized pattern matching functions (grepl). You will also be asked to use metadata to interpret your dataset. 

    Whenever you see text that is highlighted in orange, you are being asked to add code of your own! 

     

    Local Endangered Species 

    When thinking of the word "endangered", what species comes to mind? 

    Most likely, we first think of large, charismatic species whose conservation makes headlines. But many of us live in areas with far less visible endangered species. In this activity, we aim to cultivate a sense of local connection by building knowledge of the species endangered in your home state. We will do this using a data set that lists endangered fishes by state. 

     
    Loading Packages and Data

    Let's start off by loading our packages and data. Here, we are going to do something a little different and have you name your dataset.

    library(tidyverse) #loading the tidyverse package into the R workspace
    
    #name# = read.csv(url("https://bio.libretexts.org/@api/deki/files/70943/EndangeredFishes.csv?origin=mt-web"), header = TRUE) #save your data to R's brain so you can work with it later by replacing the #name# with a descriptive name and use that name below wherever you see #name#
    
    #name#

     

    Focusing on Your State
    Now that we have loaded the data into our workspace, we will begin to isolate the data we want to work with. Let's start by looking at a specific state. Fill in your state below and save it to the object 'my_state'. The way you enter your state must match exactly the format used in the table (i.e., an all-caps, two-letter code). This type of data is a character "string" and needs to be surrounded by quotation marks (e.g., "AZ" for Arizona).
    my_state = #"yourstate"#  #fill in the two-letter code for your chosen state here, making sure to add quotation marks

     

    If you take a look at the fishes data set, you will notice that some of the entries contain multiple states. These entries represent waterways that cross state boundaries. Now that the chosen state has been stored as the object 'my_state', we have to find a way to include all entries that contain your chosen state, including those that list multiple states. 

    If each entry only ever included one state, this would be much easier, as we could use the following code (make sure to replace #name# with your data set name):

    #name# %>%
      filter(HU_8_STATE == my_state)

     

    Unfortunately, this code only keeps rows whose state entry is exactly the specified state, so entries that list multiple states are dropped. This is where the grepl() function comes in: it allows us to conduct pattern matching in R. The function searches for matches to a specified pattern within a vector or a column of a data set. Pattern matching functions can take many forms and many arguments, but we will only be looking at the following:

    grepl(pattern, x) is a pattern matching function that identifies matches: for each element of x, it returns TRUE if the pattern is found and FALSE if it is not.
    The first argument, pattern, is the character string that we are trying to find. What would go in the pattern section of our code? That's right, we will use the state object we created (my_state).
    The second argument, x, is the vector or column in which the function should search. What is the name of the state column within the data set? This will be important for the next portion of the coding exercise. 
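To see the difference between exact comparison and pattern matching, here is a minimal sketch on a made-up vector. The values and the comma separator are assumptions for illustration only and are not taken from the fishes data set. The optional fixed = TRUE argument tells grepl() to treat the pattern as plain text rather than as a regular expression.

```r
# Toy vector standing in for a state column (values invented for illustration)
states <- c("AZ", "AZ,NM", "CA", "NV,AZ,UT", "NM")
my_state <- "AZ"

# Exact comparison: TRUE only when the whole entry equals "AZ"
exact_match <- states == my_state

# Pattern matching: TRUE whenever "AZ" appears anywhere in the entry
pattern_match <- grepl(my_state, states, fixed = TRUE)

exact_match
pattern_match
```

Only the first entry passes the exact comparison, while grepl() also flags the two multi-state entries that include "AZ".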

    More information on pattern matching can be found online.


    Now let's give this pattern matching a try. First, we will give the result a name so that we have an output table; we will name it state_fish for simplicity. Then we start from the data set we loaded and, using a pipe (%>%), apply the filter() and grepl() functions. 

    Remember how we said that filter() wouldn't work? Well, it wouldn't work by itself. It does, however, work well for our purposes when combined with the grepl() function. After the pipe, we insert the filter() function and, within it, the grepl() function, written as explained above. The shell of the code has been provided, but it is up to you to finish the code block by telling grepl() which column to search for the pattern.

    state_fish = #naming the output table
    #name# %>% #please insert the chosen name for the data set
     filter(grepl(my_state, )) #keep only rows that contain your state code. You have to fill in the x part of grepl(), which is the vector or column to search. If you are having trouble filling this in, consider which column in our data set contains the state code data.
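As a rough illustration of how filter() and grepl() combine, the sketch below runs the same pattern on a small invented data frame. The names toy_fish, waterway, and states are hypothetical and do not come from the fishes data set, so you will still need to find the real column name yourself.

```r
library(dplyr)

# Hypothetical toy data; names and values are invented for illustration
toy_fish <- data.frame(
  waterway = c("Creek A", "River B", "Lake C", "River D"),
  states   = c("AZ", "AZ,NM", "CA", "NV,UT"),
  stringsAsFactors = FALSE
)
my_state <- "AZ"

# grepl() returns TRUE/FALSE for each row; filter() keeps the TRUE rows
toy_state_fish <- toy_fish %>%
  filter(grepl(my_state, states))

toy_state_fish
```

The result keeps both the single-state row and the multi-state row that mentions "AZ".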
     
    Using Metadata to Understand Your Columns

    Now that we have isolated the data for your desired state, we will answer the following question: Which place within your chosen state has the highest proportion of endangered fish species? Now, we have to figure out how to code the endangered species proportion for the desired region. There are many columns within the data set, so your job now is to find the two columns necessary to calculate a "proportion endangered". 

    But, how do you even know where to start with choosing these two columns? This brings us to the metadata, the data about the data, which is a key component of any dataset. The metadata is important because it contains the meaning and units of the columns in the data set. Please run the following code chunk to access the metadata within your workspace.

    fish_readme = read.csv(url("https://bio.libretexts.org/@api/deki/files/70942/readme_EndangeredFishes.csv?origin=mt-web"), header = TRUE) 
    #loading in the metadata file, please read it and pay attention to the columns and their meanings. 
    
    fish_readme #now, take a look at your metadata

     

    Now that you have access to the metadata, we are going to group by basins within your state (BASIN) and calculate the percentage of species endangered. We will first pipe our data set into a group_by function, then we will use summarize to calculate the percentage of species endangered. 

    Within the summarize function, we need to give our new column a name (per_endangered). We will set this equal to an equation that calculates the percentage of species endangered using two other columns in the data set. Hint: You will want to use a column regarding ESA designation and a column for the total fish counts for this calculation.

    Putting the data in descending order will allow us to more easily answer the question we posed at the beginning of this exercise. We do this using the arrange() function. By default, arrange() sorts the data in ascending order by the column named as its argument. By wrapping that column in desc() within arrange(), we will see the data sorted in descending order. In this case, we will use the per_endangered column that we created. 
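The grouping, summarizing, and sorting steps described above can be sketched on a small invented table. Everything here (toy_counts, region, n_listed, n_total, per_listed) is a hypothetical stand-in, not the real data set, and the toy table has one row per region so that each group yields a single value.

```r
library(dplyr)

# Hypothetical toy counts; names and numbers are invented for illustration
toy_counts <- data.frame(
  region   = c("North", "South", "East"),
  n_listed = c(5, 10, 1),
  n_total  = c(20, 20, 8),
  stringsAsFactors = FALSE
)

toy_summary <- toy_counts %>%
  group_by(region) %>%                                  # one group per region
  summarize(per_listed = (n_listed / n_total) * 100) %>% # percentage per group
  arrange(desc(per_listed))                             # largest percentage first

toy_summary
```

Because of desc(), the region with the largest percentage appears in the first row.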

    Now, you will practice the previous block of code on your own. The chosen region size and percent endangered columns must be different from the example. If you are having trouble, please refer to the previous examples and the metadata to get an output.


    For this code to run, you must replace #column1# and #column2# with the relevant columns from the dataset! 

    state_fish %>% #specifying the data frame we are working with
      group_by(BASIN) %>% #using the function to group by specific regions
      summarize(per_endangered = (#column1#/#column2#) * 100) %>% #calculating the percentage of endangered species from the two columns you fill in
      arrange(desc(per_endangered)) #arranging our data by descending order with the percentage column

    Great work! You can now continue onto Part II of this module, which will allow you to visualize your findings. 

     

    Hint for Endangered Species Equation

    Which two columns did you need to use to calculate the percentage of endangered species?

     

    Answer

    Your equation should read as follows: per_endangered  = ESA_E/TOTAL_SP * 100