R Practice: Using Pattern Matching and Metadata to Connect with Local Ecology (Part I: Locally Endangered Fishes)
Technical learning objective: In this module, you will learn how to use a variety of common tidyverse functions (filter, group_by, summarize) and specialized pattern matching functions (grepl). You will also be asked to use metadata to interpret your dataset.
Whenever you see text that is highlighted in orange, you are being asked to add code of your own!
Local Endangered Species
When thinking of the word "endangered", what species comes to mind?
Likely, we first think of large, charismatic species whose conservation is most likely to make headlines. But many of us live in areas whose endangered species are much less visible. In this activity, we aim to cultivate a sense of local connection by improving your knowledge of the species endangered in your home state. We will do this using a data set that lists endangered fishes by state.
Loading Packages and Data
Let's start off by loading our packages and data. Here, we are going to do something a little different and have you name your dataset.
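The original code chunk for this step is not reproduced here. As a minimal sketch, assuming the data arrive as a CSV file, loading might look like the following. The file is a tiny stand-in written to a temporary location so the example runs on its own; the object name fishes, the STATES column name, and every value are hypothetical.

```r
library(dplyr)  # tidyverse package providing filter(), group_by(), summarize(), and %>%

# Hypothetical stand-in for the course data file, written to a temporary path
# so that this sketch is self-contained. All values are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("BASIN,STATES,TOTAL_SP,ESA_E",
    "Basin A,AZ,20,4",
    "Basin B,\"AZ, NM\",10,1"),
  csv_path
)

# Read the file and give the data set a name of your choosing (here: fishes)
fishes <- read.csv(csv_path, stringsAsFactors = FALSE)

# Look at the first rows to confirm the data loaded as expected
head(fishes)
```

In your own workspace, replace csv_path with the location of the file supplied with the module.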
Focusing on Your State
Now that we have loaded the data as a file in our workspace, we will begin to isolate the data we want to work with. Let's start by looking at a specific state. Fill in your state below and save it to the object 'my_state'. The way you enter your state must match exactly with the format used in the table (i.e., an all-caps, two-letter code). This type of data is a character "string" and needs to be surrounded by quotation marks (e.g., "AZ" for Arizona).
If you take a look at the fishes data set, you will notice that some of the entries contain multiple states. These entries represent waterways that cross state boundaries. Now that the chosen state has been stored as the object 'my_state', we have to find a way to include all entries that contain your chosen state, including those that list multiple states.
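As a small sketch of this step, the assignment could look like the following. "AZ" is only an example value, and the three checks are an optional addition, not part of the original exercise.

```r
# Store the two-letter, all-caps state code as a character string
my_state <- "AZ"  # example only; substitute your own state code

# Optional checks that the value has the expected format
is.character(my_state)          # is it a character string?
nchar(my_state) == 2            # is it two characters long?
my_state == toupper(my_state)   # is it all caps?
```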
If each entry only ever included one state, this would be much easier, as we could use the following code (make sure to replace #name# with your data set name):
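A sketch of that exact-match filter on a toy data set is shown here. The data frame stands in for #name#, and the column name STATES and all values are assumptions made for illustration; check your own table for the real column name.

```r
library(dplyr)

# Toy stand-in for the fishes data set; STATES is a hypothetical column name
fishes <- data.frame(
  BASIN  = c("Basin A", "Basin B", "Basin C"),
  STATES = c("AZ", "AZ, NM", "CA"),
  stringsAsFactors = FALSE
)
my_state <- "AZ"

# An exact comparison keeps only rows whose entry is exactly "AZ"
exact_only <- fishes %>%
  filter(STATES == my_state)

exact_only
```

In this toy example, the "AZ, NM" row is not returned even though it includes Arizona.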
Unfortunately, this code keeps only rows whose state entry is exactly the specified state, so rows that list several states are dropped. This is where the grepl() function comes in: it performs pattern matching in R, searching for a specified pattern in a vector or in a column of a data set. grepl() accepts several optional arguments, but we will use only the following two:
grepl(pattern, x) returns TRUE or FALSE for each element of x, depending on whether that element contains a match for pattern.
The first argument, pattern, is the character string we are trying to find. What would go in the pattern section of our code? That's right, we will use the my_state object we created.
The second argument, x, is the vector or column in which grepl() should search. What is the name of the state column within the data set? This will be important for the next portion of the coding exercise.
More information on pattern matching can be found online.
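To see what grepl() returns before using it on the data set, here is a minimal sketch on a made-up character vector. The vector states and its values are hypothetical.

```r
# Hypothetical state entries, some listing more than one state
states <- c("AZ", "AZ, NM", "CA", "NM, TX")
my_state <- "AZ"

# One TRUE/FALSE value per element: does the element contain the pattern?
matches <- grepl(my_state, states)
matches

# fixed = TRUE treats the pattern as literal text rather than a regular
# expression; for a plain two-letter code the result is the same
grepl(my_state, states, fixed = TRUE)
```

Both calls return TRUE for the first two elements and FALSE for the last two, because only "AZ" and "AZ, NM" contain the text "AZ".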
Now let's give this pattern matching a try. First, we will name the output table so that our result is saved; we will call it state_fish for simplicity. Then we start from the data set we loaded and use a pipe (%>%) to pass it to the filter() and grepl() functions.
Remember how we said that filter() wouldn't work? Well, it wouldn't work by itself. It does, however, work perfectly for our purposes when combined with grepl(). After the pipe, we insert the filter() function and, within it, the grepl() function, written as described above. The shell of the code has been provided, but it is up to you to finish the code block by telling grepl() which column to search for the pattern.
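The combination of filter() and grepl() can be sketched on a toy data set as follows. The column name STATES and the values are assumptions for illustration; the exercise expects you to find the real column name in your own table.

```r
library(dplyr)

# Toy stand-in for the fishes data set; STATES is a hypothetical column name
fishes <- data.frame(
  BASIN  = c("Basin A", "Basin B", "Basin C"),
  STATES = c("AZ", "AZ, NM", "CA"),
  stringsAsFactors = FALSE
)
my_state <- "AZ"

# Keep every row whose STATES entry contains the chosen state code
state_fish <- fishes %>%
  filter(grepl(my_state, STATES))

state_fish
```

Unlike an exact comparison, this keeps the multi-state row "AZ, NM" as well as the single-state row "AZ".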
Using Metadata to Understand Your Columns
Now that we have isolated the data for your desired state, we will answer the following question: Which place within your chosen state has the highest proportion of endangered fish species? Now, we have to figure out how to code the endangered species proportion for the desired region. There are many columns within the data set, so your job now is to find the two columns necessary to calculate a "proportion endangered".
But, how do you even know where to start with choosing these two columns? This brings us to the metadata, the data about the data, which is a key component of any dataset. The metadata is important because it contains the meaning and units of the columns in the data set. Please run the following code chunk to access the metadata within your workspace.
Now that you have access to the metadata, we are going to group by basins within your state (BASIN) and calculate the percentage of species endangered. We will first pipe our data set into a group_by function, then we will use summarize to calculate the percentage of species endangered.
Within the summarize function, we need to give our new column a name (per_endangered). We will set this equal to an equation that calculates the percentage of species endangered using two other columns in the data set. Hint: You will want to use a column regarding ESA designation and a column for the total fish counts for this calculation.
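A minimal sketch of the group_by() and summarize() step is given here, using the column names BASIN, ESA_E, and TOTAL_SP that appear in this module. The data values are made up, and the sketch assumes one row per basin.

```r
library(dplyr)

# Toy stand-in for state_fish with one row per basin; all values are made up
state_fish <- data.frame(
  BASIN    = c("Basin A", "Basin B", "Basin C"),
  TOTAL_SP = c(20, 10, 8),   # total number of fish species
  ESA_E    = c(4, 1, 4),     # species with an ESA endangered designation
  stringsAsFactors = FALSE
)

# Group by basin, then compute the percentage of species that are endangered
basin_summary <- state_fish %>%
  group_by(BASIN) %>%
  summarize(per_endangered = ESA_E / TOTAL_SP * 100)

basin_summary
```

With these toy numbers, the percentages are 20, 10, and 50 for Basin A, Basin B, and Basin C.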
Putting the data in descending order will allow us to more easily answer the question we posed at the beginning of this exercise. We do this using the arrange() function. By default, arrange() sorts the data in ascending order of the column named as its argument. Wrapping that column in desc() inside arrange() sorts the data in descending order instead. In this case, we will use the per_endangered column that we created.
Now, you will practice the previous block of code on your own. The chosen region size and percent endangered columns must be different from the example. If you are having trouble, please refer to the previous examples and metadata to get an output.
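The sorting step can be sketched like this on a small, made-up summary table. The object name basin_summary and its values are hypothetical.

```r
library(dplyr)

# Toy summary table; values are made up for illustration
basin_summary <- data.frame(
  BASIN          = c("Basin A", "Basin B", "Basin C"),
  per_endangered = c(20, 10, 50),
  stringsAsFactors = FALSE
)

# Sort so that the basin with the highest percentage comes first
ranked <- basin_summary %>%
  arrange(desc(per_endangered))

ranked
```

The first row of ranked is then the basin with the highest percentage of endangered species, here Basin C.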
For this code to run, you must replace #column1# and #column2# with the relevant columns from the dataset!
Great work! You can now continue onto Part II of this module, which will allow you to visualize your findings.
Which two columns did you need to use to calculate the percentage of endangered species?
Answer: Your equation should read as follows: per_endangered = ESA_E/TOTAL_SP * 100