24 | 04 | 2017

GREP formulas

GREP (short for "globally search for a regular expression and print") is a command-line utility for finding regular expressions in plain text; editors that support it can also replace what they find. It was created by Ken Thompson in 1973, and it has been widely used by programmers all around the globe ever since. And I personally love it! You will find it tremendously useful, with applications in many areas of research. I use it mainly for editing genetic data (e.g. sequences, alignments, and nexus files such as phylogenetic trees), for GIS code (e.g. in Python), and for my long R scripts. R has its own grep functions, but you can also use GREP in TextWrangler for Mac.

To learn GREP, I strongly recommend reading Chapter 8 of the TextWrangler Manual. You can also check this group: http://groups.google.com/group/textwrangler.

First, a few tips and the basic syntax I use most often:

.    any character except a line break (that is, a carriage return)
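*   zero or more of the preceding item (so .* matches anything)
+   one or more of the preceding item
?    zero or one of the preceding item (that is, optional)
+?  one or more times, matching as few as possible (non-greedy)
*?  zero or more times, matching as few as possible (non-greedy)
??  zero or one time, preferring zero (non-greedy)
^   beginning of a line (unless used in a character class, where it means "except")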
$   end of line (unless used in a character class)
\    escape, unless combined with special commands (a few below)
\z   at the end of the document 
\b   for indicating word boundaries
[xyz]   any one of the characters x, y, z
[^xyz] any character except x, y, z
[a-z]    any character in the range a to z
[a-zA-Z0-9]  any character from a-z, A-Z, or 0-9
[-A-Z]  a dash or A - Z
[--A]    any character in the range from - to A
[^aeiou0-9]   any character that is neither a vowel nor a digit
\r   line break (carriage return)
\t   tab
\f   page break (form feed)
\\   backslash
\s   any whitespace character (space, tab, carriage return, line feed, form feed)
\S   any non-whitespace character (any character not included by \s)
\w  any word character (a-z, A-Z, 0-9, _, and some 8-bit characters)
\W  any non-word character (all characters not included by \w, including carriage returns)
\d   any digit (0-9)
\D   any non-digit character (including carriage return)
{197}  the preceding item, exactly 197 times
\1, \2, …, \99   the text matched by the nth subpattern (or group) of the search pattern. Subpatterns are delimited by parentheses () in the search pattern.
|   or (alternation)
(?(condition)yes-pattern|no-pattern)  if-then-else
&   the text matched by the entire search pattern
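
To see a few of these symbols in action outside TextWrangler, here is a minimal sketch using Python's built-in re module. Python's flavor is close to the one described above but not identical (for example, in a Python replacement the whole match is written \g<0> rather than &), and the sample strings are made up for illustration.

```python
import re

text = "<a><b>"

# Greedy: .+ takes as much as it can, up to the LAST ">".
greedy = re.search(r"<.+>", text).group(0)

# Non-greedy: .+? stops at the FIRST ">" that lets the pattern match.
lazy = re.search(r"<.+?>", text).group(0)

# Subpatterns and backreferences: swap two captured words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "grandiflora Espeletia")

# \b, \d and {n}: numbers made of exactly three digits.
three_digits = re.findall(r"\b\d{3}\b", "12 345 6789")

print(greedy, lazy, swapped, three_digits)
```

The difference between greedy and non-greedy matching is the one that bites most often, so it is worth trying both on a small string before running a big Replace All.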

Here are some examples of what you can do:

Note: I use [FIND] for the search pattern (the text you need to enter in the "Find" field), and [REPLACE] for the replacement pattern.

For alignments (using TextWrangler):

How do I remove a tail of sequences in the alignment? Suppose that I want to delete all the bases from position 598 to the end, in a huge alignment containing hundreds of sequences. Enter:
[FIND] ^(.{597})(.+)$
[REPLACE] \1
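
If you prefer to script it, the same trimming can be sketched in Python. This is only an illustration: the toy sequences and the cut-off of 4 characters (instead of 597) are made up so that the result is easy to check by eye.

```python
import re

# Toy alignment lines; real ones would be far longer.
lines = ["ACGTACGTAC", "TTGACCAGTA"]

keep = 4  # hypothetical cut-off; the example in the text keeps 597 characters

# Group 1 is the part to keep, group 2 is the tail to throw away.
pattern = re.compile(r"^(.{%d})(.+)$" % keep)
trimmed = [pattern.sub(r"\1", line) for line in lines]

print(trimmed)
```

Lines shorter than the cut-off do not match the pattern at all, so they are left untouched.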
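How do I replace all the ambiguities in a nexus alignment with "-"? If the first base is in position 29, then enter:
[FIND] ^(.{28})(.*?)[RYKMSW]
[REPLACE] \1\2-
Because the pattern is anchored at the beginning of the line, each pass replaces only the first remaining ambiguity on each line, so repeat Replace All until nothing more is found.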
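
In a script the anchored search has to be repeated until nothing changes, or the line can simply be split at the first base. Below is a minimal Python sketch of both routes; the function names, the toy line, and the start position of 5 (instead of 29) are assumptions made for illustration.

```python
import re

def mask_with_anchor(line, first_base=29):
    """Repeat the anchored replacement until the line stops changing."""
    pattern = re.compile(r"^(.{%d})(.*?)[RYKMSW]" % (first_base - 1))
    previous = None
    while previous != line:
        previous = line
        line = pattern.sub(r"\1\2-", line)
    return line

def mask_by_split(line, first_base=29):
    """Leave the name columns alone and mask the ambiguities in the rest."""
    head, tail = line[:first_base - 1], line[first_base - 1:]
    return head + re.sub(r"[RYKMSW]", "-", tail)

# The name "SMK_" contains ambiguity letters but must not be altered.
line = "SMK_ACGRTYA"
print(mask_with_anchor(line, first_base=5))
print(mask_by_split(line, first_base=5))
```

Both functions protect the columns before the first base, which is the whole point of the (.{28}) group in the search pattern.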

Suppose that you have a large alignment with hundreds of species, and the names have the form "E_grandiflora_3689". However, the program you want to use only accepts names of up to 10 characters. Do you go and change each name manually? Nooooo, use GREP search! In this example you want "E_gran3689". How do you shorten the names to the first six characters plus the number?
[FIND] ^(.{6})([a-zA-Z]+)?(\D+)?(\d+)?(\s+)?
[REPLACE] \1\4\t
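
Here is how the same search and replace behaves in Python, as a quick way to check what each group captures. The sample line (a name followed by a space and a short sequence) is made up.

```python
import re

# Group 1: first six characters; group 2: the rest of the letters;
# group 3: the separator; group 4: the number; group 5: trailing whitespace.
pattern = re.compile(r"^(.{6})([a-zA-Z]+)?(\D+)?(\d+)?(\s+)?")

line = "E_grandiflora_3689 ACGTACGT"
short = pattern.sub(r"\1\4\t", line)

name, sequence = short.split("\t")
print(name, len(name), sequence)
```

Only groups 1 and 4 are kept, and the whitespace after the name is replaced by a single tab.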


For editing phylogenetic trees

How do you delete all the branch supports <70%?
[FIND] \[&support=(([0-9]|[0-6][0-9])\.[0-9])\]
[REPLACE] (leave empty)
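
A minimal Python sketch of this filter follows. The tree string is a made-up toy example, and the two-digit branch is written [0-6][0-9] here so that only digits are accepted in the second position.

```python
import re

# Hypothetical tree with support annotations after each internal node.
newick = ("((A:0.1,B:0.2)[&support=65.3]:0.05,"
          "(C:0.1,D:0.3)[&support=98.0]:0.02,E:0.4)[&support=7.5];")

# One digit (0-9) or two digits starting with 0-6, then one decimal.
pattern = re.compile(r"\[&support=(([0-9]|[0-6][0-9])\.[0-9])\]")

# Replacing with an empty string deletes the whole annotation.
cleaned = pattern.sub("", newick)
print(cleaned)
```

Supports of 70.0 and above, including 100.0, do not fit either branch of the alternation, so they stay in the tree.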

How do you delete all the supports from the terminal branches?
[FIND] ([a-zA-Z]_[a-zA-Z0-9]+:([0-9]+\.([0-9]+|[0-9]+e-[0-9]+)))\[\&support=100\.0\]
[REPLACE] \1
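
The same idea sketched in Python. Two things here are assumptions on my side: the replacement keeps group 1 (the name plus the branch length), and the scientific-notation branch is written as digits followed by e- and more digits. The tree string is a toy example.

```python
import re

# Group 1 = taxon name, colon and branch length (plain or like 2.5e-05).
pattern = re.compile(
    r"([a-zA-Z]_[a-zA-Z0-9]+:([0-9]+\.([0-9]+|[0-9]+e-[0-9]+)))"
    r"\[&support=100\.0\]"
)

tree = ("(E_grandiflora:0.0123[&support=100.0],"
        "E_killipii:2.5e-05[&support=100.0])[&support=85.0]:0.1;")

# Keep the name and branch length, drop the annotation that follows them.
cleaned = pattern.sub(r"\1", tree)
print(cleaned)
```

The annotation on the internal node survives because it is not preceded by a taxon name and a branch length.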

Common operations

How do I change the order of elements in my text?
Suppose you have a large dataset or document with hundreds of dates using the American system (Month/Day/Year) and you need to change them to Day/Month/Year. That is very simple:
[FIND] (\d{2})/(\d{2})/(\d{4})
[REPLACE] \2/\1/\3  
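
The same swap in Python, in case the dates live in a file you are already processing with a script. The sample sentence is made up.

```python
import re

text = "Collected 04/24/2017 and 12/01/2016."

# Group 1 = month, group 2 = day, group 3 = year; write them back as day/month/year.
swapped = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\2/\1/\3", text)

print(swapped)
```

Note that the pattern expects two-digit days and months, so a date written as 4/24/2017 would be left unchanged.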

How do I change a string in text?
Suppose you get a list of thousands of species names, some with subspecies, some with varieties and some combining both. You need to isolate just the names of the varieties. After filtering the list to keep only the names containing "var. ", you can do that in two simple steps:
[FIND] ( var\. )

[FIND] ^([a-z ]+)(\t)(.+)
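
The two steps can be sketched in Python as below. The replacement strings are assumptions, since only the search patterns are given: step 1 is assumed to turn " var. " into a tab, and step 2 is assumed to keep only the third group. The search is also assumed to ignore case, so that the capitalized genus matches [a-z ]. The names are just examples.

```python
import re

names = [
    "Espeletia grandiflora var. attenuata",
    "Espeletia argentea var. phaneractis",
]

varieties = []
for name in names:
    # Step 1 (assumed replacement): turn " var. " into a tab.
    step1 = re.sub(r"( var\. )", "\t", name)
    # Step 2 (assumed replacement): keep only what follows the tab.
    step2 = re.sub(r"^([a-z ]+)(\t)(.+)", r"\3", step1, flags=re.IGNORECASE)
    varieties.append(step2)

print(varieties)
```

Names that contain characters outside letters and spaces before the tab (for example the period in "subsp.") would not match step 2 as written, so those need a wider character class.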


Postdoctoral researcher | Botany | National Museum of Natural History | Smithsonian Institution
PO Box 37012, Washington DC 20013 | Phone: (202)633-0951 | Email: espeletias@gmail.com