The author has declared that no competing interests exist.
This article introduces the world of the Python computer language. It is assumed that readers have some previous programming experience in at least one computer language and are familiar with basic concepts such as data types, flow control, and functions.
Python can be used to solve several problems that research laboratories face almost every day. Data manipulation, biological data retrieval and parsing, automation, and simulation of biological problems are some of the tasks that can be performed effectively with computers and a suitable programming language.
The purpose of this tutorial is to provide a bird's-eye view of the Python language, showing the basics of the language and the capabilities it offers. Main data structures and flow control statements are presented. After these basic concepts, topics such as file access, functions, and modules are covered in more detail. Finally, Biopython, a collection of tools for computational molecular biology, is introduced and its use shown with two scripts. For more advanced topics in Python, there are references at the end.
Python is a modern programming language developed in the early 1990s by Guido van Rossum [
The most relevant features of Python for our use are that: it is easy to learn, easy to read, interpreted, and multiplatform (Python programs run on most operating systems); it offers free access to source code; internal and external libraries are available; and it has a supportive Internet community.
Python is an excellent choice as a learning language [
There are also some drawbacks to Python that must be noted. First, execution time is slower than for compiled languages. Second, there are fewer numerical and statistical functions available than in specialized tools like R or MATLAB. (However, Numpy module [
Program functions and reserved words are written in
Python can be run in script mode (like C and Perl), or using its built-in interactive console (like R and Ruby). The interactive console provides command-line editing and command history, although features vary between implementations. In interactive mode, the command prompt consists of three greater-than signs (>>>).
Script mode is a reliable and repeatable approach to running most tasks. Input file names, parameter values, and code version numbers should be included within a script, allowing a task to be repeated. Output can be directed to a log file for storage. The interactive console is used mostly for small tasks and testing.
Python programs can be written using any general-purpose text editor, such as Emacs or Kate. The latter provides syntax highlighting and access to Python's interactive mode through an integrated shell. There are also specialized editors such as PythonWin, Eclipse, and IDLE, the built-in Python text editor.
When running a Python script under a Unix operating system, the first line should start with “#!” plus the path to the Python interpreter, such as “#!/usr/bin/python”, to tell the Unix shell which interpreter to use for the script. Without this line, the script cannot be run directly from the command line as an executable and must be passed to the interpreter instead (for example, “python myprogram.py”).
Numeric Data Types
Computer languages can be characterized by their data structures (or types) and flow control statements. Data structures in Python are diverse and versatile. There are numeric data types that hold “primitive” data (integer, float, Boolean, and complex) and there are “collection” types that can handle several objects at once (string, list, tuple, set, and dictionary). Descriptions and examples of numeric data types are summarized in
Usually enclosed by single quotes (') or double quotes ("). Triple quotes (''' or """) are used to delimit multiline strings. Strings are immutable: once created, they cannot be modified. String methods are available at
For example:
>>> s0='A regular string'
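The immutability described above can be checked directly. The following is a minimal sketch in Python 3 syntax; apart from s0, the variable names are illustrative assumptions and not part of the original text.

```python
# Strings are immutable: item assignment fails, so any "change" builds a new string.
s0 = 'A regular string'

try:
    s0[0] = 'a'                  # attempt an in-place modification
    modified_in_place = True
except TypeError:
    modified_in_place = False    # str objects do not support item assignment

# String methods return new objects and leave the original untouched.
upper = s0.upper()
replaced = s0.replace('regular', 'new')

# Triple quotes delimit a string that spans several lines.
multiline = '''first line
second line'''
line_count = len(multiline.splitlines())

print(modified_in_place, upper, replaced, line_count)
```

Because every method call returns a new string, the name s0 still refers to the original text after the calls above.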
Defined as an ordered collection of objects; a versatile and useful data type. C programmers will find lists similar to vectors. Lists are created by enclosing their comma-separated items in square brackets, and can contain different objects.
For example:
>>> MyList=[3,99,12,"one","five"]
This statement creates a list with five elements (three numbers and two strings) and binds it to the name “MyList”. Each element of the list can be referred to by an integer index enclosed between square brackets. The index starts from 0, therefore MyList[
Also an ordered collection of objects, but tuples, unlike lists, are immutable. They share most methods with lists, but only those that don't change the elements inside the tuple. Attempting to change a tuple raises an exception. Tuples are created by enclosing their comma-separated items between parentheses. Tuples are similar to Pascal records or C structs; they are small collections of related data that are operated on as a group. They are used mostly for encapsulating function arguments, or any data that are tightly coupled.
For example:
>>> MyTuple=(2,3,10)
Tuple operations are available at
An unordered collection of immutable values. It is mostly used for membership testing and removing duplicates from a sequence. Sets are created by passing any sequential object to the set constructor, such as: set([
For more information on sets, please refer to
For example:
>>> ResEzSet1=set(['BamH1', 'HindIII', 'EcoR1', 'SalI'])
>>> ResEzSet2=set(['PlaA', 'EcoR1', 'Eco143'])
>>> ResEzSet1&ResEzSet2
set(['EcoR1'])
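Beyond intersection, sets support union, difference, and membership tests. The sketch below reuses the enzyme names from the example above and assumes Python 3, where sets can be written with braces and the console prints the intersection as {'EcoR1'}. The variable names are illustrative.

```python
# Two sets of restriction enzyme names, written with Python 3 set literals.
res_ez_set1 = {'BamH1', 'HindIII', 'EcoR1', 'SalI'}
res_ez_set2 = {'PlaA', 'EcoR1', 'Eco143'}

common = res_ez_set1 & res_ez_set2       # intersection: in both sets
either = res_ez_set1 | res_ez_set2       # union: in at least one set
only_first = res_ez_set1 - res_ez_set2   # difference: in the first set only

# Membership testing, one of the main uses of a set.
has_sali = 'SalI' in res_ez_set1

# Removing duplicates from a sequence (sorted, because sets have no order).
unique_bases = sorted(set('gattacca'))

print(common, len(either), only_first, has_sali, unique_bases)
```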
A data type that stores unordered one-to-one relationships between keys and values. Unordered in this context means that each key–value pair is stored without any particular order in the dictionary. It is analogous to a hash in Perl or a Hashtable class in Java. Dictionaries are created by placing a comma-separated list of key–value pairs within braces.
For example:
Define Translate as a dictionary with codon triplets as keys and the corresponding amino acids as values:
>>> Translate={"cca":"P","cag":"Q","agg":"R"}
Creating a new entry:
>>> Translate["gat"]="D"
To see what is inside the dictionary:
>>> Translate
{'agg': 'R', 'cag': 'Q', 'gat': 'D', 'cca': 'P'}
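A dictionary like this one can drive a simple translation routine. The following is a minimal sketch, not a complete translator: the function name translate_dna and the use of "X" for codons missing from the table are assumptions made for illustration.

```python
# Codon table from the example above, extended with the "gat" entry.
translate = {"cca": "P", "cag": "Q", "agg": "R"}
translate["gat"] = "D"


def translate_dna(seq, table):
    """Translate seq codon by codon; unknown codons become 'X' (an assumption)."""
    protein = []
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        protein.append(table.get(codon, "X"))
    return "".join(protein)


peptide = translate_dna("ccacaggattt", translate)
print(peptide)
```

The get method returns a default value instead of raising KeyError when a key is absent, which keeps the loop free of explicit membership checks.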
Dictionaries share some methods with lists. A complete list of dictionary methods can be seen at:
Flow control statements control whether program code is executed or not, or executed many times in a loop based on a conditional. Conditional execution (
Tests for a condition and acts upon the result of that condition. If the condition is true, the block of code after the “if condition” will be executed. If it is false, the program will skip that block and will test for the next condition (if any). Several conditions can be tested using
Scheme of an if statement:
if condition1:
    block1
elif condition2:
    block2
else:
    block3
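As a concrete illustration of conditional execution, the sketch below labels a nucleotide sequence by its GC content. The function name classify_gc and the 0.4 and 0.6 thresholds are hypothetical choices made only for this example, and the sequence is assumed to be non-empty.

```python
def classify_gc(seq):
    """Label a non-empty nucleotide sequence by GC fraction (thresholds are illustrative)."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if gc > 0.6:
        label = "GC-rich"      # first condition is true
    elif gc < 0.4:
        label = "AT-rich"      # tested only if the first condition is false
    else:
        label = "balanced"     # runs when no condition above is true
    return label


print(classify_gc("GGCC"), classify_gc("ATAT"), classify_gc("ATGC"))
```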
Iterates over all the members of a sequence of values (as in Perl's “foreach”). It is different from C and VB because there isn't a variable that increments or decrements on each cycle. This sequence could be any type of iterable object like a list, string, tuple, or dictionary. The code inside a for loop will be executed once for each item in the sequence, and at the same time the variable will take the value of each item in the sequence. There could be an optional
The structure of a for loop is:
for variable in sequence:
    block1
else:
    block2
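The sketch below shows a for loop iterating over a list and over a string. Python's for statement also accepts an else clause, which runs only when the loop finishes without reaching a break; the example uses it to report that no stop codon was found. The function names are illustrative assumptions.

```python
def find_first_stop(codons):
    """Return the index of the first stop codon in a list, or None if there is none."""
    stops = ("taa", "tag", "tga")
    for position, codon in enumerate(codons):
        if codon in stops:
            result = position
            break
    else:
        # Reached only when the loop ended without executing break.
        result = None
    return result


def count_bases(seq):
    """Count each character of a string; a string is iterable one character at a time."""
    counts = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return counts


print(find_first_stop(["atg", "cca", "tag"]), count_bases("gatta"))
```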
Executes a block of code as long as a condition is true. As with the for loop, there could be an optional
The general form is:
while condition:
    block1
else:
    block2
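A while loop suits cases where the number of iterations is not known in advance. The sketch below counts how many doublings a cell population needs to reach a target size; the function name and the scenario are hypothetical, chosen only to illustrate the statement.

```python
def doublings_to_reach(start, target):
    """Count doublings needed for a population of size start to reach target."""
    if start <= 0:
        # Guard against an endless loop: doubling zero never reaches the target.
        raise ValueError("start must be positive")
    count = 0
    cells = start
    while cells < target:      # the condition is re-tested before every pass
        cells *= 2
        count += 1
    return count


print(doublings_to_reach(1, 1000))
```

The condition is evaluated before each pass, so the body never runs when the starting value already meets the target.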
Python allows programmers to define their own functions. The
Python's function structure is:
Arguments are passed by object reference and without specifying data types. It is up to the programmer to check data types. When a function is called, the arguments must be supplied in the same order as defined, unless arguments are provided by using keyword–value pairs (
The return statement terminates the execution of the function and returns a single value. To return multiple values, a list or a tuple must be used.
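The points above (positional and keyword arguments, returning several values as a tuple, and arguments passed as references to objects) can be sketched as follows. All function and parameter names are hypothetical, and the sequence is assumed to be non-empty.

```python
def gc_and_length(seq, ignore_case=True):
    """Return a tuple (GC fraction, length) for a non-empty nucleotide sequence."""
    if ignore_case:
        seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq), len(seq)     # two values packed into one tuple


def add_adapter(reads, adapter="AAA"):
    """Append adapter to reads; the caller's list is modified because no copy is made."""
    reads.append(adapter)


# Positional call; the returned tuple is unpacked into two names.
fraction, length = gc_and_length("atgc")

# Keyword-value pairs may be given in any order.
strict_fraction, _ = gc_and_length(ignore_case=False, seq="atgc")

# A mutable argument changed inside the function is changed for the caller too.
my_reads = ["ATG"]
add_adapter(my_reads)

print(fraction, length, strict_fraction, my_reads)
```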
In Python, functions, classes, and constants can be saved in a file, called a “module,” for later use. Modules can be loaded from a program or in interactive mode using the “import” statement, such as:
import ModuleName
where ModuleName is the name of the file without an extension. When a module is imported for the first time, its code is interpreted and executed. Code that should not run on import can be placed inside a conditional block (if __name__ == "__main__":). The __name__ attribute holds the name of the module and equals "__main__" only when the module is run as a standalone program. Successive imports of the same module have no effect.
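A minimal sketch of such a module follows. The file name seqtools.py and the function reverse_complement are hypothetical, chosen only to illustrate how a test on __name__ separates reusable code from code that should run only in standalone mode.

```python
# Contents of a hypothetical module file named seqtools.py


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, in lowercase."""
    pairs = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(pairs[base] for base in reversed(seq.lower()))


if __name__ == "__main__":
    # Runs with "python seqtools.py" but not with "import seqtools".
    print(reverse_complement("atgc"))
```

With this layout, another program could use "import seqtools" and call seqtools.reverse_complement() without triggering the print statement.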
Python provides several modules and there are many more that can be downloaded from the Internet (like SciPy [
An example:
>>> import math
>>> dir(math)
['__doc__', '__file__', '__name__', 'acos', 'asin', 'atan', 'atan2', 'ceil', 'cos', 'cosh', 'degrees', 'e', 'exp', 'fabs', 'floor', 'fmod', 'frexp', 'hypot', 'ldexp', 'log', 'log10', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh']
No error message returned by the interpreter means that the module was successfully imported.
For example:
>>> math.log(2)
0.69314718055994529
To show Python syntax and data structures in action, it is instructive to look at solving a real problem using this language, such as the calculation of the net charge of a protein. Given a protein sequence, this is performed by adding up the charges of each charged amino acid at pH = 7. This calculation gives a rough value because it doesn't consider whether the residues are exposed, partly exposed, buried, or deeply buried. This example shows functions, data types (numbers, strings, and dictionaries) and flow control (
This script defines a function (
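The calculation described above can be sketched as follows. This is a simplified illustration, not the author's script: the charge values in AA_CHARGE are rounded approximations assumed for this example, terminal groups are ignored, and the names net_charge and AA_CHARGE are hypothetical.

```python
# Approximate side-chain charges at pH 7 (rounded, illustrative values).
AA_CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0, "H": 0.1}


def net_charge(aa_seq):
    """Sum the approximate charges of the charged residues in a protein sequence."""
    charge = 0.0
    for aa in aa_seq.upper():
        if aa in AA_CHARGE:            # uncharged residues contribute nothing
            charge += AA_CHARGE[aa]
    return charge


print(net_charge("MKDERH"))
```

The sketch combines a function, a dictionary lookup, a for loop over a string, and an if test, which are the elements the example is meant to show.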
Biopython is a distributed, collaborative effort to develop Python libraries and applications that address the needs of current and future work in bioinformatics [
A full review of Biopython functions would require far more space; therefore this paper shows only a small part of the bigger picture. The first example shows how to parse a BLAST output to extract and report only the required features. Since BLAST is the most commonly used application in bioinformatics, writing a BLAST report parser is a basic exercise in bioinformatics [
The program below extracts the title and sequence from some high-scoring pairs (HSP), but there are many more features to extract from a BLAST output, if needed. Biopython provides the Blast Record class under Bio.Blast.NCBIXML.Record. Internal documentation for this object can be accessed with
For this program, the user has to perform a BLAST search and save the result in XML mode because this format tends to be more stable than HTML or text versions (and hence the Biopython parser should be able to handle it without any problem [
This program, blastparser2.py, takes a BLAST output in XML format and shows the sequence of hits in Chromosome 5 that are longer than 80 base pairs. A file handle named bout with the BLAST output in XML format is created, and then the file is parsed using the
In this example we use a simulated output provided by an external sequencing service. It consists of more than 6,000 directories (one for each clone), and there are three files per directory (a formatted report with a pdf extension, the sequencing machine output with an ab1 extension, and a plain text file with the sequence). This directory structure and its files are available as
The program fromdir2fasta.py scans a directory (mydir) where the output of the sequencing service is downloaded. The names of all the directories under that directory are obtained with the
Reading a text file in Python is a three-step process.
1: Open the file, creating a handle.
handle=
The first parameter is the name or path of the file. The second parameter is the first letter of the open mode, that is, r, w, or a, corresponding to read, write, and append. This function returns a file object (handle).
2: Read the file. There are several methods to gain access to the contents of a file:
handle.
handle.
handle.
For efficient iteration over a file, use “
3: Close the file:
handle.
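The three steps can be sketched with Python's built-in open function and the standard file methods read, readline, and readlines. The sample file and its contents are created in a temporary directory only so that the example is self-contained; the file name seq.txt is hypothetical.

```python
import os
import tempfile

# Create a small sample file so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "seq.txt")
handle = open(path, "w")
handle.write(">seq1\natgc\nggcc\n")
handle.close()

# Step 1: open the file. Step 2: read it. Step 3: close it.
handle = open(path, "r")
first_line = handle.readline()     # one line, including its newline
rest = handle.read()               # everything from the current position onward
handle.close()

handle = open(path, "r")
all_lines = handle.readlines()     # a list with one string per line
handle.close()

# Iterating over the handle yields one line at a time without loading the whole file.
handle = open(path, "r")
stripped = [line.rstrip("\n") for line in handle]
handle.close()

print(first_line, rest, all_lines, stripped)
```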
Writing a file is very similar to reading a file. Steps 1 and 3 are the same as reading a file. The main difference is in step 2, where the file's contents are written with the
handle.write("This text will make it into a text file\n")
There is also a writelines method that writes each member of a list to a file.
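Writing follows the same open, write, close pattern. The sketch below uses a temporary directory and a hypothetical file name, out.txt. Note that writelines does not add line endings, so each string must carry its own newline.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Mode "w" creates the file, or truncates it if it already exists.
handle = open(path, "w")
handle.write("This text will make it into a text file\n")
handle.writelines(["second line\n", "third line\n"])   # no newline is added for you
handle.close()

# Mode "a" appends to the end of the existing file.
handle = open(path, "a")
handle.write("appended line\n")
handle.close()

# Read the file back to check what was written.
handle = open(path, "r")
content = handle.read()
handle.close()

print(content)
```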
Python's capabilities include scientific plotting [
One common problem for non-computer science researchers who start programming is that they usually stick to basic concepts and don't take advantage of many modern tools that are available [
There are many good quality resources for learning Python. Some of these have already been mentioned and a summary of resources is presented in
Resources for Learning Python and Biopython
On the last line of the code the function is called.
The author wishes to thank Virginia C. Gonzalez for her help, Dr. Diego Golombek, the anonymous reviewers for helpful comments, all the Biopython team for their work, and the local Python community (PyAR) for their support.