LINKS:

 

PERL 

 

NCBI

 

BLAST

 

PHRAP 

 

ClustalX PC

ClustalX MAC

 

PERL Scripts

    Note: the scripts are posted with ".txt" file extensions.  Users will need to change these to ".pl" in order to run them in Perl.

 

Discovery pipeline:

 

The following protocol uses examples from tomato, but can be generalized to any species for which a sizable EST database exists.

 

Download expressed sequence tags (ESTs) from the National Center for Biotechnology Information (NCBI) dbEST database.  Note the release date of the database for your records. 

 

To download FASTA format files by variety of origin, you will need to become familiar with the major varieties used to construct the cDNA libraries for EST sequencing.  For example, in the tomato EST database, TA496 and Rio Grande (including Rio Grande PtoR and the progeny of Rio Grande × Moneymaker, R11-12 and R11-13) predominate.

 

Download FASTA format files using the Entrez search and retrieval system for nucleotide data with phrase searching (e.g., Lycopersicon esculentum [ORGN] AND EST AND TA496).  Save the files to a local computer by directing the text file returned by the cgi query to an appropriate folder.

 

Scripts written in Perl (version 5.6.0) are used to facilitate the manipulation and analysis of the FASTA sequence files.

 

To index the FASTA sequences, the EST entries downloaded from the NCBI website are used as input.  The description line of each entry is searched for the string "ESTxxxx" (where xxxx is the GenBank EST number); only "ESTxxxx" is retained as the entry name, and a user-supplied extension (TA496 or RioG) is appended to give entry names of the form "ESTxxxx.Extension".  Each EST sequence is therefore indexed to the NCBI database by ESTxxxx and to a variety by the assigned Extension.
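
The indexing step can be sketched as follows. This is a minimal illustration in Python rather than the Perl used for the posted scripts, and it is not the authors' code. The function name `index_fasta` and the sample header are hypothetical, and the sketch assumes the EST number appears in the description line as "EST" followed by digits.

```python
import re

# Assumption: the GenBank EST number appears as "EST" followed by digits.
EST_ID = re.compile(r"EST\d+")


def index_fasta(lines, extension):
    """Yield FASTA lines with each header reduced to '>ESTxxxx.<extension>'.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = EST_ID.search(line)
            if match is None:
                raise ValueError(f"no EST identifier in header: {line!r}")
            yield f">{match.group(0)}.{extension}"
        else:
            yield line


if __name__ == "__main__":
    # Made-up header, for illustration only.
    sample = [">gi|0000|gb|XX000000.1| EST12345 tomato cDNA clone", "ACGTACGT"]
    for out_line in index_fasta(sample, "TA496"):
        print(out_line)
```

Raising an error on a header without an EST number, rather than silently skipping it, keeps unindexed entries from slipping into the assembly step.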

 

ESTs of each variety are assembled into a unique gene (unigene) contiguous sequence (contig) set using Phrap run on a workstation in the Linux operating environment. 

 

The Phrap output file is reduced to a file containing only contigs with three or more ESTs.  The EST names are then re-associated with the correct sequence data to form a FASTA format sequence file in which each contig number (Contigxxx) is followed by three sequences, each named "ESTxxxx.Extension".
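
A minimal Python sketch of this filtering step is given below. It assumes the contig membership has already been read from the Phrap output into a dictionary (parsing the Phrap file itself is not shown), and it assumes that the first three ESTs listed for a contig are the ones kept, since the protocol does not say how the three are chosen. The name `select_contigs` and its parameters are hypothetical.

```python
def select_contigs(contig_members, sequences, min_ests=3, keep=3):
    """Return {contig: [(est_name, sequence), ...]} for well-supported contigs.

    contig_members: dict mapping a contig name to its list of EST names
    sequences:      dict mapping an EST name to its sequence string
    Contigs with fewer than `min_ests` members are dropped; for the rest,
    the first `keep` ESTs (an assumption) are paired with their sequences.
    """
    selected = {}
    for contig, est_names in contig_members.items():
        if len(est_names) < min_ests:
            continue
        selected[contig] = [(name, sequences[name]) for name in est_names[:keep]]
    return selected


if __name__ == "__main__":
    members = {
        "Contig1": ["EST1.RioG", "EST2.RioG"],
        "Contig2": ["EST3.RioG", "EST4.RioG", "EST5.RioG", "EST6.RioG"],
    }
    seqs = {f"EST{i}.RioG": "ACGT" * i for i in range(1, 7)}
    for contig, records in select_contigs(members, seqs).items():
        print(contig)
        for name, seq in records:
            print(f">{name}\n{seq}")
```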

 

Next, a single sequence from each contig is chosen and searched against the TA496 EST database using the Basic Local Alignment Search Tool (BLAST), run on a local workstation in the Linux environment.  Three EST sequences from the TA496 data set are then selected for each contig by a program that takes the BLAST output file as its input.

 

The top three hits in the BLAST output file are extracted and stored in a single file.
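
The extraction of the top three hits could look like the Python sketch below. The protocol does not state which BLAST output format the extractor reads; this sketch assumes tab-delimited tabular output (for example `-m 8` in legacy blastall or `-outfmt 6` in BLAST+), in which the first two columns are the query and subject identifiers and hits for each query are listed best first. The function name `top_hits` is hypothetical.

```python
from collections import defaultdict


def top_hits(blast_lines, n=3):
    """Return {query_id: [subject_id, ...]} with at most n subjects per query.

    Assumes tabular BLAST output with hits listed best first. A subject that
    appears in several alignments for the same query is counted only once.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if subject not in hits[query] and len(hits[query]) < n:
            hits[query].append(subject)
    return dict(hits)


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    tail = "\t99.0\t400\t4\t0\t1\t400\t1\t400\t1e-100\t700"
    rows = [f"EST1.RioG\tEST{i}.TA496{tail}" for i in (10, 10, 11, 12, 13)]
    print(top_hits(rows))
```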

 

The three TA496 sequences from the BLAST extractor output file are then combined with three EST sequences from the Rio Grande contig data set to create a data set with three EST sequences of Rio Grande (or related pedigrees) and three EST sequences of TA496.

 

The resulting six EST sequences are aligned using the sequence alignment program ClustalX (1.8) to identify possible SNPs.
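
Once the alignment exists, candidate positions can also be screened programmatically. The Python sketch below is one possible criterion, not the procedure described in this protocol: a column is reported only when all sequences of one variety share one base, all sequences of the other variety share a different base, and neither is a gap or ambiguity code. It assumes the aligned sequences have already been loaded into a dictionary keyed by "ESTxxxx.Extension" names; reading the ClustalX output file is not shown, and the name `candidate_snps` is hypothetical.

```python
def candidate_snps(aligned):
    """Return [(position, {variety: base, ...}), ...] for candidate SNP columns.

    aligned: dict mapping 'ESTxxxx.Extension' names to aligned sequences of
    equal length ('-' marks a gap). Positions are 1-based alignment columns.
    """
    # Group sequences by the variety extension after the last dot.
    groups = {}
    for name, seq in aligned.items():
        groups.setdefault(name.rsplit(".", 1)[-1], []).append(seq.upper())
    if len(groups) != 2:
        raise ValueError("expected sequences from exactly two varieties")
    (label_a, seqs_a), (label_b, seqs_b) = groups.items()

    length = len(seqs_a[0])
    if any(len(s) != length for s in seqs_a + seqs_b):
        raise ValueError("aligned sequences must all have the same length")

    snps = []
    for i in range(length):
        col_a = {s[i] for s in seqs_a}
        col_b = {s[i] for s in seqs_b}
        # Uniform within each variety, different between varieties.
        if len(col_a) == 1 and len(col_b) == 1 and col_a != col_b:
            base_a, base_b = col_a.pop(), col_b.pop()
            if base_a in "ACGT" and base_b in "ACGT":
                snps.append((i + 1, {label_a: base_a, label_b: base_b}))
    return snps


if __name__ == "__main__":
    # Made-up alignment, for illustration only.
    example = {
        "EST1.TA496": "ACGTACGT-A",
        "EST2.TA496": "ACGTACGT-A",
        "EST3.TA496": "ACGTACGTTA",
        "EST4.RioG": "ACCTACGTTA",
        "EST5.RioG": "ACCTACGTTA",
        "EST6.RioG": "ACCTACGTTC",
    }
    print(candidate_snps(example))
```

Requiring agreement among all three sequences of a variety is a conservative choice: a difference seen in only one EST is more likely to be a sequencing error than a true polymorphism, so such columns are not reported.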