The PARIGA Server for Real Time Filtering and Analysis of Reciprocal BLAST Results

BLAST-based similarity searches are commonly used in several applications involving both nucleotide and protein sequences. These applications span from simple tasks such as mapping sequences over a database to more complex procedures as clustering or annotation processes. When the amount of analysed data increases, manual inspection of BLAST results become a tedious procedure. Tools for parsing or filtering BLAST results for different purposes are then required. We describe here PARIGA (http://resources.bioinformatica.crs4.it/pariga/), a server that enables users to perform all-against-all BLAST searches on two sets of sequences selected by the user. Moreover, since it stores the two BLAST output in a python-serialized-objects database, results can be filtered according to several parameters in real-time fashion, without re-running the process and avoiding additional programming efforts. Results can be interrogated by the user using logical operations, for example to retrieve cases where two queries match same targets, or when sequences from the two datasets are reciprocal best hits, or when a query matches a target in multiple regions. The Pariga web server is designed to be a helpful tool for managing the results of sequence similarity searches. The design and implementation of the server renders all operations very fast and easy to use.


BLAST Parameters
The home page presents a basic interface that asks for the dataset type (Protein/Nucleic Acid) and automatically selects the corresponding BLAST program (blastp/blastn).
By clicking on "Show Options" several menus will appear where the user can modify the default values of the most common BLAST parameters. Clicking on "Hide Options" will restore the initial screen.

Input Data
According to the options selected in the previous section, the system will ask the user to upload a protein dataset or a DNA/RNA dataset.
The two datasets can be uploaded as files or directly pasted in the input form. The two options are available by clicking the "change input type" button. If the second dataset is not provided, the software will perform a self-blast on the first dataset.

Test Datasets
Some example datasets are available in the "test datasets" section. Test datasets (nucleotide or amino acid sequences) can be selected via the scroll menu, and loaded into the input form by clicking on the "Load" button.

Run
The job is submitted by clicking the "run" button at the bottom of the page, and a new form with the jobID will appear. The jobID can be used to retrieve the results later.

Results
When a job is completed, a page with two tables will appear with the results of the Blast searches of dataset 1 vs 2 and dataset 2 vs 1.
The header of each table contains two menus (Query number and Query title) to jump to a particular result ("a"). Alternatively the "prev-next" buttons allow the user to scroll over the results ("b"). The "open filter" button ("c") opens a form where the user can select the minimum, maximum or both values for the desired parameters. By clicking the "filter" button, only matches satisfying the filters will be displayed.
The sequence alignment of the individual hits can be visualized by clicking on the hit name. 4 Selecting one the four icons available at the top of each table, a pop-up window with additional information will be displayed: Graphical report of the distribution of the Blast hits on the query sequence, where hits are colored according to the alignment score.
Export of the current table.
List of the sequences producing significant alignments, with score and E-value.
Blast statistics for the current search.

Logical operations
Pariga is able to perform logical operations with the results of the two Blast searches. The tools can be accessed by clicking the "search tools" link at the top of the result page.
This section enables the user to apply three different search criteria: -Common: Select two or more sequences in the first data set to find out whether they match the same sequence(s) in the second dataset -Cross: Select one sequence in the first data set to find if there is a reciprocal match with sequence(s) in the second dataset -Multiple: Select one sequence in the first data set to find out whether they match more than one region in the same sequence(s) in the second data set An overall summary of the results obtained with the different queries is available in the Summary Results section: the Common summary table contains a list of sequences with common hits; the Cross summary table contains the results of a cross search for each sequence; the Multiple summary table contains a list of sequences with multiple hits on the same sequence.
By using the options available in the Selection Tools section it is possible to quickly select/unselect a set of sequences or filter for sequences that match more than one region on the same entry (show multiple). 6 The reference dataset for the summary results and the individual searches can be selected using the Reference Dataset buttons. To perform a new search and discard the results of the previous one, the reset button has to be clicked. A contextual help ( ) is associated to each button. The two-columns table contains the names of the sequences in the two datasets that can be selected for the different searches.

Common
The Common search checks whether two (or more) sequences in the reference dataset (identified by the checkbox in the table header, "reference dataset") share common results among the matched sequence in the other dataset. By indicating the number corresponding to the sequences or selecting them via the checkboxes and clicking the "common" button, a table with results will appear. This table will indicate the name of the common matched sequence in the header, and the result parameters for each of the selected sequences. If more than one sequence in the dataset satisfies the logical query, the "next-prev" buttons can be used to scroll through the results. The "open filter" button allows the user to filter the results as previously described. The alignment will appear in a pop-up window by clicking on the sequence name.

Cross
The Cross search allows checking whether the selected sequence in the reference dataset is reciprocally matched by a sequence in the other dataset. Sequence selection is carried out as described in the previous section. Clicking the "cross" button will display two tables showing the reciprocal blast results. The "prev-next" pairs of buttons will allow scrolling over the results. The alignment will appear in a pop-up window by clicking on the sequence name.

Multiple
The Multiple search will check if the selected sequence in the reference dataset matches sequence(s) in the other dataset in more than one region. Once selected the reference dataset and the desired sequence, clicking the "multiple" button will show a table with the results, if any. The "prev-next" pairs of button will allow scrolling over the results. The alignment will appear as popup window by clicking on the sequence name.

Case study 1 -Identification of miRNA targets
Recent studies have shown that, in some forms of tumors, several genes elude the miRNA-based repression by a mechanism based on the alternative splicing of the polyadenylation signal [1]. In practice, two (or more) transcript splicing isoforms differ for the length of their 3'UTR only, one with the "canonical" length, the other(s) with a shortened variant downstream a secondary polyadenilation signal. While the former contains multiple sites for miRNA pairing the latter includes just one (or fewer) miRNA pairing site(s). The second isoform can then elude the miRNA driven repression.
We decided to investigate if targets of the miRNA family let-7 (predicted by TarBase [2]) contain multiple potential miRNA pairing sites across different polyadenylation sites in their 3' sequences.
We show here how we took advantage of PARIGA to investigate the problem.

Dataset 1: mature sequences of miRNA let-7 family, 17 sequences (downloaded from miRBase [3])
Dataset 2: 3'UTR sequences of transcripts of top 20 predicted target genes (according to TarBase), downloaded through Ensembl web sites [4] and processed using an ad-hoc python script to generate all possible variants terminating the original sequence at every polyadenylation signal (AAUAAA as indicated by [5]). This resulted in176 variants.

Results
As shown in the header of the first table, only 9 miRNAs had a match. The first is the hsa-let-7a-5p. Due to the 8 By performing the reciprocal analysis, only 37 sequences out of 176 had a match in the dataset 1. For example, in the first result, the ENST00000315927_var1, had a perfect match (100%) with 7 miRNAs, three of them with an aligned region of 12 nucleotides and the others with 11 aligned nucleotides. All of them are good candidates for mRNA repression since the perfectly aligned region maps the mirna's 5' [3] as can be noted in the hstart-hstop columns.
The results can be scrolled by clicking the "prev-next" button.
As described before, alignments can be retrieved by clicking on the sequence names.
Results can be further explored using the "search tools". 9 For example, we can investigate whether hsa-let-7a-5p and hsa-let-7b-5p share a common target. First, we select the reference database, then the two miRNAs by clicking on the checkboxes near their names. By clicking the "common" button a table appears showing that the sequence ENST00000315927_var1 is matched by the two selected miRNAs at the same potential binding site (positions 25-36, columns hstart, hstop in the table).
Using the ""prev-next" buttons we can browse the other results: We could also investigate, for a given mirna (es hsa-let-7c), which sequence is reciprocally blasted. First of all we have to select the reference database by selecting 1 in the Reference Dataset section, and the miRNAs by clicking on the checkboxes near their names. Then, by clicking the "cross" button, two tables will appear showing the results for the two blast searches, and highlighting only the two involved sequence. As an example in the pair considered here, the blast search of the hsalet-7c against the 3' dataset shows that the ENST00000315927_var5 has 8 matching regions of twelve nucleotides each (qstart,qstop columns in the first table) and viceversa. The ranking in the relative blast searches is indicated in the column "position" while the numbers in the parenthesis in the headers of the tables indicate the position of that sequence in the original dataset.

11
The results of a similar search for the mirna hsa-let-7d-5p is shown in the following figure, where the scroll menu to navigate among results is highlighted.
Finally, a further application could be the identification of UTR sequences which are matched by a given miRNA (the hsa-let-7d-5p in the figure) more than once. As usual, we have to select the reference database, then the desired miRNA, followed by clicking on "multiple" button. In this example, nine UTR sequences are matched by the selected miRNA more than once: in the following figure, ENST00000315927_var3 is matched by the hsa-let-7d-5p miRNA at the positions 25-36 and 2886-2897 (columns hstart,hstop). The scroll menu can be used to select other results. 13

Case study 2 -Identification of Pfam domains
Let us assume that we have several protein sequences for which no functional experimental evidences is known, or that are derived from in silico studies. We can investigate whether any of these peptides shows sequence similarity with annotated DNA topoisomerase I protein sequences.
On the contrary, the two splicing variant ENSP00000307928 and ENSP00000454207 share common results. It is noteworthy that both proteins belong to the gene ENSG00000174450, which has four splicing variants in the dataset.
Further, these two peptides have multiple matches against different sequences in the PF02919 seed dataset. As usual they can be showed by the scroll menu. Limits of the regions are listed in the qstart-qstop columns.
If we would like to know whether, for a given sequence, there are reciprocal hit matches we can select the sequence and then click the "cross" button.
In the case shown in the figure, the two reciprocal hit sequences are the ENSP00000307928 protein and the A7TJW1 putative topoisomerase from Vanderwaltozyma polyspora.
It can be noted that while the protein matches the domain sequence just once (Table 1), the domain sequence matches the protein in different regions (Table 2) with different similarity and length of the alignment. 17 Also in this case, when no sequences are cross matched a message will appear.
As we noted before the ENSP00000307928 protein is matched by the A7TJW1 sequence domain more than once. If we would like to further investigate this aspect we can use the "multiple" option. First, we click on the sequence name and then on the "multiple" button: a table with the regions matched by the query sequence will be shown in the result table.
As mentioned before, these two sequences have multiple matching regions. The column position indicated where these results rank in the blast searches, while the columns qstart,qstart show where the alignments occur. Also in this case, if no multiple sequences occur, a message will be displayed.