High-Throughput Tabular Data Processor – Platform independent graphical tool for processing large data sets

High-throughput technologies generate considerable amounts of data that often require bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform-independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expressions, or command line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks including microarray and massively parallel sequencing, i.e. identification of disease-predisposing variants in next generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merging, reduction and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Its flexibility in input file handling provides long-term potential functionality in high-throughput analysis pipelines, as the program is not limited to currently existing applications and data formats. HTDP is available as Open Source software (https://github.com/pmadanecki/htdp).

Selecting unique rows within each sample (column "SAMPLE") on the basis of two columns (PROTEIN and REGION)*.
*The resulting data set is identical to selecting unique rows on the basis of three columns: "SAMPLE", "PROTEIN" and "REGION".
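The same unique-row selection can be reproduced on the command line with awk. This is a hypothetical sketch, not an HTDP feature; the demo file and its values are made up, and column positions are resolved from the header so the script is not tied to a fixed column order.

```shell
# Made-up tab-delimited demo input standing in for the real data set.
printf 'SAMPLE\tPROTEIN\tREGION\nAP-1\tTP53\tex1\nAP-1\tTP53\tex1\nAP-1\tTP53\tex2\nAP-2\tTP53\tex1\n' > demo.txt

# Keep only the first row for every SAMPLE + PROTEIN + REGION combination;
# the header line is used to map column names to their indices.
awk -F '\t' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
{
    key = $col["SAMPLE"] SUBSEP $col["PROTEIN"] SUBSEP $col["REGION"]
    if (!(key in seen)) { seen[key] = 1; print }   # first occurrence only
}' demo.txt > unique_rows.txt
```

As the footnote above notes, deduplicating on the three-column key is equivalent to selecting unique PROTEIN/REGION pairs within each sample.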

Note:
Unification of the sample names in both files is required before joining (column file_name##).

(menu: Queue > Add query > Simple filter > input data preprocessing)
The sample names should follow the scheme AP-[number], e.g. AP-3. Any other substrings should be removed, so e.g. the string "SeattleSeqAnnotation134.AP-1_aln_n30_N100_0x0004_filtered_sorted.bam.vcf.245647667364.txt" should be reduced to "AP-1". The operation may be accomplished using the cutting or replacing options in the "Simple filter" tab of the "Add query" window ("input data preprocessing" form). A similar task should be done for the columns with chromosome numbers (in both files, the column should contain only chromosome numbers without the "chr" prefix).

Add a column with genetic conservation to the results of the previous steps, using files downloaded from the UCSC Genome Browser and modified according to the description in the note. The procedure described below should be repeated for the two results files (st1_pilot_report.txt and st1_strict_report.txt):
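Outside of HTDP, both preprocessing steps above could be sketched with sed. This is a hypothetical command-line equivalent, not part of the described workflow; the echoed strings are the sample values from the text.

```shell
# Reduce a full file-derived sample name to its AP-[number] core.
echo 'SeattleSeqAnnotation134.AP-1_aln_n30_N100_0x0004_filtered_sorted.bam.vcf.245647667364.txt' \
  | sed -E 's/.*(AP-[0-9]+).*/\1/'    # prints: AP-1

# Strip the "chr" prefix from a chromosome label.
echo 'chr17' | sed 's/^chr//'         # prints: 17
```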

Open file Example_4/input_data/Supporting_Table_S1.csv
Remove the numbers "1" or "1,2" at the end of strings in column patient, so that the cell content is e.g. 924T instead of 924T1,2
Select only rows containing the string "608H" in column patient (button: Filter)
Save result file 608h_608h.txt
Select only rows containing the string "608T" in column patient (button: Filter)
Save result file 608h_608t.txt
Select only rows containing the string "608M" in column patient (button: Filter)
Save result file 608h_608m.txt
Select only rows containing the string "null" in column patient (button: Filter)
Save result file 608h_null.txt

Note:
Unification of the sample names in the st1.txt file is required (column vcf_sample_name).
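The repeated Filter/Save steps above have a straightforward grep equivalent. This is a hypothetical sketch; the demo input is made up, while the output file names follow the steps above.

```shell
# Made-up stand-in for Supporting_Table_S1.csv (tab-delimited for brevity).
printf 'patient\tgene\n608H\tBRCA1\n608T\tBRCA2\n608M\tTP53\nnull\tEGFR\n' > demo_patients.txt

# For each patient label, keep the header plus every row containing the label.
for label in 608H 608T 608M null; do
    lc=$(printf '%s' "$label" | tr 'A-Z' 'a-z')
    { head -n 1 demo_patients.txt
      grep "$label" demo_patients.txt
    } > "608h_${lc}.txt"
done
```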

(menu: Queue > Add query > Simple filter > input data preprocessing)
The sample names should follow the scheme [number]H, e.g. 924H. Any other substrings should be removed, so e.g. the string "galaxy41-[ap924p_sam-to-bam_on_data_23__converted_bam]" should first be reduced to "924p", and then the letter "p" in the resulting sample name should be replaced by the letter "H", giving the string "924H". The operation may be accomplished using the replacing options in the "Simple filter" tab of the "Add query" window ("input data preprocessing" form).

There are many command line tools, software packages and programming languages that provide alternative ways to perform complex operations on text files with the same results. Many of them are native to Linux/UNIX systems. The table below briefly presents a choice of the most obvious methods to achieve outcomes analogous to the results delivered by HTDP in the examples described in the paper (S2 file) (https://osf.io/pw2dx/). The file names used are real and can be found in the "input data" or "results" subfolders of the relevant example. Despite the availability of many ready-to-use tools, some stages of data processing are difficult to achieve using relatively short commands; in such cases writing specific scripts is necessary. All presented examples print results to the standard output, which may be redirected to a file by adding the string '> output_file_name.txt' at the end of the command.

This task may be carried out using many programming languages (a bash script, Perl, PHP, or an SQL database, depending on the amount of data). The script should build an array of proteins from column "PROTEIN" and samples from column "SAMPLE", count the percentage of presence of each protein in each sample, select only the proteins meeting the criteria, and then select only the rows containing the names of the selected proteins.
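Such a script could be sketched in awk in two passes over the same file. This is a hypothetical illustration under one reading of the criterion (the fraction of samples in which a protein occurs); the 60% cut-off and the demo data are assumed, not taken from the paper.

```shell
# Made-up demo input: protein A occurs in both samples, B and C in one each.
printf 'SAMPLE\tPROTEIN\nS1\tA\nS1\tB\nS2\tA\nS2\tC\n' > demo_prot.txt

awk -F '\t' -v threshold=60 '
NR == FNR {                        # pass 1: tally distinct sample/protein pairs
    if (FNR == 1) next
    samples[$1] = 1
    if (!(($1 SUBSEP $2) in hit)) { hit[$1, $2] = 1; count[$2]++ }
    next
}
FNR == 1 {                         # pass 2 start: pick proteins over threshold
    n = 0; for (s in samples) n++
    for (p in count)
        if (100 * count[p] / n >= threshold) keep[p] = 1
    print; next
}
$2 in keep                         # print only rows of the selected proteins
' demo_prot.txt demo_prot.txt > selected.txt
```

With the demo data, only protein A (present in 100% of samples) passes the assumed 60% cut-off, so the output keeps the header and the two rows naming A.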
1. Horizontal joining of the files on the basis of one column (column PROTEIN), with and without filtering

join --header -a 1 -1 2 -2 1 <(sort -k 2,2 example_input_file2a.txt) <(sort -k 1,1 example_input_file2b.txt)
join --header -1 2 -2 1 <(sort -k 2,2 example_input_file2a.txt) <(sort -k 1,1 example_input_file2b.txt)

join: a bash command that writes to the standard output a line for each pair of input lines that have identical join fields.