INTRODUCTION

The Perl script targetscan_50_context_scores.pl calculates context scores for a set of miRNA targets predicted by targetscan_50.pl. 
The TargetScan 5.0 context score code produces essentially the same output as displayed at TargetScan.org Release 5.0.

The script takes three input files
	1) a miRNA file: a tab-delimited text file of mature miRNA information.  This is different from the file required by targetscan_50.pl.
	2) a UTR file: a tab-delimited multiple sequence alignment of the 3' UTRs of genes from the desired species
		which is the same as the input file for targetscan_50.pl.
	3) a PredictedTargets file: output from targetscan_50.pl.

In this directory we are providing samples of all of the above files.

The sample files are: miR_for_context_scores.txt, UTR_sequences_sample.txt and targetscan_50_output.txt
If you wish to generate these files from the complete data (linked from the table on the Data Download page), 
	run the commands shown below to convert them to the correct format.


FILE FORMATS

The format of the input files is important for the script to work correctly. 

Each line of the miRNA mature sequence file (ex: miR_for_context_scores.txt) consists of 4 tab separated entries:
1) miRNA family ID: Name of the miRNA family
2) Species ID of this miRNA family (which should match species IDs in the UTR and predicted targets input files)
3) MiRBase ID: name of a mature miRNA sequence
4) Mature sequence: sequence of mature miRNA

To generate a file in this format from the complete "miR Family" file (miR_Family_Info.txt from the table 
on the Data Download page), run this command:
cut -f1,3,4,5 miR_Family_Info.txt > miR_for_context_scores.txt

Each line of the UTR alignment file (ex: UTR_sequences_sample.txt) consists of 3 tab separated entries
1) Gene symbol or transcript ID
2) Species ID (which should match species IDs in miRNA input file) 
3) Sequence 

To generate a file in this format from the complete "UTR Sequences" file (UTR_Sequences.txt from the table 
on the Data Download page), run this command:
cut -f1,4,5 UTR_Sequences.txt > UTR_sequences_sample.txt

Each line of the predicted targets file (ex: targetscan_50_output.txt) consists of 13 tab separated entries
(although not all fields are required)
1)  GeneID - name/ID of gene (from UTR input file)
2)  miRNA family_ID - name/ID of miRNA family (from miRNA input file)
3)  species ID - name/ID of species (from UTR input file)
4)  MSA start - starting position of site in aligned UTR (counting gaps) 
5)  MSA end - ending position of site in aligned UTR (counting gaps) 
6)  UTR start - starting position of site in UTR (not counting gaps) 
7)  UTR end - ending position of site in UTR (not counting gaps) 
8)  Group ID - ID (number) of site(s) (same gene, same miRNA) that overlap 
9)  Site type - type of site in this species (1a [7mer-1a; type 1], m8 [7mer-m8; type 2], or 8mer [type 3])
10) miRNA in this species - if "x", then this miRNA has been annotated in this species
11) Group type - type of this group of sites; if 'Site_type' in a 'Group_ID' is heterogeneous, "weakest" type of the group is used
12) Species in this group - list of species names/IDs in which this site is found
13) Species in this group with this site type - for hetergeneous groups only


EXECUTION

The script can be executed in 3 different ways:
1) Running the script without any arguments (./targetscan_50_context_scores.pl) will print out a help screen.
1) Running the script without the '-h' flag (./targetscan_50_context_scores.pl -h) will print out a formats of input files.
2) Running the script with input filenames and output file will perform the analysis. Ex:
	./targetscan_50_context_scores.pl miR_for_context_scores.txt UTR_Sequences_sample.txt targetscan_50_output.txt targetscan_50_context_scores_output..txt

OUTPUT FILES

In this folder is a sample output file called "targetscan_50_context_scores_output.txt".
The output file also contain several tab separated entries per line.

The sample output file has a headers that names each column:
1)  Gene ID - name/ID of gene (from UTR input file)
2)  Species ID - name/ID of species (from UTR input file)
3)  Mirbase ID - name of a mature miRNA sequence
4)  Site Type - type of site in this species (1a [7mer-1a; type 1], m8 [7mer-m8; type 2], or 8mer [type 3])
5)  UTR start - starting position of site in UTR (not counting gaps) 
6)  UTR end - ending position of site in UTR (not counting gaps) 
7)  3' pairing contribution - one component of context score (see Grimson et al., 2007 for details)
8)  local AU contribution - one component of context score (see Grimson et al., 2007 for details)
9)  position contribution - one component of context score (see Grimson et al., 2007 for details)
10) context score - context score, the sum of the above three contributions and site-type contribution
	(-0.310 for 8mer, -0.161 for 7mer-m8, -0.099 for 7mer-1A)
11) context score percentile rank - percentage of sites for this miRNA with a less favorable context score
12) UTR region - subsequence of UTR used to show predicted consequential pairing
13) UTR-miRNA pairing - predicted consequential pairing of target region and miRNA; complementary bases indicated by bars
14) mature miRNA sequence - mature sequence of this Mirbase ID
15) miRNA family - name/ID of miRNA family
16) Group # - ID (number) of site(s) (same gene, same miRNA) that overlap 

NOTES

This script was designed on a Linux platform. While running this script on Windows or Mac platforms, make sure to call the native perl binary. 
This can be done explicitly by executing it as 'perl targetscan_50_context_scores.pl' or changing the first line of the script to point to the native binary. 


QUESTIONS/SUGGESTIONS:

Please direct all correpondence to wibr-bioinformatics@wi.mit.edu

