TargetScan 6 Frequently Asked Questions (FAQs)

Web site

What are the definitions of "conserved miRNA families", "nonconserved miRNA families", "conserved miRNA sites", and "nonconserved miRNA sites"?

For miRNA families in TargetScan 6 (Human and Mouse), conservation cutoffs are as in Friedman et al. (2009):
- broadly conserved = conserved across most vertebrates, usually to zebrafish (Supplemental Table 1 of Friedman et al.)
- conserved = conserved across most mammals, but usually not beyond placental mammals (Supplemental Tables 2 & 3 of Friedman et al.)
- poorly conserved = all others
For miRNA sites in TargetScan 5 and 6 (Human and Mouse), site conservation is defined by conserved branch length, with each site type having a different threshold for conservation:
- 8mer >= 0.8
- 7mer-m8 >= 1.3
- 7mer-1A >= 1.6
For TargetScanFly 5 and 6, miRNA families conserved beyond the Sophophora subgenus are classified as conserved, and sites with branch lengths of at least 3.16 (60% of the total branch length) are classified as conserved.
For TargetScanWorm 5 and 6, miRNA families present in C. elegans and C. briggsae are classified as conserved, and sites present in all three species are classified as conserved.
Earlier versions of TargetScan, like Release 4, used simpler definitions of miRNA families and sites such as
- highly conserved = conserved across human, mouse, rat, dog, and chicken
- conserved = conserved across human, mouse, rat and dog
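
The branch-length cutoffs listed above for TargetScan 5 and 6 (Human and Mouse) can be applied programmatically. The following is a minimal sketch in Python; the function and dictionary names are illustrative and are not part of the TargetScan code.

```python
# Minimum conserved branch length per site type, as listed above
# for TargetScan 5 and 6 (Human and Mouse).
BRANCH_LENGTH_CUTOFFS = {
    "8mer": 0.8,
    "7mer-m8": 1.3,
    "7mer-1A": 1.6,
}


def is_conserved_site(site_type: str, branch_length: float) -> bool:
    """Return True if the site's branch length meets the cutoff for its type."""
    return branch_length >= BRANCH_LENGTH_CUTOFFS[site_type]


# The same branch length can be "conserved" for one site type but not another.
print(is_conserved_site("8mer", 1.0))     # True
print(is_conserved_site("7mer-1A", 1.0))  # False
```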

What do you mean by "Representative miRNA", "Aggregate P_CT", or some other TargetScan term?

Try clicking on the term. In these cases the terms appear as table headings on the web site and link to pop-up windows with descriptions of terms like Representative miRNA or Aggregate P_CT.

If a gene has multiple transcripts, how can I know which one was used for target prediction?

For TargetScanHuman and Mouse 6, the annotated 3' UTR of each transcript of a gene was used for target prediction. The transcript ID (NM_*) corresponding to the UTR annotation appears above the blue bar that represents the UTR in the top image on each gene-centric page.
For TargetScan 5 and earlier, we selected the transcript with the longest 3' UTR, after removing any regions that overlap the coding region of another RefSeq transcript. The NM_* ID of the transcript (and its length) is shown in small text near the top of the gene page, just above the blue bar representing the gene.
For TargetScanWorm 5.2, we selected 3' UTRs, often more than one per gene, determined using the methods described in Jan et al., 2011.

I study mouse genetics. Should I use mouse predictions from TargetScanHuman or TargetScanMouse? How do they differ?

The difference between TargetScanHuman and TargetScanMouse is the origin of the 3' UTR annotations we use for miRNA target prediction.
For TargetScanHuman, we start with the longest annotated 3' UTR of the human gene. Once we have the human UTR, we select the UTRs of mouse (and other species) by orthology, as defined by the region of the mouse (or other) genome that aligns to the human UTR using multiz alignments from UCSC Genome Bioinformatics.
For TargetScanMouse, we start with the longest annotated 3' UTR of the mouse gene and select other species' UTRs in a similar manner.
In some cases, the UTR lengths differ between TargetScanHuman and TargetScanMouse, which could indicate, among other possibilities, that there are true species-specific differences or that one gene's annotated UTR may be incorrect.
For many genes, the human UTRs are better annotated than are the mouse UTRs. For these genes, the mouse predictions from TargetScanHuman will be more accurate. If a gene is important to you, you may want to look at the EST data to decide which annotation is more likely to be correct.

Sometimes the 3' UTRs of my favorite genes don't match those I see in databases like NCBI RefSeq. As an example, mouse CD47 has a UTR length of about 4000 in TargetScanHuman, while NCBI RefSeq indicates a 3' UTR of about 800. Also, the mouse sequence in TargetScanHuman (NM_001025079) looks like it's actually a human sequence. Which annotation can I trust?

The answer to the previous question may help address some of these issues. In addition:
- Regardless of the species selected in TargetScan, we show the NM_* RefSeq accession for the reference species (human in TargetScanHuman, etc.). We don't make an attempt to link to RefSeq accessions in other species.
- The RefSeq annotation may have been updated since we last ran the TargetScan pipeline.

How can I best predict miRNA site effectiveness? Should I look at site conservation or context score or something else?

For sites matching highly conserved miRNA families, there are two complementary choices, looking at either preferential conservation of the sites (Aggregate P_CT) or the context of the sites within the UTR (total context score).
- Aggregate P_CT has the advantage of identifying targeting interactions that are not only more likely to be effective but also those that are more likely to be consequential for the animal.
- Context scores have the advantage of being predictive for all types of interactions, including those of miRNAs that are not highly conserved.
- Another obvious consideration is whether or not both the miRNA and the mRNA are expressed in the cells of interest.

How can I download or export miRNA target predictions from TargetScan?

If you're interested in predictions for just a few miRNA families, you can paste the web page table (such as predicted miRNA targets of miR-31) into your spreadsheet. You may first want to modify the subset of predictions, using the "View top predicted targets..." option near the top of the page.
If you're interested in predicted miRNA targets of a few genes, go to a gene page like Human ARSJ and click on "View table of miRNA sites". The tables on the resulting page, like Human ARSJ Table, can also be pasted into a spreadsheet.
For many or all predictions, download the zipped text files on the Data Download page of the desired TargetScan dataset. Some of these files may be too big for your spreadsheet, however, so you or someone else might need to use some Unix commands or a little programming to select the rows that you want to analyze.
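
As a rough illustration of the "little programming" mentioned above, the following Python sketch streams a zipped, tab-delimited text file row by row and keeps only the rows of interest, so the whole file never has to fit in a spreadsheet. The column names, the sample values, and the assumption that the archive holds a single tab-delimited file with a header row are all hypothetical; check the actual download file before adapting this.

```python
import csv
import io
import zipfile


def filter_rows(lines, column, wanted):
    """Yield rows (as dicts) whose value in `column` is in the set `wanted`."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row[column] in wanted:
            yield row


def filter_zipped_table(zip_path, column, wanted):
    """Stream the first file in a zip archive and filter it row by row."""
    with zipfile.ZipFile(zip_path) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from filter_rows(text, column, wanted)


# Small in-memory stand-in for a download file (made-up columns and values).
sample = io.StringIO(
    "miR Family\tGene Symbol\tSite type\n"
    "miR-X\tGENE_A\t3\n"
    "miR-Y\tGENE_B\t2\n"
)
for row in filter_rows(sample, "miR Family", {"miR-X"}):
    print(row["Gene Symbol"], row["Site type"])  # GENE_A 3
```

For a real download, `filter_zipped_table` would be called with the path of the zipped file and the name of a column that actually appears in its header line.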

How many unique genes are scanned by TargetScan?
- For Release 6, TargetScan looks for miRNA sites in this number of genes in each dataset:
  - TargetScanHuman = 18393 (30858 transcripts)
  - TargetScanMouse = 18615 (23795 transcripts)
  - TargetScanWorm = 16257
  - TargetScanFly = 14053
  A small number of transcripts (not included above) have an annotated 3' UTR less than 7 nt, the length of the shortest predicted sites.

How are 'star' miRNAs (such as hsa-let-7b*) included in the compilation of miRNA families?
- TargetScan ignores most star miRNAs, which are less often involved in functional or preferentially conserved interactions.

What is the relationship between miRBase miRNAs and families and those in TargetScan? Are they identical?

TargetScanHuman and Mouse 6 begin with all mature miRNAs from miRBase from species represented by the 3' UTR alignments. Note that the poorly conserved families also include small RNAs that have been misclassified as miRNAs.
The analysis pipeline for TargetScan 5 and earlier begins with mature miRNAs from miRBase, but in some cases we add or modify miRNA sequences to better reflect our current understanding of miRNA gene sets.
To confirm what miRNA families were used for target prediction, check on a TargetScan microRNA families page (such as for human Release 6.1) and/or the "miR Family" file on the Data Download page (such as for human Release 6.1). Also, since miRBase is updated more often than TargetScan, these databases can differ.
TargetScan does its own classification of miRNAs into families, based on identical seed region, so miRNA family classification and, even more likely, nomenclature can differ between these databases.
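
To illustrate what grouping by identical seed region means, here is a minimal Python sketch. It assumes that the seed region is the heptamer at positions 2-8 of the mature miRNA; that choice, the function names, and the toy sequences are assumptions made for illustration, not the TargetScan pipeline itself.

```python
from collections import defaultdict


def seed_region(mature_seq: str, start: int = 2, end: int = 8) -> str:
    """Return nucleotides start..end (1-based, inclusive) of a mature miRNA."""
    return mature_seq[start - 1:end]


def group_by_seed(mirnas: dict[str, str]) -> dict[str, list[str]]:
    """Group miRNA names by their seed region."""
    families = defaultdict(list)
    for name, seq in mirnas.items():
        families[seed_region(seq)].append(name)
    return dict(families)


# Toy sequences (made up): miR-A and miR-B share positions 2-8, miR-C does not.
toy_mirnas = {
    "miR-A": "UGAGGUAGUAGGUUGUAUAGUU",
    "miR-B": "AGAGGUAGCCCCCCCCCCCCCC",
    "miR-C": "UAAGGCACGCGGUGAAUGCCAA",
}
print(group_by_seed(toy_mirnas))
# {'GAGGUAG': ['miR-A', 'miR-B'], 'AAGGCAC': ['miR-C']}
```

Because families defined this way depend only on the seed region, two miRNAs with different names in another database can fall into the same family here, which is one reason nomenclature can differ.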

In TargetScan 5 Custom, nonconserved target sites are not shown; however, this information is available for annotated miRNAs. Is there a reason for this? Does TargetScan also predict nonconserved sites for all potential seed sequences?

TargetScan Custom does not show nonconserved sites because it would just be too much data to display, maintain, and visually interpret. Our pipeline for TargetScan Custom runs the code on all heptamers but then essentially ignores all nonconserved sites.
If you'd like to see all sites for a novel miRNA not included in our database, you can download the TargetScan code (from the bottom of a Data Download page such as that of TargetScanHuman 6.1) and run the TargetScan algorithm on any set of seed regions you want.

Where can I get more information about ...

The basic TargetScan (TargetScanS) algorithm? Lewis et al., 2005
Context scores? Grimson et al., 2007
Context+ scores? Garcia et al., 2011
Conserved branch length and P_CT? Friedman et al., 2009

I would like to include results from a TargetScan analysis in a paper we are about to submit. Can I use these data in my publication? How should I cite TargetScan? Are there copyright concerns?

You're free to use TargetScan data or code for your research and display TargetScan output in your presentations or papers, as long as you cite at least one TargetScan reference, preferably the most appropriate one based on the type of TargetScan data you used. These references are listed on the home page of each dataset of each release. The basic seed-based algorithm used by TargetScan is from Lewis et al., 2005. The use of context scores is from Grimson et al., 2007, and the use of preferential conservation (branch length and P_CT) is from Friedman et al., 2009. TargetScanWorm is from Jan et al., 2011, and TargetScanFly is from Ruby et al., 2007.

Data download

I'd like to download and use all predicted TargetScan targets for my project, but I'd like to limit the dataset to high-confidence sites. What thresholds for P_CT or context score should I use?
- It might be better to choose thresholds for targets (total context scores or aggregate P_CT) rather than thresholds for sites, because multiple weak sites for the same miRNA can add up to more repression than a single strong site.
- With regard to specific thresholds, there is no clear answer to this question.
  - The thresholds that you choose will depend on where you want to be in the trade-off between sensitivity and specificity (i.e., whether you want to avoid missing too many authentic targets or whether you want to avoid including too many false-positives).
  - The relative weight that you place on P_CT versus context score will depend on the extent that you value biological function (best scored by P_CT, our measure of evolutionary conservation) versus the biochemical response to the miRNA (best scored by context scores, our measure of targeting efficacy).
  - Further complicating the choice of cutoffs is the likelihood that the best cutoffs for one miRNA family will differ from those for another family. For example, less stringent cutoffs might be more appropriate for a miRNA expressed at higher levels in the cells you are studying. Also, don't forget that the poorly conserved families include many small RNAs that have been misclassified as miRNAs and many others that are expressed at levels too low to perform effective targeting.
  - For all these reasons, we rank the sites and targets using two independent alternative scoring schemes (P_CT and context score) and leave it for the users to decide, based on their priorities and their biological questions, which criteria and cutoffs to use.
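
The point about target-level rather than site-level thresholds can be sketched in Python as follows. The example assumes that more negative context scores indicate stronger predicted repression; the gene names, scores, and cutoff are made up for illustration and are not recommended values.

```python
from collections import defaultdict


def total_scores(sites):
    """Sum per-site scores for each (gene, miRNA family) pair."""
    totals = defaultdict(float)
    for gene, family, score in sites:
        totals[(gene, family)] += score
    return dict(totals)


# Made-up per-site context scores: GENE_A has two weak sites,
# GENE_B has one stronger site.
sites = [
    ("GENE_A", "miR-X", -0.15),
    ("GENE_A", "miR-X", -0.20),
    ("GENE_B", "miR-X", -0.30),
]

CUTOFF = -0.32  # hypothetical cutoff on the total score per target

totals = total_scores(sites)
selected = sorted(key for key, total in totals.items() if total <= CUTOFF)
print(selected)  # [('GENE_A', 'miR-X')]
```

Here no single site would pass the cutoff, but the two weak sites in GENE_A together do, whereas the single stronger site in GENE_B does not.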

What is meant by "Species ID" in the download tables?
- The "Species ID" comes from the NCBI Taxonomy database.
- We also have tables of species used in the most recent versions of TargetScanHuman and TargetScanMouse.

What do the numbers 1, 2, 3, etc. stand for in the "Site type" column of several tables?

These indicate that the site is classified as 7mer-1A (1), 7mer-m8 (2), or 8mer (3).

7mer-m8: An exact match to positions 2-8 of the mature miRNA (the seed + position 8)
7mer-1A: An exact match to positions 2-7 of the mature miRNA (the seed) followed by an 'A'
8mer: An exact match to positions 2-8 of the mature miRNA (the seed + position 8) followed by an 'A'

Negative numbers refer to 3' compensatory sites.
For TargetScanWorm 5.2, predictions were expanded to include six site types, as defined in Jan et al., 2011. The additional site types are
- 6mer (6 in the "Site type" column): An exact match to positions 2-7 of the mature miRNA (the seed)
- 6mer-1A (5): An exact match to positions 2-6 of the mature miRNA (the seed without position 7) followed by an 'A'
- 8mer-1U (4): An exact match to positions 2-8 of the mature miRNA (the seed + position 8) followed by a 'U'
- Note that 8mer sites (above) are now named "8mer-1A" to better differentiate them from 8mer-1U sites.
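
The site-type definitions above can be turned into the sequences one would look for in a 3' UTR. The Python sketch below is an illustration under two assumptions: that "exact match" means perfect Watson-Crick pairing (so the mRNA site is the reverse complement of the miRNA positions), and that "followed by" refers to the nucleotide immediately 3' of the match on the mRNA. The function names are hypothetical; the dictionary keys follow the "Site type" codes given above.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def revcomp(rna: str) -> str:
    """Reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def site_sequences(mature: str) -> dict[int, tuple[str, str]]:
    """Map each "Site type" code to (site name, mRNA site sequence 5'->3')."""
    m = mature.upper().replace("T", "U")
    match_2_6 = revcomp(m[1:6])  # pairs with miRNA positions 2-6
    match_2_7 = revcomp(m[1:7])  # pairs with miRNA positions 2-7 (the seed)
    match_2_8 = revcomp(m[1:8])  # pairs with miRNA positions 2-8
    return {
        1: ("7mer-1A", match_2_7 + "A"),
        2: ("7mer-m8", match_2_8),
        3: ("8mer-1A", match_2_8 + "A"),
        4: ("8mer-1U", match_2_8 + "U"),
        5: ("6mer-1A", match_2_6 + "A"),
        6: ("6mer", match_2_7),
    }


# Example with the mature let-7a sequence.
for code, (name, seq) in site_sequences("UGAGGUAGUAGGUUGUAUAGUU").items():
    print(code, name, seq)
# 1 7mer-1A UACCUCA
# 2 7mer-m8 CUACCUC
# 3 8mer-1A CUACCUCA
# 4 8mer-1U CUACCUCU
# 5 6mer-1A ACCUCA
# 6 6mer UACCUC
```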

In the "Family Conservation" column of the miR Family table, what do the numbers 2, 1, and 0 refer to?
- This indicates that a miRNA family is highly conserved (2), conserved (1), or poorly conserved (0).

I used BLAST to locate the TargetScan human 3' UTRs in the human genome. However, many of them didn't have perfect matches. Can you explain this?
- Some 3' UTRs are split between two or more exons, so you won't see one continuous match to the genome.
- You could also try BLAT, an effective tool for identifying genome positions (which can then be visualized in a genome browser).

The download files, such as those for TargetScanHuman Release 6.1, appear to contain data that differ from the web site. How can I recover the same prediction results present in the web pages?

The "Predicted Conserved Targets Info" file includes only data on conserved sites of conserved (and broadly conserved) miRNA families, i.e., the sites shown in the first and second tracks of a page like predicted targeting of HMGA2. Data in the third track ("Poorly conserved sites and sites for poorly conserved miRNA families") are not included. If you're interested in this third group of data, you'll also need the "Conserved Family Info" and "Nonconserved Family Info" files.
Context score data is split between download files in a different way. "Conserved site context scores" includes scores only for conserved sites (regardless of miRNA family conservation), whereas "Nonconserved site context scores" includes scores for nonconserved (poorly conserved) sites.

How do you indicate start and end positions of predicted miRNA sites in the Data Download files? Are these influenced by genomic position or gaps in the alignment? How do you account for introns within a 3' UTR?
- Start and end coordinates are calculated from spliced 3' UTRs, so introns are not included in the target prediction or in the coordinate position counts. We do have two different coordinate systems:
  - "UTR start" and "UTR end" indicate the position in that species' UTR, not counting alignment gaps.
  - "MSA start" and "MSA end" indicate the position in that species' UTR, counting alignment gaps in the Multiple Sequence Alignment.
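
The relationship between the two coordinate systems can be sketched in Python as follows. This is an illustration only: it assumes 1-based positions and "-" as the gap character, neither of which is stated above, and the aligned sequence is made up.

```python
def utr_to_msa(aligned_seq: str, utr_pos: int, gap: str = "-") -> int:
    """Convert a 1-based ungapped UTR position to a 1-based alignment column."""
    seen = 0  # number of non-gap characters passed so far
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            seen += 1
            if seen == utr_pos:
                return column
    raise ValueError("utr_pos is beyond the end of the ungapped sequence")


# Made-up aligned UTR row with three gap columns.
aligned = "AC--GU-CUACCUCA"

# A site at UTR positions 5-12 (ignoring gaps) ...
utr_start, utr_end = 5, 12
# ... sits at alignment columns 8-15 once gaps are counted.
print(utr_to_msa(aligned, utr_start), utr_to_msa(aligned, utr_end))  # 8 15
```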

We would like to download the TargetScan database. Is it possible to obtain the schema of the database so we can better understand the design?

At this time we only provide the files on the Data Download pages. To help with understanding these files, keep in mind that conservation can refer to a miRNA family and to a predicted miRNA target site -- and that these are independent. Also, you may want to differentiate data that applies to a miRNA family (position of a site, site type, conserved branch length, P_CT) from data that applies to a specific miRNA (context score).

TargetScan code

I am interested in using TargetScan on a wide set of data. Is it possible to run batch queries or download the code and run the program locally?
- We have code you can use for custom predictions at the bottom of the Data Download pages.
- One script predicts miRNA sites, and the other script calculates context scores for these sites.
- Each zipped archive includes code, instructions, and sample data.

Do I need to use UTRs from multiple species for miRNA target prediction?
- No, the TargetScan code runs fine with just one species. The main drawback of a single-species analysis is that you can't take advantage of comparative genomics to help differentiate sites that have been conserved during evolution (which are more likely to be functional and more likely to be biologically consequential) from those that haven't.