Frequently Asked Questions
What is a motif?
The Merriam-Webster dictionary defines a motif as, "a usually recurring salient thematic element". If we transfer this definition to the biological realm, motifs are basically recurring short sequence elements. Their over-representation usually implies some functional significance. We see motifs everywhere in biological sequences. Some common examples are the TATA box for RNA polymerase binding, the KDEL endoplasmic reticulum sorting signal, and the SPxK/R cyclin-dependent kinase recognition motif.
Can motif-x predict potential phosphorylation sites in my protein?
The simple answer is no. motif-x is a tool designed to find over-represented sequence patterns, not phosphorylation sites.
What is motif-x designed to do?
Initially, motif-x was designed to extract phosphorylation motifs from large tandem mass spectrometry-based data sets by making use of the intrinsic alignment of the data around the phosphorylation sites. Since then we have discovered that motif-x is well suited to extracting motifs from practically any sequence-based data set, with the simple caveat that the data first needs to be "pseudo-aligned" by centering it on a particular residue (motif-x does this for you; you just need to specify the residue to center on).
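As an illustration of what "pseudo-aligning" on a residue means, here is a minimal Python sketch. It is not the motif-x source code: the function name center_on_residue is hypothetical, and skipping windows that run off the end of a sequence is an assumption made for simplicity.

```python
def center_on_residue(sequence, central="S", width=13):
    """Return every window of `width` characters centered on `central`.

    Assumption (not from the FAQ): windows that would run past either
    end of the sequence are simply skipped.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that there is a single center")
    flank = width // 2
    windows = []
    for i, residue in enumerate(sequence):
        if residue != central:  # comparison is case-sensitive
            continue
        left, right = i - flank, i + flank + 1
        if left >= 0 and right <= len(sequence):
            windows.append(sequence[left:right])
    return windows


if __name__ == "__main__":
    # Every window printed here has the central residue in its middle column.
    for window in center_on_residue("MKRSRSASPEEK", central="S", width=5):
        print(window)
```

Every sequence returned has the chosen residue in the same column, which is what lets position-specific frequencies be compared across the data set.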
What is the central character parameter?
Since motif-x was initially designed to extract motifs from large data sets of modified peptides (usually phosphorylated tryptic peptides generated by large-scale tandem mass spectrometry experiments), the central character initially meant the modified residue. We now realize, however, that motif-x can extract motifs from any data set, but it still requires a residue on which to create a "pseudo-alignment" of the data. Note that the central character is case-sensitive, and must be a character that is actually found in the data set.
What is the width parameter?
The width is the total number of positions in the motifs that are output. That is, it is the sum of the wildcard and non-wildcard positions (for example, the motif xRxRxxSxPxxxx has a width of 13, while the motif xxxSDxE has a width of 7). Since there is a fixed central residue, the width should always be an odd number. Most protein motifs are in the range of 6-8 amino acids long, so our default value of 13 usually captures most of this information.
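The two example motifs above can be checked with a short Python sketch. The helper describe_motif is hypothetical, and it assumes the motif is written out at full width with "x" as the wildcard and the central residue in the middle position.

```python
def describe_motif(motif, wildcard="x"):
    """Return (width, fixed) for a motif written at full width.

    `fixed` maps each non-wildcard position, numbered relative to the
    central residue (0), to its character. Assumes the central residue
    sits in the middle of the string.
    """
    width = len(motif)
    center = width // 2
    fixed = {i - center: ch for i, ch in enumerate(motif) if ch != wildcard}
    return width, fixed


if __name__ == "__main__":
    for motif in ("xRxRxxSxPxxxx", "xxxSDxE"):
        print(motif, describe_motif(motif))
```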
What is the occurrences parameter?
The occurrences parameter is the minimum number of times a motif must occur in the data set. If you have an enormous data set of thousands of proteins, you may want to set the occurrence threshold high (say, 50) to get generalized motifs. On the other hand, if you are looking for a motif in only a handful of proteins, chances are no single motif occurs 50 times, so you will probably want to set a low occurrence threshold (maybe 5 or so).
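To make the idea of an occurrence threshold concrete, the following Python sketch counts how many aligned peptides match a motif and compares the count with a threshold. The function names and the toy peptides are made up for illustration; this is not how motif-x itself is implemented.

```python
def matches(motif, peptide, wildcard="x"):
    """True if `peptide` agrees with `motif` at every non-wildcard position."""
    return len(motif) == len(peptide) and all(
        m == wildcard or m == p for m, p in zip(motif, peptide)
    )


def count_occurrences(motif, peptides):
    """Number of aligned peptides that match the motif."""
    return sum(matches(motif, p) for p in peptides)


def passes_threshold(motif, peptides, occurrences=20):
    """True if the motif occurs at least `occurrences` times."""
    return count_occurrences(motif, peptides) >= occurrences


if __name__ == "__main__":
    toy_peptides = ["ARASAPK", "KRGSLPE", "AAASAPK"]  # illustrative data only
    print(count_occurrences("xRxSxPx", toy_peptides))
    print(passes_threshold("xRxSxPx", toy_peptides, occurrences=5))
```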
What is the significance parameter?
The significance parameter refers to the maximal P-value needed to "fix" a given position in the motif. The P-value depends in part on the size of the foreground and background data sets. If you are not finding any motifs in your data set, it may be necessary to use a less stringent significance threshold (that is, a larger maximal P-value).
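This FAQ does not spell out how the P-value is computed, so the following Python sketch is only an illustration under a stated assumption: a one-sided binomial test, where n is the number of foreground peptides, k is how many of them carry a given residue at a given position, and p is that residue's background frequency at the same position. The function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def position_is_significant(k, n, p, significance=0.000001):
    """True if the residue/position pair would be 'fixed' at this threshold.

    Assumes a binomial model; the actual motif-x test may differ.
    """
    return binomial_tail(k, n, p) <= significance


if __name__ == "__main__":
    # Same 50% enrichment over a 10% background, at two data set sizes.
    print(binomial_tail(5, 10, 0.1))
    print(binomial_tail(50, 100, 0.1))
```

Under this model the same relative enrichment gives a much smaller P-value in the larger data set, which is one way to see why small data sets may need a less stringent threshold.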
What is the background parameter?
The background parameter is used to calculate the amino acid frequency distribution in background data. motif-x calculates this distribution dynamically throughout the motif building process, and this is at the heart of why motif-x is successful at extracting biologically relevant motifs. If you have a tandem mass spectrometry-based phosphorylation data set that came from yeast, then a yeast background should be selected. If you do not see an appropriate background from our list, then you can always upload your own background data set. (Because this website is a public resource, uploaded data sets are currently limited to a maximum file size of 10 MB.)
What does "use foreground, unaligned" (in the background parameters) mean?
Since motif-x creates a pseudo-alignment of FASTA data on a central residue, an appropriate background for a large FASTA data set would be to use the set of all non-centered peptides of a given width as the background. In this way, you can actually use your foreground data set also as your background (this is what we mean by "use foreground, unaligned"). Sometimes this works better, sometimes it does not. We suggest trying several different backgrounds to get optimal results. Also, since using foreground, unaligned can sometimes produce a relatively small background data set, it may be necessary to relax significance thresholds to get back motifs.
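One reading of "use foreground, unaligned" is sketched below in Python: every window of the chosen width is taken from the foreground sequences, regardless of which residue sits in the middle. This interpretation, and the function name unaligned_background, are assumptions and not taken from the motif-x implementation.

```python
def unaligned_background(sequences, width=13):
    """Collect every window of `width` characters from each sequence.

    Windows are taken at every start position, so they are not centered
    on any particular residue. Sequences shorter than `width` contribute
    nothing.
    """
    background = []
    for seq in sequences:
        for start in range(len(seq) - width + 1):
            background.append(seq[start:start + width])
    return background


if __name__ == "__main__":
    print(unaligned_background(["ABCDE", "AB"], width=3))
```

A sequence of length L contributes at most L - width + 1 windows, so a small FASTA file yields a small background, which is consistent with the advice above about relaxing significance thresholds.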
What is an MS/MS format analysis?
An MS/MS type analysis is an analysis of motifs from the peptides returned by a tandem mass spectrometry experiment. For example, if you did an MS/MS experiment where you enriched for tryptic phospho-peptides using SCX chromatography or IMAC resin and you searched your data using the SEQUEST algorithm, then you could upload your data directly to motif-x and search for phosphorylation motifs. SEQUEST output will often label phosphorylated residues as S* or S#, but you can use this data directly with motif-x. Just make sure you specify your central residue as S* or S# (or whatever it actually is). Note that specifying S* as the central residue when your data contains no "S*" characters will return no motifs.
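The following Python sketch shows one way to locate modified residues such as S* in a peptide string and recover the plain sequence. It is illustrative only: find_modified_sites is a hypothetical helper, and dropping markers attached to other residues is an assumption, not documented motif-x behavior.

```python
def find_modified_sites(peptide, central="S*"):
    """Return (plain_sequence, sites) for a marked-up peptide.

    `sites` holds the index, within the plain sequence, of every residue
    that appeared in the input as `central` (for example "S*").
    Assumption: any other non-letter marker (such as the "*" in "T*")
    is dropped from the plain sequence.
    """
    plain, sites = [], []
    i = 0
    while i < len(peptide):
        if peptide.startswith(central, i):
            sites.append(len(plain))
            plain.append(central[0])
            i += len(central)
        elif peptide[i].isalpha():
            plain.append(peptide[i])
            i += 1
        else:
            i += 1  # drop a marker attached to some other residue
    return "".join(plain), sites


if __name__ == "__main__":
    print(find_modified_sites("RLS*PEKS", central="S*"))
    # A central residue that never occurs yields no sites, hence no motifs.
    print(find_modified_sites("AS#PT*K", central="S*"))
```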
What does "extend from" mean?
The "extend from" option is only relevant for MS/MS data since this data is usually composed of tryptic peptides, which have incomplete sequence information. The extend option allows users to map their peptides back to the appropriate proteomic database to extend those peptides on their N and C termini so that a complete motif analysis can be performed. Currently, this option is only available for the human, mouse and yeast proteomes. Peptides that cannot be found in the specified "extend from" database will be directly extracted from the MS/MS sequences provided where there are a sufficient number of residues surrounding the central residue to make a peptide that ultimately contains "width" characters. Otherwise, they will be reported statistically as "Not Found", and are ignored for subsequent analysis.
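The mapping and fallback described above can be sketched in Python as follows. The function extend_peptide and the representation of the proteome as a plain list of protein strings are hypothetical, and the handling of peptides that sit too close to a protein terminus is an assumption.

```python
def extend_peptide(plain_peptide, site, proteome, width=13):
    """Return a `width`-long window centered on `site`, or None.

    `site` is the index of the central residue within `plain_peptide`.
    `proteome` is a list of protein sequences (a stand-in for the real
    "extend from" database). Returning None corresponds to "Not Found".
    """
    flank = width // 2
    # First try to extend using the surrounding protein sequence.
    for protein in proteome:
        start = protein.find(plain_peptide)
        if start == -1:
            continue
        center = start + site
        left, right = center - flank, center + flank + 1
        if left >= 0 and right <= len(protein):
            return protein[left:right]
    # Fall back to the peptide itself if it has enough flanking residues.
    left, right = site - flank, site + flank + 1
    if left >= 0 and right <= len(plain_peptide):
        return plain_peptide[left:right]
    return None


if __name__ == "__main__":
    toy_proteome = ["MAAAKRLSPEKSGGGTTT"]  # illustrative sequence only
    print(extend_peptide("RLSPEK", 2, toy_proteome, width=7))
```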
What is "pre-aligned" format?
A pre-aligned format is simply one in which the user has already aligned the data set on a particular residue, and all members of the data set are exactly "width" characters wide. In this type of analysis, entries should be one per line, separated by carriage returns. Other data set formats are pre-processed into "pre-aligned" data sets as a distinct phase before use in the motif-x algorithm. These "pre-aligned" data sets may be saved for later (and possibly more rapid) analysis by using your browser's Save As option on the link provided in the results output.
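Before uploading a pre-aligned file, it can help to check it locally. The Python sketch below is a hypothetical checker, not part of motif-x; it assumes a single-character central residue and ignores blank lines.

```python
def check_prealigned(text, central="S", width=13):
    """Return a list of (line_number, problem) for a pre-aligned data set.

    Each non-blank line must be exactly `width` characters long and have
    `central` in its middle position. Assumes a one-character central
    residue.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue
        if len(entry) != width:
            problems.append((lineno, "wrong length"))
        elif entry[width // 2] != central:
            problems.append((lineno, "central character not in the middle"))
    return problems


if __name__ == "__main__":
    sample = "AAASAAA\nAAAAAAA\nAASAA\n"
    for lineno, problem in check_prealigned(sample, central="S", width=7):
        print(lineno, problem)
```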
What is "text" format?
Text format refers to an analysis in which the input data is linguistic in nature. In this mode motif-x will remove all punctuation and spaces, and will return language motifs. To do a text analysis the user should always either choose "use foreground, unaligned" or upload their own linguistic background, since using a proteomic background would not make much sense if you were looking for language motifs.
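The pre-processing described for text mode can be approximated with the Python sketch below. It is a guess at the behavior, not the motif-x code: normalize_text is a hypothetical name, only ASCII punctuation is removed, and letter case is left unchanged.

```python
import string


def normalize_text(text):
    """Remove ASCII punctuation and all whitespace from `text`.

    Assumptions: non-ASCII punctuation is kept and case is not changed.
    """
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in text if ch not in drop)


if __name__ == "__main__":
    print(normalize_text("Hello, world! It's 9 a.m."))
```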
Can motif-x find motifs in DNA data?
Yes. At the current time, however, none of the background data sets are DNA-based, so if you wish to do a DNA analysis you will need to upload your own background data set or use the "use foreground, unaligned" background option if you have a fairly large FASTA file. You should also keep in mind that it may be necessary to lower the significance threshold a good deal for DNA analyses.
Do you have any test data sets that I could try out?
Sure. You could try the protein-based test set we described in our paper. Download it here. Then upload it onto motif-x either by cutting and pasting the sequences or by uploading the actual file. You should then use the following parameters:
Foreground format = FASTA
Central character = S
Width = 13
Occurrences = 20
Significance = 0.000001
Background = "use foreground, unaligned"
If you did everything correctly you should get back the following 5 motifs: DxxSQxN, RxSxxL, TVxSxE, RxSxxP, and KSxxxI.
I think I am using all of the parameters correctly, but I still don't get any motifs back?
Try relaxing your significance threshold (that is, allowing a larger P-value) and/or lowering your occurrence threshold. If you still do not get any motifs back, then either: a) contact Dan (see next FAQ), b) accept the fact that you don't have any motifs in your data, or c) try using another motif discovery tool (not suggested).
If I have a question, can I contact the author?
Yes. And he is always willing to help you out with your motif analysis. Just contact him at daniel.schwartz(at)uconn.edu.
How do I reference motif-x?
An article describing the motif-x algorithm was published in Nature Biotechnology (November 2005). The paper and reference are both available from the main page of this site (http://motif-x.med.harvard.edu).