PeptideIndexer

Refreshes the protein references for all peptide hits from an idXML file and adds target/decoy information.

potential predecessor tools -> PeptideIndexer -> potential successor tools
IDFilter (or any protein/peptide processing tool) -> PeptideIndexer -> FalseDiscoveryRate

Each peptide hit is annotated with a target_decoy string that indicates whether the peptide sequence is found in a 'target' protein, a 'decoy' protein, or in both ('target+decoy'). This information is crucial for the FalseDiscoveryRate and IDPosteriorErrorProbability tools.

Note
Make sure that the protein names in your database contain a correctly formatted decoy string. This can be ensured by using DecoyDatabase. If the decoy identifier is not recognized, all proteins will be assumed to stem from the target part of the query.
E.g., "sw|P33354_REV|YEHR_ECOLI Uncharacterized lipop..." is invalid, since the tool has no knowledge of how SwissProt entries are built up. A correct identifier could be "rev_sw|P33354|YEHR_ECOLI Uncharacterized li ..." or "sw|P33354|YEHR_ECOLI_rev Uncharacterized li", depending on whether you are using prefix annotation or not.
This tool will also give you some target/decoy statistics when it is done. Look carefully!
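
The decoy recognition described in this note can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names is_decoy and target_decoy_label are hypothetical, and the sketch assumes that the accession is the text before the first whitespace and that the comparison is case-sensitive.

```python
def is_decoy(identifier: str, decoy_string: str = "_rev", prefix: bool = False) -> bool:
    """Return True if the accession carries decoy_string at the expected end."""
    # Assumption: the accession is everything before the first whitespace.
    accession = identifier.split()[0]
    if prefix:
        return accession.startswith(decoy_string)
    return accession.endswith(decoy_string)


def target_decoy_label(identifiers, decoy_string: str = "_rev", prefix: bool = False) -> str:
    """Combine all proteins a peptide maps to into one target_decoy string."""
    flags = {is_decoy(i, decoy_string, prefix) for i in identifiers}
    if flags == {True}:
        return "decoy"
    if flags == {False}:
        return "target"
    if flags == {True, False}:
        return "target+decoy"
    raise ValueError("peptide is not assigned to any protein")


suffix_ok = "sw|P33354|YEHR_ECOLI_rev Uncharacterized li"
prefix_ok = "rev_sw|P33354|YEHR_ECOLI Uncharacterized li ..."
invalid = "sw|P33354_REV|YEHR_ECOLI Uncharacterized lipop..."

print(is_decoy(suffix_ok))                        # suffix annotation
print(is_decoy(prefix_ok, "rev_", prefix=True))   # prefix annotation
print(is_decoy(invalid))                          # decoy string buried inside the accession
```

In this sketch the invalid identifier is treated as a target in both modes, which mirrors the behaviour the note warns about.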

This tool supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by OpenMS.ini:id_db_dir (see TOPP for Advanced Users).

By default the tool will fail if an unmatched peptide occurs, i.e. the database does not contain the corresponding protein. You can force the tool to return successfully in this case by using the flag allow_unmatched.

Some search engines (such as Mascot) will replace ambiguous amino acids ('B', 'Z', and 'X') in the protein database with unambiguous amino acids in the reported peptides, e.g. exchange 'X' with 'H'. This will cause this peptide not to be found by exactly matching its sequence to the database. However, we can recover these cases by using tolerant search (done automatically).

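Two search modes are available:

exact (default): peptide sequences must match the protein database exactly. Peptides that remain unassigned are then matched again using tolerant search.
tolerant (flag full_tolerant_search): all peptide sequences are matched using tolerant search, which allows up to aaa_max ambiguous amino acids in the protein sequence.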

Independent of whether exact or tolerant search is used, we require ambiguous amino acids in peptide sequences to match exactly in the protein DB (i.e. 'X' in a peptide only matches 'X' in the database).
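
As an illustration of this matching rule, the following Python sketch scans a protein for a peptide while tolerating ambiguous residues in the database only. It is a naive scan, not the algorithm used by the tool. The name find_tolerant is hypothetical, the sketch assumes the usual meaning of the codes ('B' = D or N, 'Z' = E or Q, 'X' = any residue), and it assumes that only database positions where an ambiguous residue is actually resolved count towards aaa_max.

```python
# Assumption: standard meaning of the ambiguity codes.
AMBIGUOUS = {
    "B": set("DN"),
    "Z": set("EQ"),
    "X": set("ACDEFGHIKLMNPQRSTVWY"),
}


def find_tolerant(peptide: str, protein: str, aaa_max: int = 4) -> list:
    """Return all start positions where peptide matches protein tolerantly."""
    if not peptide:
        return []
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        used = 0  # ambiguous database residues resolved so far
        for pep_aa, db_aa in zip(peptide, protein[start:start + len(peptide)]):
            if pep_aa == db_aa:
                continue  # identical letters always match, including 'X' vs 'X'
            if db_aa in AMBIGUOUS and pep_aa in AMBIGUOUS[db_aa]:
                used += 1
                if used > aaa_max:
                    break
            else:
                break  # mismatch, or ambiguous residue on the peptide side
        else:
            hits.append(start)
    return hits


print(find_tolerant("PEPHIDE", "AAPEPXIDEAA"))  # 'H' reported for a database 'X'
print(find_tolerant("PEPXIDE", "AAPEPHIDEAA"))  # peptide 'X' needs a database 'X'
```

The second call finds nothing because the ambiguous residue sits in the peptide rather than in the database.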

The exact mode is much faster (about 10 times) and consumes less memory (about 2.5 times), but might fail to report a few protein hits with ambiguous amino acids for some peptides. Usually these proteins are putative, however. The exact mode also supports usage of multiple threads (threads option) to speed up computation even further, at the cost of some memory. This is only for the exact search (Aho-Corasick algorithm), however. If tolerant searching needs to be done for unassigned peptides, the latter will consume the major share of the runtime.

Further complications can arise due to the presence of the isobaric amino acids isoleucine ('I') and leucine ('L') in protein sequences. Since the two have the exact same chemical composition and mass, they generally cannot be distinguished by mass spectrometry. If a peptide containing 'I' was reported as a match for a spectrum, a peptide containing 'L' instead would be an equally good match (and vice versa). To account for this inherent ambiguity, setting the flag IL_equivalent causes 'I' and 'L' to be considered as indistinguishable.
For example, if the sequence "PEPTIDE" (matching "Protein1") was identified as a search hit, but the database additionally contained "PEPTLDE" (matching "Protein2"), running PeptideIndexer with the IL_equivalent option would report both "Protein1" and "Protein2" as accessions for "PEPTIDE". (This is independent of the error-tolerant search controlled by full_tolerant_search and aaa_max.)
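
The example above can be reproduced with a small Python sketch. The helper names and the two protein sequences are made up for illustration, and mapping 'I' to 'L' before comparison is only one possible way to treat the two residues as equal.

```python
def normalize_il(sequence: str) -> str:
    """Map isoleucine to leucine so that both compare as equal."""
    return sequence.replace("I", "L")


def find_accessions(peptide: str, database: dict, il_equivalent: bool = False) -> list:
    """Return the accessions of all proteins that contain the peptide."""
    query = normalize_il(peptide) if il_equivalent else peptide
    hits = []
    for accession, sequence in database.items():
        subject = normalize_il(sequence) if il_equivalent else sequence
        if query in subject:
            hits.append(accession)
    return hits


# Hypothetical database with one 'I' and one 'L' variant of the same stretch.
database = {"Protein1": "MKAPEPTIDEK", "Protein2": "MKAPEPTLDEK"}

print(find_accessions("PEPTIDE", database))                      # ['Protein1']
print(find_accessions("PEPTIDE", database, il_equivalent=True))  # ['Protein1', 'Protein2']
```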

Once a peptide sequence is found in a protein sequence, this does not imply that the hit is valid! This is where enzyme specificity comes into play. By default, we demand that the peptide is fully tryptic (i.e. the enzyme parameter is set to "trypsin" and specificity is "full"). So unless the peptide coincides with C- and/or N-terminus of the protein, the peptide's cleavage pattern should fulfill the trypsin cleavage rule [KR][^P]. We make one exception for peptides starting at the second amino acid of a protein if the first amino acid of that protein is methionine (M), which is usually cleaved off in vivo. For example, the two peptides AAAR and MAAAR would both match a protein starting with MAAAR.

You can relax the requirements further by choosing semi-tryptic (only one of two "internal" termini must match requirements) or none (essentially allowing all hits, no matter their context).
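
The three specificity levels can be sketched in Python as follows. This is a simplified illustration for trypsin only, not the tool's code: the function names and the example protein are hypothetical, and the sketch assumes that the methionine exception counts as a valid N-terminal context for 'semi' as well.

```python
def cleaves_between(protein: str, i: int) -> bool:
    """True if trypsin may cut between protein[i-1] and protein[i], i.e. [KR][^P]."""
    return protein[i - 1] in "KR" and protein[i] != "P"


def is_valid_hit(protein: str, start: int, length: int, specificity: str = "full") -> bool:
    """Check the cleavage context of a peptide found at protein[start:start+length]."""
    if specificity == "none":
        return True  # every hit is accepted, the enzyme plays no role
    end = start + length  # index of the first residue after the peptide
    n_ok = (
        start == 0                              # peptide starts at the protein N-terminus
        or (start == 1 and protein[0] == "M")   # initial methionine cleaved off
        or cleaves_between(protein, start)
    )
    c_ok = end == len(protein) or cleaves_between(protein, end)
    if specificity == "full":
        return n_ok and c_ok
    if specificity == "semi":
        return n_ok or c_ok
    raise ValueError("specificity must be 'full', 'semi' or 'none'")


protein = "MAAARGGKPEPTIDERLLK"  # hypothetical sequence

for peptide in ("MAAAR", "AAAR", "GGK", "PTID"):
    start = protein.find(peptide)
    print(peptide, [is_valid_hit(protein, start, len(peptide), s)
                    for s in ("full", "semi", "none")])
```

Here "GGK" fails the full requirement because it is followed by proline, while "PTID" is only accepted with specificity 'none'.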

Note
Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

PeptideIndexer -- Refreshes the protein references for all peptide hits.
Version: 2.0.0 Aug 19 2015, 22:19:33, Revision: GIT-NOTFOUND

Usage:
  PeptideIndexer <options>

Options (mandatory options marked with '*'):
  -in <file>*                     Input idXML file containing the identifications. (valid formats: 'idXML')
  -fasta <file>*                  Input sequence database in FASTA format. Non-existing relative filenames 
                                  are looked up via 'OpenMS.ini:id_db_dir' (valid formats: 'fasta')
  -out <file>*                    Output idXML file. (valid formats: 'idXML')
  -decoy_string <string>          String that was appended (or prefixed - see 'prefix' flag below) to the 
                                  accessions in the protein database to indicate decoy proteins. (default:
                                  '_rev')
  -missing_decoy_action <action>  Action to take if NO peptide was assigned to a decoy protein (which
                                  indicates wrong database or decoy string): 'error' (exit with error,
                                  no output), 'warn' (exit with success, warning message) (default:
                                  'error' valid: 'error', 'warn')

The enzyme determines valid cleavage sites; cleavage specificity determines to what extent validity is
enforced:
  -enzyme:name                    Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after
                                  lysine (K) or arginine (R), but not before proline (P). (default:
                                  'Trypsin' valid: 'Trypsin', 'Trypsin/P')
  -enzyme:specificity             Specificity of the enzyme.
                                  'full': both internal cleavage sites must match.
                                  'semi': one of two internal cleavage sites must match.
                                  'none': allow all peptide hits no matter their context. Therefore, the
                                  enzyme chosen does not play a role here (default: 'full' valid: 'full',
                                  'semi', 'none')

  -prefix                         If set, protein accessions in the database contain 'decoy_string' as
                                  prefix.
  -annotate_proteins              If set, add target/decoy information to proteins (as well as peptides).
  -write_protein_sequence         If set, the protein sequences are stored as well.
  -write_protein_description      If set, the protein description is stored as well.
  -keep_unreferenced_proteins     If set, protein hits which are not referenced by any peptide are kept.
  -allow_unmatched                If set, unmatched peptide sequences are allowed. By default (i.e. if this
                                  flag is not set) the program terminates with an error on unmatched
                                  peptides.
  -full_tolerant_search           If set, all peptide sequences are matched using tolerant search. Thus
                                  potentially more proteins (containing ambiguous amino acids) are
                                  associated. This is much slower!
  -aaa_max <number>               Maximal number of ambiguous amino acids (AAA) allowed when matching to a 
                                  protein database with AAA's. AAA's are 'B', 'Z' and 'X' (default: '4' min:
                                  '0')
  -IL_equivalent                  Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as
                                  equivalent (indistinguishable)
                                  
Common TOPP options:
  -ini <file>                     Use the given TOPP INI file
  -threads <n>                    Sets the number of threads allowed to be used by the TOPP tool (default: 
                                  '1')
  -write_ini <file>               Writes the default configuration file
  --help                          Shows options
  --helphelp                      Shows all options (including advanced)
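
As a usage illustration, the following Python sketch assembles a PeptideIndexer call from the options listed above. It only builds the argument list and does not run the tool. The helper name build_command and the file names are hypothetical, and passing the value of -enzyme:specificity as a separate argument is an assumption based on the default and valid values shown in the help text.

```python
def build_command(in_file: str, fasta: str, out_file: str,
                  decoy_string: str = "_rev", prefix: bool = False,
                  specificity: str = "full", allow_unmatched: bool = False,
                  il_equivalent: bool = False) -> list:
    """Assemble the argument list for a PeptideIndexer call."""
    if specificity not in ("full", "semi", "none"):
        raise ValueError("specificity must be 'full', 'semi' or 'none'")
    cmd = [
        "PeptideIndexer",
        "-in", in_file,          # mandatory
        "-fasta", fasta,         # mandatory
        "-out", out_file,        # mandatory
        "-decoy_string", decoy_string,
        "-enzyme:specificity", specificity,
    ]
    # Boolean options are flags: they are present or absent and take no value.
    if prefix:
        cmd.append("-prefix")
    if allow_unmatched:
        cmd.append("-allow_unmatched")
    if il_equivalent:
        cmd.append("-IL_equivalent")
    return cmd


# Hypothetical file names; the decoy accessions are assumed to start with "rev_".
cmd = build_command("ids.idXML", "target_decoy.fasta", "ids_indexed.idXML",
                    decoy_string="rev_", prefix=True, il_equivalent=True)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a system where PeptideIndexer is installed and on the PATH.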

INI file documentation of this tool:

Legend: required parameters are marked with '*'.

PeptideIndexer: Refreshes the protein references for all peptide hits.
  version (default: '2.0.0'): Version of the tool that generated this parameters file.
  1: Instance '1' section for 'PeptideIndexer'
    in* (input file, '*.idXML'): Input idXML file containing the identifications.
    fasta* (input file, '*.fasta'): Input sequence database in FASTA format. Non-existing relative filenames are looked up via 'OpenMS.ini:id_db_dir'
    out* (output file, '*.idXML'): Output idXML file.
    decoy_string (default: '_rev'): String that was appended (or prefixed - see 'prefix' flag below) to the accessions in the protein database to indicate decoy proteins.
    missing_decoy_action (default: 'error'; valid: 'error', 'warn'): Action to take if NO peptide was assigned to a decoy protein (which indicates wrong database or decoy string): 'error' (exit with error, no output), 'warn' (exit with success, warning message)
    prefix (default: 'false'; valid: 'true', 'false'): If set, protein accessions in the database contain 'decoy_string' as prefix.
    annotate_proteins (default: 'false'; valid: 'true', 'false'): If set, add target/decoy information to proteins (as well as peptides).
    write_protein_sequence (default: 'false'; valid: 'true', 'false'): If set, the protein sequences are stored as well.
    write_protein_description (default: 'false'; valid: 'true', 'false'): If set, the protein description is stored as well.
    keep_unreferenced_proteins (default: 'false'; valid: 'true', 'false'): If set, protein hits which are not referenced by any peptide are kept.
    allow_unmatched (default: 'false'; valid: 'true', 'false'): If set, unmatched peptide sequences are allowed. By default (i.e. if this flag is not set) the program terminates with an error on unmatched peptides.
    full_tolerant_search (default: 'false'; valid: 'true', 'false'): If set, all peptide sequences are matched using tolerant search. Thus potentially more proteins (containing ambiguous amino acids) are associated. This is much slower!
    aaa_max (default: '4'; range: 0:∞): Maximal number of ambiguous amino acids (AAA) allowed when matching to a protein database with AAA's. AAA's are 'B', 'Z' and 'X'
    IL_equivalent (default: 'false'; valid: 'true', 'false'): Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equivalent (indistinguishable)
    log: Name of log file (created only when specified)
    debug (default: '0'): Sets the debug level
    threads (default: '1'): Sets the number of threads allowed to be used by the TOPP tool
    no_progress (default: 'false'; valid: 'true', 'false'): Disables progress logging to command line
    force (default: 'false'; valid: 'true', 'false'): Overwrite tool specific checks.
    test (default: 'false'; valid: 'true', 'false'): Enables the test mode (needed for internal use only)
    enzyme: The enzyme determines valid cleavage sites; cleavage specificity determines to what extent validity is enforced.
      name (default: 'Trypsin'; valid: 'Trypsin', 'Trypsin/P'): Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after lysine (K) or arginine (R), but not before proline (P).
      specificity (default: 'full'; valid: 'full', 'semi', 'none'): Specificity of the enzyme.
        'full': both internal cleavage sites must match.
        'semi': one of two internal cleavage sites must match.
        'none': allow all peptide hits no matter their context. Therefore, the enzyme chosen does not play a role here

OpenMS / TOPP release 2.0.0 Documentation generated on Thu Aug 20 2015 01:44:31 using doxygen 1.8.9.1