
                               skipredundant 



Function

   Remove redundant sequences from an input set

Description

   Redundancy in a database or other collection of sequences occurs when
   one or more similar sequences are present. The inclusion of very
   similar sequences in certain analyses will introduce undesirable bias.
   For example, a family may possess 100 sequences in the sequence
   database, but 90 of these might be essentially the same sequence, e.g.
   very close relatives or mutations of a single sequence. Although 100
   sequences are known, the family only contains 11 sequences that are
   essentially unique. For many applications it is desirable or even
   essential to remove redundant sequences from a set in order to produce
   a smaller set that is representative of the whole. SEQNR removes
   redundancy from an input file of sequences, either at a single
   threshold of sequence similiarty (e.g. 40%) or within a threshold
   range of sequence similiarty (e.g. 40% - 70%).

Algorithm

   Redundancy is calculated for each pair of sequences in turn.

Usage

   Here is a sample session with skipredundant


% skipredundant -threshold 80 -keepredundant 
Remove redundant sequences from an input set
Input sequence set: globins.fasta
Redundancy removal options
         1 : Single threshold percentage sequence similarity
         2 : Outside a range of acceptable threshold percentage similarities
Select number [1]: 
Gap opening penalty [10.0]: 
Gap extension penalty [0.5]: 
output sequence(s) [globins.keep]: 
Second output sequence(s) [globins.redundant]: 

   Go to the input files for this example
   Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-sequences]         seqset     Sequence set filename and optional format,
                                  or reference (input USA)
   -mode               menu       [1] This option specifies whether to remove
                                  redundancy at a single threshold percentage
                                  sequence similarity or remove redundancy
                                  outside a range of acceptable threshold
                                  percentage similarity. All permutations of
                                  pair-wise sequence alignments are calculated
                                  for each set of input sequences in turn
                                  using the EMBOSS implementation of the
                                  Needleman and Wunsch global alignment
                                  algorithm. Redundant sequences are removed
                                  in one of two modes as follows: (i) If a
                                  pair of proteins achieve greater than a
                                  threshold percentage sequence similarity
                                  (specified by the user) the shortest
                                  sequence is discarded. (ii) If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range (specified by the user) the shortest
                                  sequence is discarded. (Values: 1 (Single
                                  threshold percentage sequence similarity); 2
                                  (Outside a range of acceptable threshold
                                  percentage similarities))
*  -threshold          float      [95.0] This option specifies the percentage
                                  sequence identity redundancy threshold. The
                                  percentage sequence identity redundancy
                                  threshold determines the redundancy
                                  calculation. If a pair of proteins achieve
                                  greater than this threshold the shortest
                                  sequence is discarded. (Any numeric value)
*  -minthreshold       float      [30.0] This option specifies the percentage
                                  sequence identity redundancy threshold
                                  (lower limit). The percentage sequence
                                  identity redundancy threshold determines the
                                  redundancy calculation. If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range the shortest sequence is discarded.
                                  (Any numeric value)
*  -maxthreshold       float      [90.0] This option specifies the percentage
                                  sequence identity redundancy threshold
                                  (upper limit). The percentage sequence
                                  identity redundancy threshold determines the
                                  redundancy calculation. If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range the shortest sequence is discarded.
                                  (Any numeric value)
   -gapopen            float      [10.0 for any sequence] The gap open penalty
                                  is the score taken away when a gap is
                                  created. The best value depends on the
                                  choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -gapextend          float      [0.5 for any sequence] The gap extension,
                                  penalty is added to the standard gap penalty
                                  for each base or residue in the gap. This
                                  is how long gaps are penalized. Usually you
                                  will expect a few long gaps rather than many
                                  short gaps, so the gap extension penalty
                                  should be lower than the gap penalty. An
                                  exception is where one or both sequences are
                                  single reads with possible sequencing
                                  errors in which case you would expect many
                                  single base gaps. You can get this result by
                                  setting the gap open penalty to zero (or
                                  very low) and using the gap extension
                                  penalty to control gap scoring. (Floating
                                  point number from 0.0 to 10.0)
  [-outseq]            seqoutall  [.] Sequence set(s)
                                  filename and optional format (output USA)
  [-redundantoutseq]   seqoutall  [.] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers:
   -datafile           matrixf    [EBLOSUM62 for protein, EDNAFULL for DNA]
                                  This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.

   Advanced (Unprompted) qualifiers:
   -feature            toggle     Sequence feature information will be
                                  retained if this option is set.
   -keepredundant      toggle     [N] This option specifies whether to retain
                                  redundant sequences. If this option is set a
                                  file of redundant sequences is written.

   Associated qualifiers:

   "-sequences" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   "-redundantoutseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   skipredundant reads any normal sequence USAs.

  Input files for usage example

  File: globins.fasta

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

Output file format

   skipredundant outputs a graph to the specified graphics device.
   outputs a report format file. The default format is ...

  Output files for usage example

  File: globins.keep

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

  File: globins.redundant

>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR

Data files

   For protein sequences EBLOSUM62 is used for the substitution matrix.
   For nucleotide sequence, EDNAFULL is used. Others can be specified.

   EMBOSS data files are distributed with the application and stored in
   the standard EMBOSS data directory, which is defined by the EMBOSS
   environment variable EMBOSS_DATA.

   To see the available EMBOSS data files, run:

% embossdata -showall

   To fetch one of the data files (for example 'Exxx.dat') into your
   current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

   Users can provide their own data files in their own directories.
   Project specific files can be put in the current directory, or for
   tidier directory listings in a subdirectory called ".embossdata".
   Files for all EMBOSS runs can be put in the user's home directory, or
   again in a subdirectory called ".embossdata".

   The directories are searched in the following order:
     * . (your current directory)
     * .embossdata (under your current directory)
     * ~/ (your home directory)
     * ~/.embossdata

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

   Program name Description
   aligncopy Reads and writes alignments
   aligncopypair Reads and writes pairs from alignments
   biosed Replace or delete sequence sections
   codcopy Copy and reformat a codon usage table
   cutseq Removes a section from a sequence
   degapseq Removes non-alphabetic (e.g. gap) characters from sequences
   descseq Alter the name or description of a sequence
   entret Retrieves sequence entries from flatfile databases and files
   extractalign Extract regions from a sequence alignment
   extractfeat Extract features from sequence(s)
   extractseq Extract regions from a sequence
   featcopy Reads and writes a feature table
   featreport Reads and writes a feature table
   listor Write a list file of the logical OR of two sets of sequences
   makenucseq Create random nucleotide sequences
   makeprotseq Create random protein sequences
   maskambignuc Masks all ambiguity characters in nucleotide sequences
   with N
   maskambigprot Masks all ambiguity characters in protein sequences with
   X
   maskfeat Write a sequence with masked features
   maskseq Write a sequence with masked regions
   newseq Create a sequence file from a typed-in sequence
   nohtml Remove mark-up (e.g. HTML tags) from an ASCII text file
   noreturn Remove carriage return from ASCII files
   nospace Remove all whitespace from an ASCII text file
   notab Replace tabs with spaces in an ASCII text file
   notseq Write to file a subset of an input stream of sequences
   nthseq Write to file a single sequence from an input stream of
   sequences
   pasteseq Insert one sequence into another
   revseq Reverse and complement a nucleotide sequence
   seqret Reads and writes (returns) sequences
   seqretsplit Reads sequences and writes them to individual files
   sizeseq Sort sequences by size
   skipseq Reads and writes (returns) sequences, skipping first few
   splitter Split sequence(s) into smaller sequences
   trimest Remove poly-A tails from nucleotide sequences
   trimseq Remove unwanted characters from start and end of sequence(s)
   trimspace Remove extra whitespace from an ASCII text file
   union Concatenate multiple sequences into a single sequence
   vectorstrip Removes vectors from the ends of nucleotide sequence(s)
   yank Add a sequence reference (a full USA) to a list file

Author(s)

   Jon Ison (jison  ebi.ac.uk)
   European Bioinformatics Institute, Wellcome Trust Genome Campus,
   Hinxton, Cambridge CB10 1SD, UK

History

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None
