                           CATHPARSE documentation



CONTENTS

   1.0 SUMMARY
   2.0 INPUTS & OUTPUTS
   3.0 INPUT FILE FORMAT
   4.0 OUTPUT FILE FORMAT
   5.0 DATA FILES
   6.0 USAGE
   7.0 KNOWN BUGS & WARNINGS
   8.0 NOTES
   9.0 DESCRIPTION
   10.0 ALGORITHM
   11.0 RELATED APPLICATIONS
   12.0 DIAGNOSTIC ERROR MESSAGES
   13.0 AUTHORS
   14.0 REFERENCES

1.0 SUMMARY

   Generate DCF file from raw CATH files

2.0 INPUTS & OUTPUTS

   CATHPARSE parses the CATH classification files, e.g. caths.list.v2.4,
   domlist.v2.4 and CAT.names.all.v2.4. These files are available by
   anonymous ftp from ftp.biochem.ucl.ac.uk (e.g. /pub/cathdata/v2.4). The
   format of these files is explained in the README file available there.
   CATHPARSE writes the CATH classification to a DCF file (EMBL-like
   format). No changes are made to the data other than changing the format
   in which it is held. The input and output files are specified by the
   user.

3.0 INPUT FILE FORMAT

   Excerpts from the raw CATH classification files, of the types
   caths.list.vX.X (Figure 1), domlist.vX.X (Figure 2) and
   CAT.names.all.vX.X (Figure 3), are shown below. The format of these
   files is explained in the CATH README file available by anonymous ftp
   from ftp.biochem.ucl.ac.uk (e.g. /pub/cathdata/v2.4).

  Input files for usage example

  File: caths.list.small

1cuk03    1  10   8  10   1   1   1  48 1.900
1hjp03    1  10   8  10   1   1   2  44 2.500

  File: domlist.small

1cuk00  D03   F00    1  0   1 - 0  66 -    1  0  67 - 0 142 -    1  0 156 - 0 203 -
1hjp00 D03 F01  1  0    1 - 0   66 -  1  0   67 - 0  158 -  1  0  159 - 0  202 -  0  203 - 0  203 - (1)

  File: CAT.names.all.small

1.10.8           1cuk03            :Helicase, Ruva Protein, domain 3
1.10.8.10         1cuk03    :DNA helicase RuvA subunit, C-terminal domain
0001             2ccyA0        :Mainly Alpha
0001.0010        1eca00          :Orthogonal Bundle
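
   The following Python sketch shows one way the caths.list and domlist
   excerpts above could be read. It is not the CATHPARSE implementation.
   The column meanings are assumptions inferred from the example files and
   the example output in this document, not taken from the CATH README:
   caths.list is assumed to hold the domain identifier, seven
   classification numbers, the residue count and a final numeric column
   (here called resolution); domlist is assumed to hold a chain name, Dnn
   and Fnn counts, then per domain a segment count followed by six tokens
   per segment. All function and type names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class CathListEntry(NamedTuple):
    domain_id: str            # e.g. "1cuk03"
    numbers: Tuple[int, ...]  # assumed order: CL, AR, TP, SF, FA, NI, IF
    n_residues: int
    resolution: float         # assumed meaning of the last column


def parse_caths_list_line(line: str) -> CathListEntry:
    """Split one caths.list line into its ten whitespace-separated columns."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return CathListEntry(
        domain_id=fields[0],
        numbers=tuple(int(f) for f in fields[1:8]),
        n_residues=int(fields[8]),
        resolution=float(fields[9]),
    )


# (chain, start, end); positions are kept as strings, not integers.
Segment = Tuple[str, str, str]


def parse_domlist_line(line: str) -> Tuple[str, List[List[Segment]]]:
    """Return the chain name and, for each domain, its list of segments.

    Assumed layout: name, Dnn, Fnn, then for each domain a segment count
    followed by six tokens per segment (chain, start, insert code, chain,
    end, insert code). Trailing fragment data is ignored.
    """
    tokens = line.split()
    n_domains = int(tokens[1][1:])  # "D03" -> 3
    pos = 3                         # skip name, Dnn and Fnn
    domains: List[List[Segment]] = []
    for _ in range(n_domains):
        n_segments = int(tokens[pos])
        pos += 1
        segments: List[Segment] = []
        for _ in range(n_segments):
            chain, start, _ins1, _chain2, end, _ins2 = tokens[pos:pos + 6]
            segments.append((chain, start, end))
            pos += 6
        domains.append(segments)
    return tokens[0], domains


entry = parse_caths_list_line("1cuk03    1  10   8  10   1   1   1  48 1.900")
_, domains = parse_domlist_line(
    "1cuk00 D03 F00 1 0 1 - 0 66 - 1 0 67 - 0 142 - 1 0 156 - 0 203 -"
)
domain_number = int(entry.domain_id[-1])  # final character of the identifier
print(entry.numbers, entry.n_residues)
print(domains[domain_number - 1])
```

   For the 1cuk03 example this selects the third domain of chain 1cuk00,
   whose single segment runs from residue 156 to residue 203.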

4.0 OUTPUT FILE FORMAT

   An example of the DCF output file is shown in Figure 4. The records
   used to describe an entry are as follows. Records (5) to (8) are used
   to describe the position of the domain in the CATH hierarchy.
     * (1) ID - Domain identifier code. This is a 6-character code that
       uniquely identifies the domain in CATH. The first four characters
       are the PDB identifier code, the fifth character is the PDB chain
       identifier to which the domain belongs and the final character is
       the number of the domain in the chain (for chains comprising more
       than one domain). This character is '0' if the chain comprises a
       single domain only.
     * (2) EN - PDB identifier code. This is the 4-character PDB
       identifier code of the PDB entry containing the domain.
     * (3) TY - domain type. "CATH" or "SCOP" is given ("CATH" for DCF
       files generated by using CATHPARSE).
     * (4) CI - CATH Classification Numbers. The integers preceding the
       codes CL, AR, TP, SF, FA, NI, IF are the CATH classification
       numbers for CLass, ARchitecture, ToPology, Homologous SuperFamily,
       FAmily, Near Identical family and Identical Family respectively.
       These numbers uniquely identify the appropriate node in the CATH
       parsable files.
     * (5) CL - Class. The text is taken verbatim from
       CAT.names.all.vX.X.
     * (6) AR - Architecture. The text is taken verbatim from
       CAT.names.all.vX.X.
     * (7) TP - Topology. The text is taken verbatim from
       CAT.names.all.vX.X.
     * (8) SF - Homologous Superfamily. The text is taken verbatim
       from CAT.names.all.vX.X.
     * (9) DS - Sequence of the domain according to the PDB file. This
       sequence is taken from the domain CCF file (clean coordinate file)
       generated by DOMAINER. The DS record will only be present if the
       DCF file has been processed using DOMAINSEQS.
     * (10) NR - Number of residues in the domain.
     * (11) NC - Number of segments comprising the domain. All domains in
       CATH are from single chains. If the number of segments is greater
       than 1, then the domain entry will have a section containing a CN
       and a CH record (see below) for each segment.
     * (12) CN - Segment number. The number given in brackets after this
       record indicates the start of the data for the relevant segment.
     * (13) CH - Domain definition. The character given before CHAIN is
       the PDB chain identifier, the strings before START and END give the
       start and end positions respectively of the domain in the PDB file
       (a '.' is given in cases where a position was not specified). Note
       that the start and end positions refer to residue numbering given
       in the original PDB file and therefore must be treated as strings.
     * (14) XX - used for spacing.
     * (15) // - used to delimit records for a domain.
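
   The record layout described above can be read back with a short Python
   sketch. It is illustrative only and is not part of CATHPARSE. It
   assumes that each line starts with a two-character record code, that
   XX lines carry no data, and that // closes an entry. The function
   names are hypothetical. Start and end positions are deliberately kept
   as strings, as the CH description requires.

```python
import re
from typing import Dict, List, Tuple

# Matches a CH value such as "0 CHAIN; 156 START; 203 END;".
CH_PATTERN = re.compile(r"^(\S+) CHAIN; (\S+) START; (\S+) END;$")


def parse_dcf(text: str) -> List[Dict[str, List[str]]]:
    """Group DCF lines into entries keyed by two-character record code."""
    entries: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):       # end of the current domain entry
            if current:
                entries.append(current)
            current = {}
            continue
        code, value = line[:2], line[2:].strip()
        if code == "XX":                # spacer record, no data
            continue
        # A code may repeat (e.g. CN and CH once per segment), so keep a list.
        current.setdefault(code, []).append(value)
    if current:                         # tolerate a missing final "//"
        entries.append(current)
    return entries


def split_domain_id(domain_id: str) -> Tuple[str, str, str]:
    """Split a 6-character ID into (PDB code, chain, domain number)."""
    if len(domain_id) != 6:
        raise ValueError(f"expected a 6-character ID, got {domain_id!r}")
    return domain_id[:4], domain_id[4], domain_id[5]


def parse_ch(value: str) -> Tuple[str, str, str]:
    """Return (chain, start, end) from a CH record value, all as strings."""
    match = CH_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognised CH record: {value!r}")
    return match.group(1), match.group(2), match.group(3)
```

   For example, split_domain_id("1CUK03") gives the PDB code "1CUK", the
   chain identifier "0" and the domain number "3".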

  Output files for usage example

  File: Ecath.dat

ID   1CUK03
XX
EN   1CUK
XX
TY   CATH
XX
CI   1 CL; 10 AR; 8 TP; 10 SF; 1 FA; 1 NI;1 IF;
XX
CL   Mainly Alpha
XX
AR   Orthogonal Bundle
XX
TP   Helicase, Ruva Protein, domain 3
XX
SF   DNA helicase RuvA subunit, C-terminal domain
XX
NR   48
XX
NC   1
XX
CN   [1]
XX
CH   0 CHAIN; 156 START; 203 END;
//
ID   1HJP03
XX
EN   1HJP
XX
TY   CATH
XX
CI   1 CL; 10 AR; 8 TP; 10 SF; 1 FA; 1 NI;2 IF;
XX
CL   Mainly Alpha
XX
AR   Orthogonal Bundle
XX
TP   Helicase, Ruva Protein, domain 3
XX
SF   DNA helicase RuvA subunit, C-terminal domain
XX
NR   44
XX
NC   1
XX
CN   [1]
XX
CH   0 CHAIN; 159 START; 202 END;
//

  File: cathparse.log

1.10.8.10
1.10.8
0001.0010
0001
1.10.8.10
1.10.8
0001.0010
0001
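
   The identifiers in the log excerpt above have the same form as the
   first column of CAT.names.all.small. Assuming they are the keys used
   to look up the SF, TP, AR and CL text for each domain, the lookup
   could be sketched in Python as follows. The zero-padded form of the
   class and architecture keys (0001, 0001.0010) is inferred only from
   the example files in this document, and the function names are
   hypothetical.

```python
from typing import Dict, Iterable, Tuple


def parse_names_line(line: str) -> Tuple[str, str, str]:
    """Split a CAT.names.all line into (node id, representative, name)."""
    head, sep, name = line.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in line: {line!r}")
    node_id, representative = head.split()
    return node_id, representative, name.strip()


def load_names(lines: Iterable[str]) -> Dict[str, str]:
    """Map each node identifier to its descriptive text."""
    names: Dict[str, str] = {}
    for line in lines:
        if line.strip():
            node_id, _representative, name = parse_names_line(line)
            names[node_id] = name
    return names


def node_keys(cl: int, ar: int, tp: int, sf: int) -> Dict[str, str]:
    """Build lookup keys in the forms seen in the example files (assumed)."""
    return {
        "CL": f"{cl:04d}",
        "AR": f"{cl:04d}.{ar:04d}",
        "TP": f"{cl}.{ar}.{tp}",
        "SF": f"{cl}.{ar}.{tp}.{sf}",
    }


names = load_names([
    "1.10.8           1cuk03            :Helicase, Ruva Protein, domain 3",
    "1.10.8.10         1cuk03    :DNA helicase RuvA subunit, C-terminal domain",
    "0001             2ccyA0        :Mainly Alpha",
    "0001.0010        1eca00          :Orthogonal Bundle",
])
for record, key in node_keys(1, 10, 8, 10).items():
    print(record, key, names[key])
```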

5.0 DATA FILES

   None.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

Generate DCF file from raw CATH files.
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-listfile]          infile     [caths.list.v2.4] This option specifies the
                                  name of raw CATH classification file
                                  (caths.list.vX.X) (input). The raw CATH
                                  parsable files (classification and
                                  description files) available from
                                  ftp.biochem.ucl.ac.uk (/pub/cathdata/v2.4).
  [-domfile]           infile     [domlist.v2.4] This option specifies the
                                  name of raw CATH classification file
                                  (domlist.vX.X) (input). The raw CATH
                                  parsable files (classification and
                                  description files) available from
                                  ftp.biochem.ucl.ac.uk (/pub/cathdata/v2.4).
  [-namesfile]         infile     [CAT.names.all.v2.4] This option specifies
                                  the name of raw CATH classification file
                                  (CAT.names.all.vX.X) (input). The raw CATH
                                  parsable files (classification and
                                  description files) available from
                                  ftp.biochem.ucl.ac.uk (/pub/cathdata/v2.4).
  [-outfile]           outfile    [Ecath.dat] This option specifies the name
                                  of CATH DCF file (domain classification
                                  file) (output). A 'domain classification
                                  file' contains classification and other data
                                  for domains from SCOP or CATH, in DCF
                                  format (EMBL-like). The files are generated
                                  by using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -logfile            outfile    [cathparse.log] This option specifies the
                                  name of the CATHPARSE log file.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory4        string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit
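
   As an illustration of how the qualifiers listed above fit together, the
   Python sketch below assembles a command line from them. It only builds
   and prints the argument list and does not run anything. The executable
   name "cathparse" (lower case) and the helper name are assumptions; the
   qualifier names and default file names are those shown in the listing.

```python
import shlex
from typing import List, Optional


def build_cathparse_command(
    listfile: str,
    domfile: str,
    namesfile: str,
    outfile: str,
    logfile: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list using the qualifiers documented above."""
    cmd = [
        "cathparse",  # assumed executable name
        "-listfile", listfile,
        "-domfile", domfile,
        "-namesfile", namesfile,
        "-outfile", outfile,
    ]
    if logfile is not None:
        cmd += ["-logfile", logfile]
    cmd.append("-auto")  # general qualifier: turn off prompts
    return cmd


cmd = build_cathparse_command(
    "caths.list.v2.4",
    "domlist.v2.4",
    "CAT.names.all.v2.4",
    "Ecath.dat",
    logfile="cathparse.log",
)
print(shlex.join(cmd))
```

   The resulting list could be passed to subprocess.run on a system where
   the program is installed.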


