Bio::Tools::GuessSeqFormat
Summary
Bio::Tools::GuessSeqFormat - Module for determining the sequence
format of the contents of a file, a string, or through a
filehandle.
Package variables
No package variables defined.
Inherit
Bio::Root::Root
Synopsis
    # To guess the format of a flat file, given a filename:
    my $guesser = Bio::Tools::GuessSeqFormat->new( -file => $filename );
    my $format  = $guesser->guess;

    # To guess the format from an already open filehandle:
    my $guesser = Bio::Tools::GuessSeqFormat->new( -fh => $filehandle );
    my $format  = $guesser->guess;
    # If the filehandle is seekable (STDIN isn't), it will be
    # returned to its original position.

    # To guess the format of one or several lines of text (with
    # embedded newlines):
    my $guesser = Bio::Tools::GuessSeqFormat->new( -text => $linesoftext );
    my $format  = $guesser->guess;

    # To create a Bio::Tools::GuessSeqFormat object and set the
    # filename, filehandle, or line to parse afterwards:
    my $guesser = Bio::Tools::GuessSeqFormat->new();
    $guesser->file($filename);
    $guesser->fh($filehandle);
    $guesser->text($linesoftext);

    # To guess in one go, given e.g. a filename:
    my $format = Bio::Tools::GuessSeqFormat->new( -file => $filename )->guess;
Description
Bio::Tools::GuessSeqFormat tries to guess the format ("swiss",
"pir", "fasta" etc.) of the sequence or MSA in a file, in a
scalar, or through a filehandle.
The guess() method of a Bio::Tools::GuessSeqFormat object will
examine the data, line by line, until it finds a line to which
only one format can be assigned. If no conclusive guess can be
made, undef is returned.
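The line-by-line rule described above can be sketched in a few lines of standalone Perl. This is only an illustration, not the module's own code: the three tests are copied from the helper subroutines listed on this page, while the %tests hash and the guess_lines() function are hypothetical names introduced for the example.

```perl
use strict;
use warnings;

# Three line tests copied from the helper subroutines of this module.
# The hash itself is an assumption made for this sketch.
my %tests = (
    fasta   => sub { my ($line, $lineno) = @_;
                     return ($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i)
                         || $line =~ /^>\s*\w/; },
    raw     => sub { my ($line, $lineno) = @_;
                     return $line =~ /^[A-Za-z\s]+$/; },
    genbank => sub { my ($line, $lineno) = @_;
                     return $lineno == 1 && $line =~ /^LOCUS /; },
);

# Hypothetical helper: return the format that is the only match for
# some line, or undef if no line is conclusive.
sub guess_lines {
    my @lines  = @_;
    my $lineno = 0;
    for my $line (@lines) {
        next if $line =~ /^\s*$/;    # blank lines are skipped, not counted
        ++$lineno;
        my @hits = grep { $tests{$_}->($line, $lineno) } sort keys %tests;
        return $hits[0] if @hits == 1;
    }
    return;
}

# Line 1 matches only the "fasta" test, so the guess is conclusive.
my $fasta_guess = guess_lines(">seq1 example", "ACGTACGT");

# Line 1 matches nothing and line 2 matches both "fasta" and "raw",
# so no conclusive guess can be made.
my $ambiguous = guess_lines("12345", "ACGTACGT");

print defined $fasta_guess ? $fasta_guess : 'undef', "\n";
print defined $ambiguous   ? $ambiguous   : 'undef', "\n";
```

The second call shows why a bare sequence line is rarely enough: several formats accept it, so the guess stays open until a more distinctive line appears.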
If the Bio::Tools::GuessSeqFormat object is given a filehandle
which is seekable, it will be restored to its original position
on return from the guess() method.
Tests are currently implemented for the following formats:
    * ACeDB ("ace")
    * Blast ("blast")
    * ClustalW ("clustalw")
    * Codata ("codata")
    * EMBL ("embl")
    * FastA sequence ("fasta")
    * FastQ sequence ("fastq")
    * FastXY/FastA alignment ("fastxy")
    * Game XML ("game")
    * GCG ("gcg")
    * GCG Blast ("gcgblast")
    * GCG FastA ("gcgfasta")
    * GDE ("gde")
    * Genbank ("genbank")
    * Genscan ("genscan")
    * GFF ("gff")
    * HMMER ("hmmer")
    * PAUP/NEXUS ("nexus")
    * Phrap assembly file ("phrap")
    * NBRF/PIR ("pir")
    * Mase ("mase")
    * Mega ("mega")
    * GCG/MSF ("msf")
    * Pfam ("pfam")
    * Phylip ("phylip")
    * Prodom ("prodom")
    * Raw ("raw")
    * RSF ("rsf")
    * Selex ("selex")
    * Stockholm ("stockholm")
    * Swissprot ("swiss")
    * Tab ("tab")
    * Variant Call Format ("vcf")
Methods description
new
 Title     : new
 Usage     : $guesser = Bio::Tools::GuessSeqFormat->new( ... );
 Function  : Creates a new object.
 Example   : See SYNOPSIS.
 Returns   : A new object.
 Arguments : -file  The filename of the file whose format is to
                    be guessed, or
             -fh    An already opened filehandle from which a text
                    stream may be read, or
             -text  A scalar containing one or several lines of
                    text with embedded newlines.
If more than one of the above arguments is given, they are tested in the order -text, -file, -fh, and the first one that is defined will be used.
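The precedence rule can be illustrated with a small standalone sketch. The pick_source() function and the file name 'seqs.fa' are hypothetical and exist only for this example; the order of the checks mirrors the if/elsif chain in the guess() code listed on this page.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the order in which guess() looks at
# the stored arguments: -text first, then -file, then -fh.
sub pick_source {
    my %args = @_;
    for my $key (qw(-text -file -fh)) {
        return $key if defined $args{$key};
    }
    return;
}

# Both a filename and a text are given; the text takes precedence.
my $chosen = pick_source(-file => 'seqs.fa', -text => ">seq1\nACGT\n");
print "$chosen\n";
```

Note that this applies to arguments passed to new(); the file(), fh() and text() setters instead clear the other two sources, so only one is ever active after calling them.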
file
 Title     : file
 Usage     : $guesser->file($filename);
             $filename = $guesser->file;
 Function  : Gets or sets the current filename associated with
             an object.
 Returns   : The filename now associated with the object.
 Arguments : The filename of the file whose format is to be
             guessed.
A call to this method with a filename will clear the current filehandle and the current lines of text associated with the object.
fh
 Title     : fh
 Usage     : $guesser->fh($filehandle);
             $filehandle = $guesser->fh;
 Function  : Gets or sets the current filehandle associated with
             an object.
 Returns   : The filehandle now associated with the object.
 Arguments : An already opened filehandle from which a text
             stream may be read.
A call to this method with a filehandle will clear the current filename and the current lines of text associated with the object.
text
 Title     : text
 Usage     : $guesser->text($linesoftext);
             $linesoftext = $guesser->text;
 Function  : Gets or sets the current text associated with an
             object.
 Returns   : The lines of text now associated with the object.
 Arguments : A scalar containing one or several lines of text,
             including embedded newlines.
A call to this method with a text will clear the current filename and the current filehandle associated with the object.
guess
 Title     : guess
 Usage     : $format = $guesser->guess;
             @format = $guesser->guess; # if given a line of text
 Function  : Guesses the format of the data associated with the
             object.
 Returns   : A format string such as "swiss" or "pir".  If a
             format cannot be found, undef is returned.
 Arguments : None.
If the object is associated with a filehandle and that filehandle is seekable, the filehandle will be returned to its original position before the method returns.
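The save-and-restore behaviour can be sketched with Perl's built-in tell() and seek(). This is a minimal standalone illustration that uses an in-memory filehandle in place of a real file; the variable names and sample data are made up for the example, and the module itself also handles IO::Seekable objects via getpos()/setpos().

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a seekable file.
my $data = "first line\n>seq1\nACGT\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";

my $first     = <$fh>;      # the caller has already consumed one line
my $start_pos = tell($fh);  # remember where the caller left off

# Read ahead, as a format guesser would.
my @peeked;
while (defined(my $line = <$fh>)) {
    push @peeked, $line;
}

# Rewind to the remembered position (whence 0 means "from the start").
seek($fh, $start_pos, 0) or die "Failed resetting the filehandle: $!";

# The caller continues with the line that follows "first line".
my $next = <$fh>;
print $next;
```

A stream such as STDIN fed from a pipe cannot be rewound this way, which is why the restore step only applies to seekable filehandles.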
_possibly_ace
From bioperl test data, and from
"http://www.isrec.isb-sib.ch/DEA/module8/B_Stevenson/Practicals/transcriptome_recon/transcriptome_recon.html".
_possibly_blast
From various blast results.
_possibly_bowtie
Contributed by kortsch.
_possibly_clustalw
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_codata
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_embl
From
"http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3.3".
_possibly_fasta
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_fastq
From bioperl test data.
_possibly_fastxy
From bioperl test data.
_possibly_game
From bioperl test data.
_possibly_gcg
From bioperl, Bio::SeqIO::gcg.
_possibly_gcgblast
From bioperl test data.
_possibly_gcgfasta
From bioperl test data.
_possibly_gde
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_genbank
From "http://www.ebi.ac.uk/help/formats.html".
Format of [apparently optional] file header from
"http://www.umdnj.edu/rcompweb/PA/Notes/GenbankFF.htm". (TODO: dead link)
_possibly_genscan
From bioperl test data.
_possibly_gff
From bioperl test data.
_possibly_hmmer
From bioperl test data.
_possibly_nexus
From "http://paup.csit.fsu.edu/nfiles.html".
_possibly_mase
From bioperl test data.
More detail from "http://www.umdnj.edu/rcompweb/PA/Notes/GenbankFF.htm" (TODO: dead link)
_possibly_mega
From the ensembl browser (AlignView data export).
_possibly_msf
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_phrap
From "http://biodata.ccgb.umn.edu/docs/contigimage.html". (TODO: dead link)
From "http://genetics.gene.cwru.edu/gene508/Lec6.htm". (TODO: dead link)
From bioperl test data ("*.ace.1" files).
_possibly_pir
From "http://www.ebi.ac.uk/help/formats.html".
The ".,()" spotted in bioperl test data.
_possibly_pfam
From bioperl test data.
_possibly_phylip
From "http://www.ebi.ac.uk/help/formats.html".  Initial space
allowed on first line (spotted in ensembl AlignView exported
data).
_possibly_prodom
From "http://prodom.prabi.fr/prodom/current/documentation/data.php".
_possibly_raw
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_rsf
From "http://www.ebi.ac.uk/help/formats.html".
_possibly_selex
From "http://www.ebc.ee/WWW/hmmer2-html/node27.html".
Assuming presence of Selex file header.  Data exported by
Bioperl in Pfam and Selex formats are identical, but a Pfam file
only holds one alignment.
_possibly_stockholm
From bioperl test data.
_possibly_swiss
From "http://ca.expasy.org/sprot/userman.html#entrystruc".
_possibly_tab
Contributed by Heikki.
_possibly_vcf
From "http://www.1000genomes.org/wiki/analysis/vcf4.0".
Assumptions made about sanity: the format and date lines are lines 1 and 2
respectively.  This is not specified in the format document.
Methods code
new
sub new {
    my $class = shift;
    my @args  = @_;

    my $self = $class->SUPER::new(@args);

    my $attr;
    my $value;

    while (@args) {
        $attr = shift @args;
        $attr = lc $attr;
        $value = shift @args;
        $self->{$attr} = $value;
    }

    return $self;
}
file
sub file {
    # Sets and/or returns the filename to use.
    my $self = shift;
    my $file = shift;

    if (defined $file) {
        # Set the active filename, and clear the filehandle and
        # text line, if present.
        $self->{-file} = $file;
        $self->{-fh} = $self->{-text} = undef;
    }

    return $self->{-file};
}
fh
sub fh {
    # Sets and/or returns the filehandle to use.
    my $self = shift;
    my $fh = shift;

    if (defined $fh) {
        # Set the active filehandle, and clear the filename and
        # text line, if present.
        $self->{-fh} = $fh;
        $self->{-file} = $self->{-text} = undef;
    }

    return $self->{-fh};
}
text
sub text {
    # Sets and/or returns the text lines to use.
    my $self = shift;
    my $text = shift;

    if (defined $text) {
        # Set the active text lines, and clear the filehandle
        # and filename, if present.
        $self->{-text} = $text;
        $self->{-fh} = $self->{-file} = undef;
    }

    return $self->{-text};
}
guess
sub guess {
    my $self = shift;

    foreach my $fmt_key (keys %formats) {
        $formats{$fmt_key}{fmt_string} = $fmt_key;
    }

    my $fh;
    my $start_pos;
    my @lines;
    if (defined $self->{-text}) {
        # Break the text into separate lines.
        @lines = split /\n/, $self->{-text};
    } elsif (defined $self->{-file}) {
        # If given a filename, open the file.
        open($fh, $self->{-file}) or
            $self->throw("Can not open '$self->{-file}' for reading: $!");
    } elsif (defined $self->{-fh}) {
        # If given a filehandle, figure out if it's a plain GLOB
        # or an IO::Handle which is seekable.  In the case of a
        # GLOB, we'll assume it's seekable.  Get the current
        # position in the stream.
        $fh = $self->{-fh};
        if (ref $fh eq 'GLOB') {
            $start_pos = tell($fh);
        } elsif (UNIVERSAL::isa($fh, 'IO::Seekable')) {
            $start_pos = $fh->getpos();
        }
    }

    my $done = 0;
    my $lineno = 0;
    my $fmt_string;
    while (!$done) {
        my $line;       # The next line of the file.
        my $match = 0;  # Number of possible formats of this line.

        if (defined $self->{-text}) {
            last if (scalar @lines == 0);
            $line = shift @lines;
        } else {
            last if (!defined($line = <$fh>));
        }

        next if ($line =~ /^\s*$/);  # Skip white and empty lines.
        chomp($line);
        $line =~ s/\r$//;            # Fix for DOS files on Unix.
        ++$lineno;

        while (my ($fmt_key, $fmt) = each (%formats)) {
            if ($fmt->{test}($line, $lineno)) {
                ++$match;
                $fmt_string = $fmt->{fmt_string};
            }
        }

        # We're done if there was only one match.
        $done = ($match == 1);
    }

    if (defined $self->{-file}) {
        # Close the file we opened.
        close($fh);
    } elsif (ref $fh eq 'GLOB') {
        # Try seeking to the start position.
        seek($fh, $start_pos, 0) || $self->throw("Failed resetting the ".
                                        "filehandle; IO error occurred");
    } elsif (defined $fh && $fh->can('setpos')) {
        # Seek to the start position.
        $fh->setpos($start_pos);
    }

    return ($done ? $fmt_string : undef);
}
_possibly_ace
sub _possibly_ace {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^(?:Sequence|Peptide|DNA|Protein) [":]/);
}
_possibly_blast
sub _possibly_blast {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 &&
        $line =~ /^[[:upper:]]*BLAST[[:upper:]]*.*\[.*\]$/);
}
_possibly_bowtie
sub _possibly_bowtie {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^[[:graph:]]+\t[-+]\t[[:graph:]]+\t\d+\t([[:alpha:]]+)\t([[:graph:]]+)\t\d+\t[[:graph:]]?/)
            && length($1)==length($2);
}
_possibly_clustalw
sub _possibly_clustalw {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /CLUSTAL/);
}
_possibly_codata
sub _possibly_codata {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^ENTRY/) ||
            ($lineno == 2 && $line =~ /^SEQUENCE/) ||
            $line =~ m{^(?:ENTRY|SEQUENCE|///)});
}
_possibly_embl
sub _possibly_embl {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID   / && $line =~ /BP\.$/);
}
_possibly_fasta
sub _possibly_fasta {
    my ($line, $lineno) = (shift, shift);
    return (($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i) ||
            $line =~ /^>\s*\w/);
}
_possibly_fastq
sub _possibly_fastq {
    my ($line, $lineno) = (shift, shift);
    return ( ($lineno == 1 && $line =~ /^@/) ||
	     ($lineno == 3 && $line =~ /^\+/) );
}
_possibly_fastxy
sub _possibly_fastxy {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^ FAST(?:XY|A)/) ||
            ($lineno == 2 && $line =~ /^ version \d/));
}
_possibly_game
sub _possibly_game {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^<!DOCTYPE game/);
}
_possibly_gcg
sub _possibly_gcg {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /Length: .*Type: .*Check: .*\.\.$/);
}
_possibly_gcgblast
sub _possibly_gcgblast {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^!!SEQUENCE_LIST/) ||
            ($lineno == 2 &&
             $line =~ /^[[:upper:]]*BLAST[[:upper:]]*.*\[.*\]$/));
}
_possibly_gcgfasta
sub _possibly_gcgfasta {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^!!SEQUENCE_LIST/) ||
            ($lineno == 2 && $line =~ /FASTA/));
}
_possibly_gde
sub _possibly_gde {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^[{}]$/ ||
            $line =~ /^(?:name|longname|sequence-ID|
                          creation-date|direction|strandedness|
                          type|offset|group-ID|creator|descrip|
                          comment|sequence)/x);
}
_possibly_genbank
sub _possibly_genbank {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /GENETIC SEQUENCE DATA BANK/) ||
            ($lineno == 1 && $line =~ /^LOCUS /) ||
            ($lineno == 2 && $line =~ /^DEFINITION /) ||
            ($lineno == 3 && $line =~ /^ACCESSION /));
}
_possibly_genscan
sub _possibly_genscan {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^GENSCAN.*Date.*Time/) ||
            ($line =~ /^(?:Sequence\s+\w+|Parameter matrix|Predicted genes)/));
}
_possibly_gff
sub _possibly_gff {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^##gff-version/) ||
            ($lineno == 2 && $line =~ /^##date/));
}
_possibly_hmmer
sub _possibly_hmmer {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 2 && $line =~ /^HMMER/) ||
            ($lineno == 3 &&
             $line =~ /Washington University School of Medicine/));
}
_possibly_nexus
sub _possibly_nexus {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^#NEXUS/);
}
_possibly_mase
sub _possibly_mase {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^;;/) ||
            ($lineno > 1 && $line =~ /^;[^;]?/));
}
_possibly_mega
sub _possibly_mega {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^#mega$/);
}
_possibly_msf
sub _possibly_msf {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ m{^//} ||
            $line =~ /MSF:.*Type:.*Check:|Name:.*Len:/);
}
_possibly_phrap
sub _possibly_phrap {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^(?:AS\ |CO\ Contig|BQ|AF\ |BS\ |RD\ |
                          QA\ |DS\ |RT\{)/x);
}
_possibly_pir
sub _possibly_pir {
    # "NBRF/PIR" (?)
    my ($line, $lineno) = (shift, shift);
    return (($lineno != 1 && $line =~ /^[\sA-IK-NP-Z.,()]+\*?$/i) ||
            $line =~ /^>(?:P1|F1|DL|DC|RL|RC|N3|N1);/);
}
_possibly_pfam
sub _possibly_pfam {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ m{^\w+/\d+-\d+\s+[A-IK-NP-Z.]+}i);
}
_possibly_phylip
sub _possibly_phylip {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^\s*\d+\s\d+/) ||
            ($lineno == 2 && $line =~ /^\w\s+[A-IK-NP-Z\s]+/) ||
            ($lineno == 3 && $line =~ /(?:^\w\s+[A-IK-NP-Z\s]+|\s+[A-IK-NP-Z\s]+)/)
           );
}
_possibly_prodom
sub _possibly_prodom {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID   / && $line =~ /\d+ seq\.$/);
}
_possibly_raw
sub _possibly_raw {
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^[A-Za-z\s]+$/);
}
_possibly_rsf
sub _possibly_rsf {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^!!RICH_SEQUENCE/) ||
            $line =~ /^[{}]$/ ||
            $line =~ /^(?:name|type|longname|
                          checksum|creation-date|strand|sequence)/x);
}
_possibly_selex
sub _possibly_selex {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^#=ID /) ||
            ($lineno == 2 && $line =~ /^#=AC /) ||
            ($line =~ /^#=SQ /));
}
_possibly_stockholm
sub _possibly_stockholm {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^# STOCKHOLM/) ||
            $line =~ /^#=(?:GF|GS) /);
}
_possibly_swiss
sub _possibly_swiss {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID   / && $line =~ /AA\.$/);
}
_possibly_tab
sub _possibly_tab {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^[^\t]+\t[^\t]+/) ;
}
_possibly_vcf
sub _possibly_vcf {
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /##fileformat=VCFv/) ||
            ($lineno == 2 && $line =~ /##fileDate=/));
}



General documentation
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this and
other Bioperl modules. Send your comments and suggestions
preferably to one of the Bioperl mailing lists. Your
participation is much appreciated.
  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
Support
Please direct usage questions or support issues to the mailing list:
  bioperl-l@bioperl.org
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us
keep track of the bugs and their resolution. Bug reports can be
submitted via the web:
  https://redmine.open-bio.org/projects/bioperl/
AUTHOR
Andreas Kähäri, andreas.kahari@ebi.ac.uk
CONTRIBUTORS
Heikki Lehväslaiho, heikki-at-bioperl-dot-org
Mark A. Jensen, maj-at-fortinbras-dot-us
HELPER SUBROUTINES
All helper subroutines will, given a line of text and the line
number of the same line, return 1 if the line possibly is from a
file of the type that they perform a test of.
A zero return value does not mean that the line is not part
of a certain type of file, just that the test did not find any
characteristics of that type of file in the line.
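As an illustration of this contract, the sketch below calls the FastQ helper, copied from the code listing on this page, on each line of a small record. The sample record and the @verdicts array are made up for the example, and the helper's true/false result is mapped to 1/0 for printing.

```perl
use strict;
use warnings;

# Copied from the module's FastQ test.
sub _possibly_fastq {
    my ($line, $lineno) = (shift, shift);
    return ( ($lineno == 1 && $line =~ /^@/) ||
             ($lineno == 3 && $line =~ /^\+/) );
}

# A made-up four-line FastQ record.
my @record = ('@read1', 'ACGT', '+', 'IIII');

my @verdicts;
for my $i (0 .. $#record) {
    # Line numbers passed to the helpers start at 1.
    push @verdicts, _possibly_fastq($record[$i], $i + 1) ? 1 : 0;
}

print "@verdicts\n";    # prints "1 0 1 0"
```

Lines 2 and 4 belong to the same FastQ record, yet the helper reports false for them: the test simply finds nothing characteristic of the format on those lines.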