Bio::Tools
GuessSeqFormat
Summary
Bio::Tools::GuessSeqFormat - Module for determining the sequence
format of the contents of a file, a string, or through a
filehandle.
Package variables
No package variables defined.
Inherit
Synopsis
# To guess the format of a flat file, given a filename:
my $guesser = new Bio::Tools::GuessSeqFormat( -file => $filename );
my $format = $guesser->guess;
# To guess the format from an already open filehandle:
my $guesser = new Bio::Tools::GuessSeqFormat( -fh => $filehandle );
my $format = $guesser->guess;
# If the filehandle is seekable (STDIN isn't), it will be
# returned to its original position.
# To guess the format of one or several lines of text (with
# embedded newlines):
my $guesser = new Bio::Tools::GuessSeqFormat( -text => $linesoftext );
my $format = $guesser->guess;
# To create a Bio::Tools::GuessSeqFormat object and set the
# filename, filehandle, or line to parse afterwards:
my $guesser = new Bio::Tools::GuessSeqFormat;
$guesser->file($filename);
$guesser->fh($filehandle);
$guesser->text($linesoftext);
# To guess in one go, given e.g. a filename:
my $format = new Bio::Tools::GuessSeqFormat( -file => $filename )->guess;
Description
Bio::Tools::GuessSeqFormat tries to guess the format ("swiss",
"pir", "fasta" etc.) of the sequence or MSA in a file, in a
scalar, or through a filehandle.
The guess() method of a Bio::Tools::GuessSeqFormat object will
examine the data, line by line, until it finds a line to which
only one format can be assigned. If no conclusive guess can be
made, undef is returned.
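The line-by-line elimination described above can be sketched in plain Perl. This is a minimal illustration, not the module's implementation: the %tests table and the guess_from_text() function are hypothetical names introduced here, and only three of the format tests are reproduced.

```perl
use strict;
use warnings;

# Hypothetical, much-reduced test table (an assumption for this sketch).
# Each test receives a line and its line number and returns true or false.
my %tests = (
    fasta    => sub { my ($line, $lineno) = @_; $line =~ /^>\s*\w/ },
    nexus    => sub { my ($line, $lineno) = @_; $lineno == 1 && $line =~ /^#NEXUS/ },
    clustalw => sub { my ($line, $lineno) = @_; $lineno == 1 && $line =~ /CLUSTAL/ },
);

# Return the name of the only format matched by some line, or undef
# if no line singles out exactly one format.
sub guess_from_text {
    my ($text) = @_;
    my $lineno = 0;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;    # blank lines are skipped and not counted
        $line =~ s/\r$//;            # tolerate DOS line endings
        ++$lineno;
        my @hits = grep { $tests{$_}->($line, $lineno) } sort keys %tests;
        return $hits[0] if @hits == 1;    # exactly one candidate: conclusive
    }
    return undef;
}

my $fasta_guess = guess_from_text(">seq1 test\nACGT\n");
my $no_guess    = guess_from_text("nothing recognisable here\n");
```

A line that matches no test, or more than one, is inconclusive, and the loop simply moves on to the next line.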
If the Bio::Tools::GuessSeqFormat object is given a filehandle
which is seekable, it will be restored to its original position
on return from the guess() method.
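Restoring a seekable filehandle can be illustrated with Perl's built-in tell() and seek(). The sketch below uses an in-memory filehandle and made-up data. It only shows the save-and-restore idea and assumes nothing about the module beyond what is stated above.

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";

my $first     = <$fh>;       # the caller has already consumed one line
my $start_pos = tell($fh);   # remember where the caller was

my @peeked;
while (defined(my $line = <$fh>)) {    # read ahead, as a guesser would
    push @peeked, $line;
}

seek($fh, $start_pos, 0) or die "Cannot seek: $!";    # 0 means SEEK_SET
my $next = <$fh>;            # the caller continues where it left off
close($fh);
```

For a handle that cannot seek, such as STDIN reading from a pipe, the lines consumed while guessing are not available to later reads.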
Tests are currently implemented for the following formats:
* ACeDB ("ace")
* Blast ("blast")
* ClustalW ("clustalw")
* Codata ("codata")
* EMBL ("embl")
* FastA sequence ("fasta")
* FastXY/FastA alignment ("fastxy")
* Game XML ("game")
* GCG ("gcg")
* GCG Blast ("gcgblast")
* GCG FastA ("gcgfasta")
* GDE ("gde")
* Genbank ("genbank")
* Genscan ("genscan")
* GFF ("gff")
* HMMER ("hmmer")
* PAUP/NEXUS ("nexus")
* Phrap assembly file ("phrap")
* NBRF/PIR ("pir")
* Mase ("mase")
* Mega ("mega")
* GCG/MSF ("msf")
* Pfam ("pfam")
* Phylip ("phylip")
* Prodom ("prodom")
* Raw ("raw")
* RSF ("rsf")
* Selex ("selex")
* Stockholm ("stockholm")
* Swissprot ("swiss")
* Tab ("tab")
Methods
Methods description
 Title      : new
 Usage      : $guesser = new Bio::Tools::GuessSeqFormat( ... );
 Function   : Creates a new object.
 Example    : See SYNOPSIS.
 Returns    : A new object.
 Arguments  : -file The filename of the file whose format is to
                    be guessed, or
              -fh   An already opened filehandle from which a text
                    stream may be read, or
              -text A scalar containing one or several lines of
                    text with embedded newlines.

              If more than one of the above arguments is given, they
              are tested in the order -text, -file, -fh, and the first
              available argument will be used.
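The precedence among -text, -file and -fh can be sketched as follows. The chosen_source() function is a hypothetical helper written for this illustration. It is not part of the module. It lower-cases argument names as new() does and reports which argument would win.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for this sketch): report which
# input argument would be used, following the -text, -file, -fh order.
sub chosen_source {
    my %args;
    while (@_) {
        my $attr = lc shift @_;    # argument names are lower-cased
        $args{$attr} = shift @_;
    }
    for my $key (qw(-text -file -fh)) {
        return $key if defined $args{$key};
    }
    return undef;
}

# 'seqs.fa' is a made-up filename; -text takes precedence over it.
my $picked = chosen_source(-FILE => 'seqs.fa', -text => ">id\nACGT\n");
my $none   = chosen_source();
```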
 Title      : file
 Usage      : $guesser->file($filename);
              $filename = $guesser->file;
 Function   : Gets or sets the current filename associated with
              an object.
 Returns    : The new filename.
 Arguments  : The filename of the file whose format is to be
              guessed.

              A call to this method will clear the current filehandle and
              the current lines of text associated with the object.
 Title      : fh
 Usage      : $guesser->fh($filehandle);
              $filehandle = $guesser->fh;
 Function   : Gets or sets the current filehandle associated with
              an object.
 Returns    : The new filehandle.
 Arguments  : An already opened filehandle from which a text
              stream may be read.

              A call to this method will clear the current filename and
              the current lines of text associated with the object.
 Title      : text
 Usage      : $guesser->text($linesoftext);
              $linesoftext = $guesser->text;
 Function   : Gets or sets the current text associated with an
              object.
 Returns    : The new lines of text.
 Arguments  : A scalar containing one or several lines of text,
              including embedded newlines.

              A call to this method will clear the current filename and
              the current filehandle associated with the object.
 Title      : guess
 Usage      : $format = $guesser->guess;
              @format = $guesser->guess; # if given a line of text
 Function   : Guesses the format of the data associated with the
              object.
 Returns    : A format string such as "swiss" or "pir". If a
              format cannot be found, undef is returned.
 Arguments  : None.

              If the object is associated with a filehandle and if that
              filehandle is seekable, the filehandle will be returned
              to its original position before the method returns.
From various blast results.
From bioperl, Bio::SeqIO::gcg.
From the ensembl browser (AlignView data export).
Methods code
sub new
{
    my $class = shift;
    my @args  = @_;

    my $self = $class->SUPER::new(@args);

    # Store each argument under its lower-cased name (e.g. "-file").
    my $attr;
    my $value;
    while (@args) {
        $attr  = shift @args;
        $attr  = lc $attr;
        $value = shift @args;
        $self->{$attr} = $value;
    }

    return $self;
}
sub file
{
    my $self = shift;
    my $file = shift;

    # Setting a filename clears the filehandle and the text.
    if (defined $file) {
        $self->{-file} = $file;
        $self->{-fh} = $self->{-text} = undef;
    }
    return $self->{-file};
}

sub fh
{
    my $self = shift;
    my $fh   = shift;

    # Setting a filehandle clears the filename and the text.
    if (defined $fh) {
        $self->{-fh} = $fh;
        $self->{-file} = $self->{-text} = undef;
    }
    return $self->{-fh};
}

sub text
{
    my $self = shift;
    my $text = shift;

    # Setting the text clears the filename and the filehandle.
    if (defined $text) {
        $self->{-text} = $text;
        $self->{-fh} = $self->{-file} = undef;
    }
    return $self->{-text};
}
sub guess
{
    my $self = shift;

    # Record each format's name inside its own entry of %formats.
    foreach my $fmt_key (keys %formats) {
        $formats{$fmt_key}{fmt_string} = $fmt_key;
    }

    my $fh;
    my $start_pos;
    my @lines;

    # Pick the input source in the order -text, -file, -fh.
    if (defined $self->{-text}) {
        @lines = split /\n/, $self->{-text};
    } elsif (defined $self->{-file}) {
        # Three-argument open, so the filename is never parsed as a mode.
        open($fh, '<', $self->{-file}) or
            $self->throw("Can not open '$self->{-file}' for reading: $!");
    } elsif (defined $self->{-fh}) {
        $fh = $self->{-fh};
        # Remember the current position so it can be restored later.
        if (ref $fh eq 'GLOB') {
            $start_pos = tell($fh);
        } elsif (UNIVERSAL::isa($fh, 'IO::Seekable')) {
            $start_pos = $fh->getpos();
        }
    }

    my $done   = 0;
    my $lineno = 0;
    my $fmt_string;
    while (!$done) {
        my $line;
        my $match = 0;

        if (defined $self->{-text}) {
            last if (scalar @lines == 0);
            $line = shift @lines;
        } else {
            last if (!defined($line = <$fh>));
        }

        next if ($line =~ /^\s*$/);    # Skip blank lines.
        chomp($line);
        $line =~ s/\r$//;              # Fix for DOS line endings.
        ++$lineno;

        # Run every test on this line and count how many succeed.
        while (my ($fmt_key, $fmt) = each (%formats)) {
            if ($fmt->{test}($line, $lineno)) {
                ++$match;
                $fmt_string = $fmt->{fmt_string};
            }
        }

        # The guess is conclusive only if exactly one test matched.
        $done = ($match == 1);
    }

    # Close a file we opened ourselves; otherwise rewind the handle.
    if (defined $self->{-file}) {
        close($fh);
    } elsif (ref $fh eq 'GLOB') {
        seek($fh, $start_pos, 0);
    } elsif (defined $fh && $fh->can('setpos')) {
        $fh->setpos($start_pos);
    }

    return ($done ? $fmt_string : undef);
}
sub _possibly_ace
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^(?:Sequence|Peptide|DNA|Protein) [":]/);
}

sub _possibly_blast
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 &&
            $line =~ /^[[:upper:]]*BLAST[[:upper:]]*.*\[.*\]$/);
}

sub _possibly_clustalw
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /CLUSTAL/);
}

sub _possibly_codata
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^ENTRY/) ||
            ($lineno == 2 && $line =~ /^SEQUENCE/) ||
            $line =~ m{^(?:ENTRY|SEQUENCE|///)});
}

sub _possibly_embl
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID / && $line =~ /BP\.$/);
}

sub _possibly_fasta
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i) ||
            $line =~ /^>\s*\w/);
}

sub _possibly_fastxy
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^ FAST(?:XY|A)/) ||
            ($lineno == 2 && $line =~ /^ version \d/));
}

sub _possibly_game
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^<!DOCTYPE game/);
}

sub _possibly_gcg
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /Length: .*Type: .*Check: .*\.\.$/);
}

sub _possibly_gcgblast
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^!!SEQUENCE_LIST/) ||
            ($lineno == 2 &&
             $line =~ /^[[:upper:]]*BLAST[[:upper:]]*.*\[.*\]$/));
}

sub _possibly_gcgfasta
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^!!SEQUENCE_LIST/) ||
            ($lineno == 2 && $line =~ /FASTA/));
}
sub _possibly_gde
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^[{}]$/ ||
            $line =~ /^(?:name|longname|sequence-ID|
                          creation-date|direction|strandedness|
                          type|offset|group-ID|creator|descrip|
                          comment|sequence)/x);
}

sub _possibly_genbank
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /GENETIC SEQUENCE DATA BANK/) ||
            ($lineno == 1 && $line =~ /^LOCUS /) ||
            ($lineno == 2 && $line =~ /^DEFINITION /) ||
            ($lineno == 3 && $line =~ /^ACCESSION /));
}

sub _possibly_genscan
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^GENSCAN.*Date.*Time/) ||
            ($line =~ /^(?:Sequence\s+\w+|Parameter matrix|Predicted genes)/));
}

sub _possibly_gff
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^##gff-version/) ||
            ($lineno == 2 && $line =~ /^##date/));
}

sub _possibly_hmmer
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 2 && $line =~ /^HMMER/) ||
            ($lineno == 3 &&
             $line =~ /Washington University School of Medicine/));
}

sub _possibly_nexus
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^#NEXUS/);
}

sub _possibly_mase
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^;;/) ||
            ($lineno > 1 && $line =~ /^;[^;]?/));
}

sub _possibly_mega
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^#mega$/);
}

sub _possibly_msf
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ m{^//} ||
            $line =~ /MSF:.*Type:.*Check:|Name:.*Len:/);
}

sub _possibly_phrap
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ /^(?:AS\ |CO\ Contig|BQ|AF\ |BS\ |RD\ |
                          QA\ |DS\ |RT\{)/x);
}
sub _possibly_pir
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno != 1 && $line =~ /^[\sA-IK-NP-Z.,()]+\*?$/i) ||
            $line =~ /^>(?:P1|F1|DL|DC|RL|RC|N3|N1);/);
}

sub _possibly_pfam
{
    my ($line, $lineno) = (shift, shift);
    return ($line =~ m{^\w+/\d+-\d+\s+[A-IK-NP-Z.]+}i);
}

sub _possibly_phylip
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^\s*\d+\s\d+/) ||
            ($lineno == 2 && $line =~ /^\w\s+[A-IK-NP-Z\s]+/) ||
            ($lineno == 3 && $line =~ /(?:^\w\s+[A-IK-NP-Z\s]+|\s+[A-IK-NP-Z\s]+)/)
           );
}

sub _possibly_prodom
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID / && $line =~ /\d+ seq\.$/);
}
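sub _possibly_raw
{
    my ($line, $lineno) = (shift, shift);
    # A bare "s" in the class is redundant (it is already in P-Z), so
    # "\s" was presumably intended: allow whitespace within the sequence.
    return ($line =~ /^(?:[\sA-IK-NP-Z]+|[\sa-ik-np-z]+)$/);
}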
sub _possibly_rsf
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^!!RICH_SEQUENCE/) ||
            $line =~ /^[{}]$/ ||
            $line =~ /^(?:name|type|longname|
                          checksum|creation-date|strand|sequence)/x);
}

sub _possibly_selex
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^#=ID /) ||
            ($lineno == 2 && $line =~ /^#=AC /) ||
            ($line =~ /^#=SQ /));
}

sub _possibly_stockholm
{
    my ($line, $lineno) = (shift, shift);
    return (($lineno == 1 && $line =~ /^# STOCKHOLM/) ||
            $line =~ /^#=(?:GF|GS) /);
}

sub _possibly_swiss
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID / && $line =~ /AA\.$/);
}

sub _possibly_tab
{
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^[^\t]+\t[^\t]+/);
}
General documentation
User feedback is an integral part of the evolution of this and
other Bioperl modules. Send your comments and suggestions
preferably to one of the Bioperl mailing lists. Your
participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Report bugs to the Bioperl bug tracking system to help us
keep track of the bugs and their resolution. Bug reports can be
submitted via the web:
http://bugzilla.open-bio.org/
Heikki Lehväslaiho, heikki-at-bioperl-dot-org
All helper subroutines will, given a line of text and the line
number of that line, return a true value if the line could come
from a file of the format that they test for.
A false return value does not mean that the line is not part
of a certain type of file, just that the test did not find any
characteristics of that type of file in the line.
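The calling convention and the meaning of a false result can be seen by running two of the helpers outside the module. The two subroutines below are copied from the methods code above. The input lines are made-up EMBL-style lines used only for illustration.

```perl
use strict;
use warnings;

# Copied from the methods code in this document.
sub _possibly_embl {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID / && $line =~ /BP\.$/);
}

sub _possibly_swiss {
    my ($line, $lineno) = (shift, shift);
    return ($lineno == 1 && $line =~ /^ID / && $line =~ /AA\.$/);
}

# Made-up example lines (an assumption for this sketch).
my @lines = (
    'ID   EXAMPLE1; SV 1; linear; mRNA; STD; PLN; 120 BP.',
    'AC   EXAMPLE1;',
);

# Call each helper with the line and its 1-based line number,
# normalising the result to 1 or 0.
my @embl_hits  = map { _possibly_embl($lines[$_],  $_ + 1) ? 1 : 0 } 0 .. $#lines;
my @swiss_hits = map { _possibly_swiss($lines[$_], $_ + 1) ? 1 : 0 } 0 .. $#lines;
```

Only the first line gives a positive EMBL result. The second line is a plausible EMBL line too, yet the EMBL test returns false for it, because that helper only inspects line 1.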