Bio::SeqIO
genbank
Toolbar
Summary
Bio::SeqIO::genbank - GenBank sequence input/output stream
Package variables
No package variables defined.
Included modules
Inherit
Synopsis
It is probably best not to use this object directly, but
rather go through the SeqIO handler:
$stream = Bio::SeqIO->new(-file => $filename,
-format => 'GenBank');
while ( my $seq = $stream->next_seq() ) {
# do something with $seq
}
Description
This object can transform Bio::Seq objects to and from GenBank flat
file databases.
There is some flexibility here about how to write GenBank output
that is not fully documented.
_show_dna()
(output only) shows the dna or not
_post_sort()
(output only) provides a sorting func which is applied to the FTHelpers
before printing
_id_generation_func()
This is function which is called as
print "ID ", $func($seq), "\n";
To generate the ID line. If it is not there, it generates a sensible ID
line using a number of tools.
If you want to output annotations in Genbank format they need to be
stored in a Bio::Annotation::Collection object which is accessible
through the Bio::SeqI interface method
annotation().
The following are the names of the keys which are pulled from a
Bio::Annotation::Collection object:
reference - Should contain Bio::Annotation::Reference objects
comment - Should contain Bio::Annotation::Comment objects
dblink - Should contain a Bio::Annotation::DBLink object
segment - Should contain a Bio::Annotation::SimpleValue object
origin - Should contain a Bio::Annotation::SimpleValue object
wgs - Should contain a Bio::Annotation::SimpleValue object
Methods
Methods description
Title : next_seq Usage : $seq = $stream->next_seq() Function: returns the next sequence in the stream Returns : Bio::Seq object Args : |
Title : write_seq Usage : $stream->write_seq($seq) Function: writes the $seq object (must be seq) to the stream Returns : 1 for success and 0 for error Args : array of 1 to n Bio::SeqI objects |
Title : _print_GenBank_FTHelper Usage : Function: Example : Returns : Args : |
Title : _read_GenBank_References Usage : Function: Reads references from GenBank format. Internal function really Returns : Args : |
Title : _read_GenBank_Species Usage : Function: Reads the GenBank Organism species and classification lines. Able to deal with unconvential Organism naming formats, and varietas in plants Example : ORGANISM unknown marine gamma proteobacterium NOR5 $genus = undef $species = unknown marine gamma proteobacterium NOR5
ORGANISM Drosophila sp. 'white tip scutellum'
$genus = Drosophila
$species = sp. 'white tip scutellum'
(yes, this really is a species and that is its name)
$subspecies = undef
ORGANISM Ajellomyces capsulatus var. farciminosus
$genus = Ajellomyces
$species = capsulatus
$subspecies = var. farciminosus
ORGANISM Hepatitis delta virus
$genus = undef (though this virus has a genus in its lineage, we
cannot know that without a database lookup)
$species = Hepatitis delta virus
Returns : A Bio::Species object
Args : A reference to the current line buffer |
Title : _read_FTHelper_GenBank Usage : _read_FTHelper_GenBank($buffer) Function: reads the next FT key line Example : Returns : Bio::SeqIO::FTHelper object Args : filehandle and reference to a scalar |
Title : _write_line_GenBank Usage : Function: internal function Example : Returns : Args : |
Title : _write_line_GenBank_regex Usage : Function: internal function for writing lines of specified length, with different first and the next line left hand headers and split at specific points in the text Example : Returns : nothing Args : file handle, first header, second header, text-line, regex for line breaks, total line length |
Title : _post_sort Usage : $obj->_post_sort($newval) Function: Returns : value of _post_sort Args : newvalue (optional) |
Title : _show_dna Usage : $obj->_show_dna($newval) Function: Returns : value of _show_dna Args : newvalue (optional) |
Title : _id_generation_func Usage : $obj->_id_generation_func($newval) Function: Returns : value of _id_generation_func Args : newvalue (optional) |
Title : _ac_generation_func Usage : $obj->_ac_generation_func($newval) Function: Returns : value of _ac_generation_func Args : newvalue (optional) |
Title : _sv_generation_func Usage : $obj->_sv_generation_func($newval) Function: Returns : value of _sv_generation_func Args : newvalue (optional) |
Title : _kw_generation_func Usage : $obj->_kw_generation_func($newval) Function: Returns : value of _kw_generation_func Args : newvalue (optional) |
Methods code
sub _initialize
{ my($self,@args) = @_;
$self->SUPER::_initialize(@args);
$self->{'_func_ftunit_hash'} = {};
$self->_show_dna(1); if( ! defined $self->sequence_factory ) {
$self->sequence_factory(Bio::Seq::SeqFactory->new
(-verbose => $self->verbose(),
-type => 'Bio::Seq::RichSeq'));
}} |
sub next_seq
{ my ($self,@args) = @_;
my %args = @args;
my $builder = $self->sequence_builder();
my $seq;
my %params;
RECORDSTART:
while (1) {
my $buffer;
my (@acc, @features);
my ($display_id, $annotation);
my $species;
@features = ();
$annotation = undef;
@acc = ();
$species = undef;
%params = (-verbose => $self->verbose); local($/) = "\n";
while(defined($buffer = $self->_readline())) {
last if index($buffer,'LOCUS ') == 0;
}
return unless defined $buffer; $buffer =~ /^LOCUS\s+(\S.*)$/o ||
$self->throw("GenBank stream with bad LOCUS line. Not GenBank in my book. Got '$buffer'");
my @tokens = split(' ', $1);
$display_id = shift(@tokens);
$params{'-display_id'} = $display_id;
my $seqlength = shift(@tokens);
if (exists $VALID_ALPHABET{$seqlength}) {
$self->warn("Bad LOCUS name? Changing [$params{'-display_id'}] to 'unknown' and length to $display_id");
$params{'-display_id'} = 'unknown';
$params{'-length'} = $display_id;
unshift @tokens, $seqlength;
} else {
$params{'-length'} = $seqlength;
}
my $alphabet = lc(shift @tokens);
$params{'-alphabet'} = (exists $VALID_ALPHABET{$alphabet}) ? $VALID_ALPHABET{$alphabet} :
$self->warn("Unknown alphabet: $alphabet");
if ($params{'-alphabet'} eq 'protein') {
$params{'-molecule'} = 'PRT'
} else {
$params{'-molecule'} = shift(@tokens);
}
if ($params{'-molecule'} eq 'dna' || $params{'-molecule'} eq 'rna') {
$params{'-molecule'} = uc $params{'-molecule'};
}
$self->debug("Unrecognized molecule type:".$params{'-molecule'}) if
!exists($VALID_MOLTYPE{$params{'-molecule'}});
my $circ = shift(@tokens);
if ($circ eq 'circular') {
$params{'-is_circular'} = 1;
$params{'-division'} = shift(@tokens);
} else {
$params{'-division'} =
(CORE::length($circ) == 3 ) ? $circ : shift(@tokens);
}
my $date = join(' ', @tokens);
if($date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/) {
if( length($date) < 11 ) {
my ($d,$m,$y) = ($2,$3,$4);
if( length($d) == 1 ) {
$d = "0$d";
}
if( length($y) == 2 ) {
if( $y > 60 ) { $y = "19$y";
} else {
$y = "20$y";
}
$self->warn("Date was malformed, guessing the century for $date to be $y\n");
}
$params{'-dates'} = [join('-',$d,$m,$y)];
} else {
$params{'-dates'} = [$date];
}
}
$builder->add_slot_value(%params);
%params = ();
if(! $builder->want_object()) {
$builder->make_object();
next RECORDSTART;
}
if($builder->want_slot('annotation')) {
$annotation = Bio::Annotation::Collection->new();
}
$buffer = $self->_readline();
until( !defined ($buffer) ) {
$_ = $buffer;
if (/^DEFINITION\s+(\S.*\S)/) {
my @desc = ($1);
while ( defined($_ = $self->_readline) ) {
if( /^\s+(.*)/ ) { push (@desc, $1); next };
last;
}
$builder->add_slot_value(-desc => join(' ', @desc));
$buffer= $_;
}
if( /^ACCESSION\s+(\S.*\S)/ ) {
push(@acc, split(/\s+/,$1));
while( defined($_ = $self->_readline) ) {
/^\s+(.*)/ && do { push (@acc, split(/\s+/,$1)); next };
last;
}
$buffer = $_;
next;
}
elsif( /^PID\s+(\S+)/ ) {
$params{'-pid'} = $1;
}
elsif( /^VERSION\s+(\S.+)$/ ) {
my ($acc,$gi) = split(' ',$1);
if($acc =~ /^\w+\.(\d+)/) {
$params{'-version'} = $1;
$params{'-seq_version'} = $1;
}
if($gi && (index($gi,"GI:") == 0)) {
$params{'-primary_id'} = substr($gi,3);
}
}
elsif( /^KEYWORDS\s+(\S.*)/ ) {
my @kw = split(/\s*\;\s*/,$1);
while( defined($_ = $self->_readline) ) {
chomp;
/^\s+(.*)/ && do { push (@kw, split(/\s*\;\s*/,$1)); next };
last;
}
@kw && $kw[-1] =~ s/\.$//;
$params{'-keywords'} =\@ kw;
$buffer = $_;
next;
}
elsif (/^SOURCE\s+\S/) {
if($builder->want_slot('species')) {
$species = $self->_read_GenBank_Species(\$buffer);
$builder->add_slot_value(-species => $species);
} else {
while(defined($buffer = $self->_readline())) {
last if substr($buffer,0,1) ne ' ';
}
}
next;
}
elsif (/^REFERENCE\s+\S/) {
if($annotation) {
my @refs = $self->_read_GenBank_References(\$buffer);
foreach my $ref ( @refs ) {
$annotation->add_Annotation('reference',$ref);
}
} else {
while(defined($buffer = $self->_readline())) {
last if substr($buffer,0,1) ne ' ';
}
}
next;
}
elsif (/^PROJECT\s+(\S.*)/) {
if ($annotation) {
my $project = new Bio::Annotation::SimpleValue->new(-value => $1);
$annotation->add_Annotation('project',$project);
}
}
elsif (/^COMMENT\s+(\S.*)/) {
if($annotation) {
my $comment = $1;
while (defined($_ = $self->_readline)) {
last if (/^\S/);
$comment .= $_;
}
$comment =~ s/\n/ /g;
$comment =~ s/ +/ /g;
$annotation->add_Annotation('comment',
Bio::Annotation::Comment->new(-text => $comment,
-tagname => 'comment'));
$buffer = $_;
} else {
while(defined($buffer = $self->_readline())) {
last if substr($buffer,0,1) ne ' ';
}
}
next;
}
elsif( /^DB(?:SOURCE|LINK)\s+(\S.+)/ ) {
if ($annotation) {
my $dbsource = $1;
while (defined($_ = $self->_readline)) {
last if (/^\S/);
$dbsource .= $_;
}
if( $dbsource =~ s/(UniProt(?:KB)?|swissprot):\s+locus\s+(\S+)\,.+\n// ) {
$annotation->add_Annotation
('dblink',
Bio::Annotation::DBLink->new
(-primary_id => $2,
-database => $1,
-tagname => 'dblink'));
if( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
$annotation->add_Annotation
('swissprot_dates',
Bio::Annotation::SimpleValue->new
(-tagname => 'date_created',
-value => $1));
}
while( $dbsource =~ s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g ) {
$annotation->add_Annotation
('swissprot_dates',
Bio::Annotation::SimpleValue->new
(-tagname => 'date_updated',
-value => $2));
}
$dbsource =~ s/\n/ /g;
if( $dbsource =~ s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/ ) {
my $i = 0;
for my $dbsrc ( split(/,\s+/,$1) ) {
if( $dbsrc =~ /(\S+)\.(\d+)/ ||
$dbsrc =~ /(\S+)/ ) {
my ($id,$version) = ($1,$2);
$version ='' unless defined $version;
my $db;
if( $id =~ /^\d\S{3}/) {
$db = 'PDB';
} else {
$db = ($i++ % 2 ) ? 'GenPept' : 'GenBank';
}
$annotation->add_Annotation
('dblink',
Bio::Annotation::DBLink->new
(-primary_id => $id,
-version => $version,
-database => $db,
-tagname => 'dblink'));
}
}
} elsif( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) {
my $i = 0;
for my $id ( split(/\,\s+/,$1) ) {
my ($acc,$db);
if( $id =~ /gi:\s+(\d+)/ ) {
$acc= $1;
$db = ($i++ % 2 ) ? 'GenPept' : 'GenBank';
} elsif( $id =~ /pdb\s+accession\s+(\S+)/ ) {
$acc= $1;
$db = 'PDB';
} else {
$acc= $id;
$db = '';
}
$annotation->add_Annotation
('dblink',
Bio::Annotation::DBLink->new
(-primary_id => $acc,
-database => $db,
-tagname => 'dblink'));
}
} else {
$self->debug("Cannot match $dbsource\n");
}
if( $dbsource =~ s/xrefs\s+\(non\-sequence\s+databases\):\s+ ((?:\S+,\s+)+\S+)//x ) {
for my $id ( split(/\,\s+/,$1) ) {
my $db;
$db = substr($id,0,index($id,':'));
if (! exists $DBSOURCE{ $db }) {
$db = ''; }
$id = substr($id,index($id,':')+1);
$annotation->add_Annotation
('dblink',
Bio::Annotation::DBLink->new
(-primary_id => $id,
-database => $db,
-tagname => 'dblink'));
}
}
} else {
if( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/ ) {
my ($db,$id,$version) = ($1,$2,$3);
$annotation->add_Annotation
('dblink',
Bio::Annotation::DBLink->new
(-primary_id => $id,
-version => $version,
-database => $db || 'GenBank',
-tagname => 'dblink'));
} elsif ( $dbsource =~ /(\S+)([\.:])\s*(\d+)/ ) {
my ($id, $db, $version);
if ($2 eq ':') {
($db, $id) = ($1, $3);
} else {
($db, $id, $version) = ('GenBank', $1, $3);
}
$annotation->add_Annotation('dblink',
Bio::Annotation::DBLink->new(
-primary_id => $id,
-version => $version,
-database => $db,
-tagname => 'dblink')
);
} else {
$self->warn("Unrecognized DBSOURCE data: $dbsource\n");
}
}
$buffer = $_;
} else {
while(defined($buffer = $self->_readline())) {
last if substr($buffer,0,1) ne ' ';
}
}
next;
}
last if( /^(FEATURES|ORIGIN)/ );
$buffer = $self->_readline;
}
return unless defined $buffer;
$builder->add_slot_value(-accession_number => shift(@acc),
-secondary_accessions =>\@ acc,
%params);
$builder->add_slot_value(-annotation => $annotation) if $annotation;
%params = ();
if(! $builder->want_object()) {
$builder->make_object();
next RECORDSTART;
}
if($builder->want_slot('features') && defined($_) && /^FEATURES/o) {
$buffer = $self->_readline;
while( defined($buffer) ) {
last if( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o);
my $ftunit = $self->_read_FTHelper_GenBank(\$buffer);
if( !defined $ftunit ) {
$self->warn("Unexpected error in feature table for ".$params{'-display_id'}." Skipping feature, attempting to recover");
unless( ($buffer =~ /^\s{5,5}\S+/o) or
($buffer =~ /^\S+/o)) {
$buffer = $self->_readline();
}
next; }
my $feat =
$ftunit->_generic_seqfeature($self->location_factory(),
$display_id);
if($species && ($feat->primary_tag eq 'source') &&
$feat->has_tag('db_xref') && (! $species->ncbi_taxid() ||
($species->ncbi_taxid && $species->ncbi_taxid =~ /^list/))) {
foreach my $tagval ($feat->get_tag_values('db_xref')) {
if(index($tagval,"taxon:") == 0) {
$species->ncbi_taxid(substr($tagval,6));
last;
}
}
}
push(@features, $feat);
}
$builder->add_slot_value(-features =>\@ features);
$_ = $buffer;
}
if( defined ($_) ) {
if( /^CONTIG/o ) {
my @contig;
my $ctg = '';
while($_ !~ m{^//}) { $_ =~ /^(?:CONTIG)?\s+(.*)/;
$ctg .= $1;
$_ = $self->_readline;
}
if ($ctg) {
$annotation->add_Annotation(
Bio::Annotation::SimpleValue->new(-tagname => 'contig',
-value => $ctg )
);
}
$self->_pushback($_);
} elsif( /^WGS|WGS_SCAFLD\s+/o ) { while($_ =~ s/(^WGS|WGS_SCAFLD)\s+//){ chomp;
$annotation->add_Annotation(
Bio::Annotation::SimpleValue->new(-value => $_,
-tagname => lc($1)));
$_ = $self->_readline;
}
} elsif(! m{^(ORIGIN|//)} ) { while (defined( $_ = $self->_readline) ) {
last if m{^(ORIGIN|//)};
}
}
}
if(! $builder->want_object()) {
$builder->make_object(); next RECORDSTART;
}
if($builder->want_slot('seq')) {
if(defined($_) && s/^ORIGIN\s+//) {
chomp;
if( $annotation && length($_) > 0 ) {
$annotation->add_Annotation('origin',
Bio::Annotation::SimpleValue->new(-tagname => 'origin',
-value => $_));
}
my $seqc = '';
while( defined($_ = $self->_readline) ) {
m{^//} && last;
$_ = uc($_);
s/[^A-Za-z]//g;
$seqc .= $_;
}
$builder->add_slot_value(-seq => $seqc);
}
} elsif ( defined($_) && (substr($_,0,2) ne '//')) {
while( defined($_ = $self->_readline) ) {
last if substr($_,0,2) eq '//';
}
}
$seq = $builder->make_object();
next RECORDSTART unless $seq;
last RECORDSTART;
}
return $seq;} |
sub write_seq
{ my ($self,@seqs) = @_;
foreach my $seq ( @seqs ) {
$self->throw("Attempting to write with no seq!") unless defined $seq;
if( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
$self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
}
my $str = $seq->seq;
my ($div, $mol);
my $len = $seq->length();
if ( $seq->can('division') ) {
$div = $seq->division;
}
if( !defined $div || ! $div ) { $div = 'UNK'; }
my $alpha = $seq->alphabet;
if( !$seq->can('molecule') || ! defined ($mol = $seq->molecule()) ) {
$mol = $alpha || 'DNA';
}
my $circular = 'linear ';
$circular = 'circular' if $seq->is_circular;
local($^W) = 0;
my $temp_line;
if( $self->_id_generation_func ) {
$temp_line = &{$self->_id_generation_func}($seq);
} else {
my $date = '';
if( $seq->can('get_dates') ) {
($date) = $seq->get_dates();
}
$self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
if $seq->display_id =~ /\s/;
$temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s\n",
'LOCUS', $seq->id(),$len,
(lc($alpha) eq 'protein') ? ('aa','', '') :
('bp', '',$mol),$circular,$div,$date);
}
$self->_print($temp_line);
$self->_write_line_GenBank_regex("DEFINITION ", " ",
$seq->desc(),"\\s\+\|\$",80);
if( $self->_ac_generation_func ) {
$temp_line = &{$self->_ac_generation_func}($seq);
$self->_print("ACCESSION $temp_line\n");
} else {
my @acc = ();
push(@acc, $seq->accession_number());
if( $seq->isa('Bio::Seq::RichSeqI') ) {
push(@acc, $seq->get_secondary_accessions());
}
$self->_print("ACCESSION ", join(" ", @acc), "\n");
}
if($seq->isa('Bio::Seq::RichSeqI') && $seq->pid()) {
$self->_print("PID ", $seq->pid(), "\n");
}
if( defined $self->_sv_generation_func() ) {
$temp_line = &{$self->_sv_generation_func}($seq);
if( $temp_line ) {
$self->_print("VERSION $temp_line\n");
}
} else {
if($seq->isa('Bio::Seq::RichSeqI') && defined($seq->seq_version)) {
my $id = $seq->primary_id(); $self->_print("VERSION ",
$seq->accession_number(), ".", $seq->seq_version,
($id && ($id =~ /^\d+$/) ? " GI:".$id : ""),
"\n");
}
}
for my $proj ( $seq->annotation->get_Annotations('project') ) {
$self->_print("PROJECT ".$proj->value."\n");
}
foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
my ($db, $id) = ($ref->database, $ref->primary_id);
my $prefix = $db eq 'Project' ? 'DBLINK' : 'DBSOURCE';
my $text = $db eq 'GenBank' ? '' :
$db eq 'Project' ? "$db:$id" : "$db accession $id";
$self->_print(sprintf ("%-11s %s\n",$prefix, $text));
}
if( defined $self->_kw_generation_func() ) {
$temp_line = &{$self->_kw_generation_func}($seq);
$self->_print("KEYWORDS $temp_line\n");
} else {
if( $seq->can('keywords') ) {
my $kw = $seq->keywords;
$kw .= '.' if( $kw !~ /\.$/ );
$self->_print("KEYWORDS $kw\n");
}
}
foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
$self->_print(sprintf ("%-11s %s\n",'SEGMENT',
$ref->value));
}
if (my $spec = $seq->species) {
my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
$spec->scientific_name,
$spec->common_name);
my @classification;
if ($spec->isa('Bio::Species')) {
@classification = $spec->classification;
shift(@classification);
} else {
my $node = $spec;
while ($node) {
$node = $node->ancestor || last;
unshift(@classification, $node->node_name);
}
@classification = reverse @classification;
}
my $abname = $spec->name('abbreviated') ? $spec->name('abbreviated')->[0] : $sn;
my $sl = $on ? "$on " : '';
$sl .= $cn ? $abname." ($cn)" : "$abname";
$self->_write_line_GenBank_regex("SOURCE ", ' 'x12, $sl, "\\s\+\|\$",80);
$self->_print(" ORGANISM ", $spec->scientific_name, "\n");
my $OC = join('; ', (reverse(@classification))) .'.';
$self->_write_line_GenBank_regex(' 'x12,' 'x12,
$OC,"\\s\+\|\$",80);
}
my $count = 1;
foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
$temp_line = "REFERENCE $count";
if ($ref->start) {
$temp_line .= sprintf (" (%s %d to %d)",
($seq->alphabet() eq "protein" ?
"residues" : "bases"),
$ref->start,$ref->end);
} elsif ($ref->gb_reference) {
$temp_line .= sprintf (" (%s)", $ref->gb_reference);
}
$self->_print("$temp_line\n");
$self->_write_line_GenBank_regex(" AUTHORS ",' 'x12,
$ref->authors,"\\s\+\|\$",80);
$self->_write_line_GenBank_regex(" CONSRTM ",' 'x12,
$ref->consortium,"\\s\+\|\$",80) if $ref->consortium;
$self->_write_line_GenBank_regex(" TITLE "," "x12,
$ref->title,"\\s\+\|\$",80);
$self->_write_line_GenBank_regex(" JOURNAL "," "x12,
$ref->location,"\\s\+\|\$",80);
if( $ref->medline) {
$self->_write_line_GenBank_regex(" MEDLINE "," "x12,
$ref->medline, "\\s\+\|\$",80);
}
if( $ref->pubmed ) {
$self->_write_line_GenBank_regex(" PUBMED "," "x12,
$ref->pubmed, "\\s\+\|\$",
80);
}
if ($ref->comment) {
$self->_write_line_GenBank_regex(" REMARK "," "x12,
$ref->comment,"\\s\+\|\$",80);
}
$count++;
}
foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
$self->_write_line_GenBank_regex("COMMENT "," "x12,
$comment->text,"\\s\+\|\$",80);
}
$self->_print("FEATURES Location/Qualifiers\n");
if( defined $self->_post_sort ) {
my $post_sort_func = $self->_post_sort();
my @fth;
foreach my $sf ( $seq->top_SeqFeatures ) {
push(@fth,Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq));
}
@fth = sort { &$post_sort_func($a,$b) } @fth;
foreach my $fth ( @fth ) {
$self->_print_GenBank_FTHelper($fth);
}
} else {
foreach my $sf ( $seq->top_SeqFeatures ) {
my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq);
foreach my $fth ( @fth ) {
if( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
$sf->throw("Cannot process FTHelper... $fth");
}
$self->_print_GenBank_FTHelper($fth);
}
}
}
if($seq->annotation->get_Annotations('wgs')) {
foreach my $wgs
(map {$seq->annotation->get_Annotations($_)} qw(wgs wgs_scaffold)) {
$self->_print(sprintf ("%-11s %s\n",uc($wgs->tagname),
$wgs->value));
}
$self->_show_dna(0);
}
if($seq->annotation->get_Annotations('contig')) {
my $ct = 0;
my $cline;
foreach my $contig ($seq->annotation->get_Annotations('contig')) {
unless ($ct) {
$cline = uc($contig->tagname)." ".$contig->value."\n";
} else {
$cline = " ".$contig->value."\n";
}
$self->_print($cline);
$ct++;
}
$self->_show_dna(0);
}
if( $seq->length == 0 ) { $self->_show_dna(0) }
if( $self->_show_dna() == 0 ) {
$self->_print("\n//\n");
return;
}
$str =~ tr/A-Z/a-z/;
my ($o) = $seq->annotation->get_Annotations('origin');
$self->_print(sprintf("%-12s%s\n",
'ORIGIN', $o ? $o->value : ''));
my $nuc = 60; my $whole_pat = 'a10' x 6; my $out_pat = 'A11' x 6; my $length = length($str);
my $whole = int($length / $nuc) * $nuc;
my $i;
for ($i = 0; $i < $whole; $i += $nuc) {
my $blocks = pack $out_pat,
unpack $whole_pat,
substr($str, $i, $nuc);
chop $blocks;
$self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
}
if (my $last = substr($str, $i)) {
my $last_len = length($last);
my $last_pat = 'a10' x int($last_len / 10) . 'a'. $last_len % 10; my $blocks = pack $out_pat,
unpack($last_pat, $last);
$blocks =~ s/ +$//;
$self->_print(sprintf("%9d $blocks\n",
$length - $last_len + 1));
}
$self->_print("//\n");
$self->flush if $self->_flush_on_write && defined $self->_fh;
return 1;
}} |
sub _print_GenBank_FTHelper
{ my ($self,$fth) = @_;
if( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {
$fth->warn("$fth is not a FTHelper class. Attempting to print, but there could be tears!");
}
my $spacer = (length $fth->key >= 15) ? ' ' : '';
$self->_write_line_GenBank_regex(sprintf(" %-16s%s",$fth->key,$spacer),
" "x21,
$fth->loc,"\,\|\$",80);
foreach my $tag ( keys %{$fth->field} ) {
foreach my $value ( @{$fth->field->{$tag}} ) {
$value =~ s/\"/\"\"/g;
if ($value eq "_no_value") {
$self->_write_line_GenBank_regex(" "x21,
" "x21,
"/$tag","\.\|\$",80);
}
elsif (!$FTQUAL_NO_QUOTE{$tag}) {
my ($pat) = ($value =~ /\s/ ? '\s|$' : '.|$');
$self->_write_line_GenBank_regex(" "x21,
" "x21,
"/$tag=\"$value\"",$pat,80);
} else {
$self->_write_line_GenBank_regex(" "x21,
" "x21,
"/$tag=$value","\.\|\$",80);
}
}
}} |
sub _read_GenBank_References
{ my ($self,$buffer) = @_;
my (@refs);
my $ref;
if( $$buffer !~ /^REFERENCE/ ) {
warn("Not parsing line '$$buffer' which maybe important");
}
$_ = $$buffer;
my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
REFLOOP: while( defined($_) || defined($_ = $self->_readline) ) {
if (/^\s{2}AUTHORS\s+(.*)/o) {
push (@authors, $1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/o && do { push (@authors, $1);next;};
last;
}
$ref->authors(join(' ', @authors));
}
if (/^\s{2}CONSRTM\s+(.*)/o) {
push (@consort, $1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/o && do { push (@consort, $1);next;};
last;
}
$ref->consortium(join(' ', @consort));
}
if (/^\s{2}TITLE\s+(.*)/o) {
push (@title, $1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/o && do { push (@title, $1);
next;
};
last;
}
$ref->title(join(' ', @title));
}
if (/^\s{2}JOURNAL\s+(.*)/o) {
push(@loc, $1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/o && do { push(@loc, $1);
next;
};
last;
}
$ref->location(join(' ', @loc));
redo REFLOOP;
}
if (/^\s{2}REMARK\s+(.*)/o) {
push (@com, $1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/o && do { push(@com, $1);
next;
};
last;
}
$ref->comment(join(' ', @com));
redo REFLOOP;
}
if( /^\s{2}MEDLINE\s+(.*)/ ) {
push(@medline,$1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/ && do { push(@medline, $1);
next;
};
last;
}
$ref->medline(join(' ', @medline));
redo REFLOOP;
}
if( /^\s{3}PUBMED\s+(.*)/ ) {
push(@pubmed,$1);
while ( defined($_ = $self->_readline) ) {
/^\s{9,}(.*)/ && do { push(@pubmed, $1);
next;
};
last;
}
$ref->pubmed(join(' ', @pubmed));
redo REFLOOP;
}
/^REFERENCE/o && do {
$self->_add_ref_to_array(\@refs,$ref) if defined $ref;
@authors = ();
@title = ();
@loc = ();
@com = ();
@pubmed = ();
@medline = ();
$ref = Bio::Annotation::Reference->new(-tagname => 'reference');
if (/^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
$ref->start($1);
$ref->end($2);
} elsif (/^REFERENCE\s+\d+\s+\((.*)\)/) {
$ref->gb_reference($1);
}
};
/^(FEATURES)|(COMMENT)/o && last;
$_ = undef; }
$self->_add_ref_to_array(\@refs,$ref) if defined $ref;
$$buffer = $_;
return @refs;
}
} |
sub _add_ref_to_array
{ my ($self, $refs, $ref) = @_;
my $au = $ref->authors();
my $title = $ref->title();
$au =~ s/;\s*$//g if $au;
$title =~ s/;\s*$//g if $title;
$ref->authors($au);
$ref->title($title);
push(@{$refs}, $ref);} |
sub _read_GenBank_Species
{ my ($self, $buffer) = @_;
my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
'Unspecified', 'Unknown', 'None', 'unclassified',
'unidentified organism', 'not supplied');
my @unkn_genus = ('unknown','unclassified','uncultured','unidentified');
my $line = $$buffer;
my( $sub_species, $species, $genus, $sci_name, $common, $class_lines,
$source_flag, $abbr_name, $organelle, $sl );
my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
my ($ann, $tag, $data);
while (defined($line) || defined($line = $self->_readline())) {
$line =~ s{<[^>]+>}{}g;
if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
($tag, $data) = ($1, $2 || '');
last if ($tag && !exists $source{$tag});
} else {
return unless $tag;
($data = $line) =~ s{^\s+}{};
chomp $data;
$tag = 'CLASSIFICATION' if ($tag ne 'CLASSIFICATION' && $tag eq 'ORGANISM' && $line =~ m{[;\.]+});
}
(exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
$line = undef;
}
($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
$$buffer = $line;
$sci_name || return;
if ($sl =~ m{^ (mitochondrion|chloroplast|plastid)? \s*(.*?) \s*(?: \( (.*?) \) )?\.? $ }xms ){
($organelle, $abbr_name, $common) = ($1, $2, $3); } else {
$abbr_name = $sl; }
my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /(?<!subgen)[;\.]+/, $class_lines;
my $possible_genus = quotemeta($class[-1])
. ($class[-2] ? "|" . quotemeta($class[-2]) : '');
if ($sci_name =~ /^($possible_genus)/) {
$genus = $1;
($species) = $sci_name =~ /^$genus\s+(.+)/;
}
else {
$species = $sci_name;
}
if ($species && $species =~ /(.+)\s+((?:subsp\.|var\.).+)/) {
($species, $sub_species) = ($1, $2);
}
my $unkn = grep { $_ eq $sl } @unkn_names;
return unless ($species || $genus) and $unkn == 0;
push(@class, $sci_name);
@class = reverse @class;
my $make = Bio::Species->new();
$make->scientific_name($sci_name);
$make->classification(@class) if @class > 0;
$make->common_name( $common ) if $common;
$make->name('abbreviated', $abbr_name) if $abbr_name;
$make->organelle($organelle) if $organelle;
return $make;} |
sub _read_FTHelper_GenBank
{ my ($self,$buffer) = @_;
my ($key, $loc );
my @qual = ();
if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
$key = $1;
$loc = $2;
while ( defined($_ = $self->_readline) ) {
if (/^(\s+)(.+?)\s*$/o) {
if (length($1) > 6) {
if (@qual || (index($2,'/') == 0)) {
push(@qual, $2);
}
else {
$loc .= $2;
}
} else {
last;
}
} else {
last;
}
}
} else {
$self->debug("no feature key!\n");
$$buffer = $self->_readline();
return;
}
$$buffer = $_;
my $out = Bio::SeqIO::FTHelper->new();
$out->verbose($self->verbose());
$out->key($key);
$out->loc($loc);
QUAL:
for (my $i = 0; $i < @qual; $i++) {
$_ = $qual[$i];
my( $qualifier, $value ) = (m{^/([^=]+)(?:=(.+))?})
or $self->warn("cannot see new qualifier in feature $key: ".
$qual[$i]);
$qualifier = '' unless( defined $qualifier);
if (defined $value) {
if (substr($value, 0, 1) eq '"') {
while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
if($i >= $#qual) {
$self->warn("Unbalanced quote in:\n" .
join("\n", @qual) .
"No further qualifiers will " .
"be added for this feature");
last QUAL;
}
$i++; my $next = $qual[$i];
if ($qualifier ne 'translation') {
$value .= " ";
}
$value .= $next;
}
$value =~ s/^"|"$//g;
$value =~ s/""/\"/g;
} elsif ( $value =~ /^\(/ ) { my $left = ($value =~ tr/\(/\(/); my $right = ($value =~ tr/\)/\)/);
while( $left != $right ) { if( $i >= $#qual) {
$self->warn("Unbalanced parens in:\n".
join("\n", @qual).
"\nNo further qualifiers will ".
"be added for this feature");
last QUAL;
}
$i++;
my $next = $qual[$i];
$value .= $next;
$left += ($next =~ tr/\(/\(/);
$right += ($next =~ tr/\)/\)/);
}
}
} else {
$value = '_no_value';
}
$out->field->{$qualifier} ||= [];
push(@{$out->field->{$qualifier}},$value);
}
return $out;} |
sub _write_line_GenBank
{ my ($self,$pre1,$pre2,$line,$length) = @_;
$length || $self->throw("Miscalled write_line_GenBank without length. Programming error!");
my $subl = $length - length $pre2;
my $linel = length $line;
my $i;
my $subr = substr($line,0,$length - length $pre1);
$self->_print("$pre1$subr\n");
for($i= ($length - length $pre1);$i < $linel; $i += $subl) {
$subr = substr($line,$i,$subl);
$self->_print("$pre2$subr\n");
}} |
sub _write_line_GenBank_regex
{ my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
$length || $self->throw( "Miscalled write_line_GenBank without length. Programming error!");
my $subl = $length - (length $pre1) - 2;
my @lines = ();
CHUNK: while($line) {
foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
if($line =~ m/^(.{0,$subl})($pat)(.*)/ ) { my $l = $1.$2; $line = substr($line,length($l));
$l =~ s/\s+$//;
next CHUNK if ($l eq '');
push(@lines, $l);
next CHUNK;
}
}
$self->warn("trouble dissecting\" $line\"\n into chunks ".
"of $subl chars or less - this tag won't print right");
$line = substr($line,0,$subl) . " " . substr($line,$subl);
}
my $s = shift @lines;
$self->_print("$pre1$s\n") if $s;
foreach my $s ( @lines ) {
$self->_print("$pre2$s\n");
}} |
sub _post_sort
{ my ($obj,$value) = @_;
if( defined $value) {
$obj->{'_post_sort'} = $value;
}
return $obj->{'_post_sort'};} |
sub _show_dna
{ my ($obj,$value) = @_;
if( defined $value) {
$obj->{'_show_dna'} = $value;
}
return $obj->{'_show_dna'};} |
sub _id_generation_func
{ my ($obj,$value) = @_;
if( defined $value ) {
$obj->{'_id_generation_func'} = $value;
}
return $obj->{'_id_generation_func'};} |
sub _ac_generation_func
{ my ($obj,$value) = @_;
if( defined $value ) {
$obj->{'_ac_generation_func'} = $value;
}
return $obj->{'_ac_generation_func'};} |
sub _sv_generation_func
{ my ($obj,$value) = @_;
if( defined $value ) {
$obj->{'_sv_generation_func'} = $value;
}
return $obj->{'_sv_generation_func'};} |
sub _kw_generation_func
{ my ($obj,$value) = @_;
if( defined $value ) {
$obj->{'_kw_generation_func'} = $value;
}
return $obj->{'_kw_generation_func'};
}
1;} |
General documentation
| Where does the data go? | Top |
Data parsed in
Bio::SeqIO::genbank is stored in a variety of data
fields in the sequence object that is returned. Here is a partial list
of fields.
Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
the top level object which defines a function called NAME() which
stores this information.
Items listed as Annotation 'NAME' tell you the data is stored the
associated Bio::AnnotationCollectionI object which is associated with
Bio::Seq objects. If it is explicitly requested that no annotations
should be stored when parsing a record of course they will not be
available when you try and get them. If you are having this problem
look at the type of SeqBuilder that is being used to contruct your
sequence object.
Comments Annotation 'comment'
References Annotation 'reference'
Segment Annotation 'segment'
Origin Annotation 'origin'
Dbsource Annotation 'dblink'
Accessions PrimarySeq accession_number()
Secondary accessions RichSeq get_secondary_accessions()
GI number PrimarySeq primary_id()
LOCUS PrimarySeq display_id()
Keywords RichSeq get_keywords()
Dates RichSeq get_dates()
Molecule RichSeq molecule()
Seq Version RichSeq seq_version()
PID RichSeq pid()
Division RichSeq division()
Features Seq get_SeqFeatures()
Alphabet PrimarySeq alphabet()
Definition PrimarySeq description() or desc()
Version PrimarySeq version()
Sequence PrimarySeq seq()
There is more information in the Feature-Annotation HOWTO about each
field and how it is mapped to the Sequence object
http://bioperl.open-bio.org/wiki/HOWTO:Feature-Annotation.
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Please direct usage questions or support issues to the mailing list:
bioperl-l@bioperl.org
rather than to the module maintainer directly. Many experienced and
reponsive experts will be able look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
Report bugs to the Bioperl bug tracking system to help us keep track
the bugs and their resolution. Bug reports can be submitted via the web:
https://redmine.open-bio.org/projects/bioperl/
| AUTHOR - Bioperl Project | Top |
bioperl-l at bioperl.org
Original author Elia Stupka, elia -at- tigem.it
Ewan Birney birney at ebi.ac.uk
Jason Stajich jason at bioperl.org
Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
Lincoln Stein lstein at cshl.org
Heikki Lehvaslaiho, heikki at ebi.ac.uk
Hilmar Lapp, hlapp at gmx.net
Donald G. Jackson, donald.jackson at bms.com
James Wasmuth, james.wasmuth at ed.ac.uk
Brian Osborne, bosborne at alum.mit.edu
Chris Fields, cjfields at bioperl dot org
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _