Bio::Assembly::IO
sam
Toolbar
Summary
Bio::Assembly::IO::sam - An IO module for assemblies in Sam format *BETA*
Package variables
Privates (from "my" definitions)
$progname = 'sam'
Included modules
Carp
File::Basename
File::Spec
File::Temp qw ( tempfile )
Inherit
Synopsis
$aio = Bio::Assembly::IO( -file => "mysam.bam",
-refdb => "myrefseqs.fas");
$assy = $aio->next_assembly;
Description
Methods
Methods description
Title : next_assembly Usage : my $scaffold = $asmio->next_assembly(); Function: return the next assembly in the sam-formatted stream Returns : Bio::Assembly::Scaffold object Args : none |
Title : next_contig Usage : my $contig = $asmio->next_contig(); Function: return the next contig or singlet from the sam stream Returns : Bio::Assembly::Contig or Bio::Assembly::Singlet Args : none |
Title : write_assembly Usage : Function: not implemented (module currrently read-only) Returns : Args : |
Title : _store_contig Usage : my $contigobj = $self->_store_contig(\%contiginfo); Function: create and load a contig object Returns : Bio::Assembly::Contig object Args : Bio::DB::Sam::Segment object |
Title : _store_read Usage : my $readobj = $self->_store_read($readobj, $contigobj); Function: store information of a read belonging to a contig in a contig object Returns : Bio::LocatableSeq Args : Bio::DB::Bam::AlignWrapper, Bio::Assembly::Contig |
Title : _store_singlet Usage : my $singletobj = $self->_store_singlet($contigobj); Function: convert a contig object containing a single read into a singlet object Returns : Bio::Assembly::Singlet Args : Bio::Assembly::Contig (previously loaded with only one seq) |
Title : _init_sam Usage : $self->_init_sam($fasfile) Function: obtain a Bio::DB::Sam parsing of the associated sam file Returns : true on success Args : [optional] name of the fasta reference db (scalar string) Note : The associated file can be plain text (.sam) or binary (.bam); If the fasta file is not specified, and no file is contained in the refdb() attribute, a .fas file with the same basename as the sam file will be searched for. |
Title : _get_contig_segs_from_coverage Usage : Function: calculates separate contigs using coverage info in the segment Returns : array of Bio::DB::Sam::Segment objects, representing each contig Args : Bio::DB::Sam::Segment object |
Title : _calc_consensus_quality Usage : @qual = $aio->_calc_consensus_quality( $contig_seg ); Function: calculate an average or other data-reduced quality over all sites represented by the features contained in a Bio::DB::Sam::Segment Returns : Args : a Bio::DB::Sam::Segment object |
Title : _calc_consensus Usage : @qual = $aio->_calc_consensus( $contig_seg ); Function: calculate a simple quality-weighted consensus sequence for the segment Returns : a SeqWithQuality object Args : a Bio::DB::Sam::Segment object |
Title : refdb Usage : $obj->refdb($newval) Function: the name of the reference db fasta file Example : Returns : value of refdb (a scalar) Args : on set, new value (a scalar or undef, optional) |
Title : _segset Usage : $segset_hashref = $self->_segset() Function: hash container for the Bio::DB::Sam::Segment objects that represent each set of contigs for each seq_id { $seq_id => [@contig_segments], ... } Example : Returns : value of _segset (a hashref) if no arg, or the arrayref of contig segments, if arg == a seq id Args : [optional] seq id (scalar string) Note : readonly; hash elt set in _init_sam() |
Title : _current_refseq_id Usage : $obj->_current_refseq_id($newval) Function: the "current" reference sequence id Example : Returns : value of _current_refseq (a scalar) Args : on set, new value (a scalar or undef, optional) |
Title : current_contig_seg_idx Usage : $obj->current_contig_seg_idx($newval) Function: the "current" segment index in the "current" refseq Example : Returns : value of current_contig_seg_idx (a scalar) Args : on set, new value (a scalar or undef, optional) |
Title : sam Usage : $obj->sam($newval) Function: store the associated Bio::DB::Sam object Example : Returns : value of sam (a Bio::DB::Sam object) Args : on set, new value (a scalar or undef, optional) |
Methods code
BEGIN { unless ( eval "require Bio::DB::Sam; 1" ) {
Bio::Root::Root->throw(__PACKAGE__.' requires installation of samtools (libbam) and Bio::DB::Sam (available on CPAN; not part of BioPerl)');
}
unless ( eval "require IO::Uncompress::Gunzip;\$ HAVE_IO_UNCOMPRESS = 1") {
Bio::Root::Root->warn("IO::Uncompress::Gunzip is not available; you'll have to do your decompression by hand."); } |
sub new
{ my $class = shift;
my @args = @_;
my $self = $class->SUPER::new(@args);
my ($file, $refdb, $format) = $self->_rearrange([qw(FILE REFDB FORMAT)], @args);
$self->file($file);
$refdb && $self->refdb($refdb);
$self->_init_sam() or croak( "Sam initialization failed" );
$self->{_assigned} = {};
return $self;} |
sub next_assembly
{ my $self = shift;
my @contig_set;
my @refseq_ids = $self->sam->seq_ids;
my $assembly = Bio::Assembly::Scaffold->new( -progname => $progname );
foreach my $id (@refseq_ids) {
$self->_current_refseq_id( $id );
while ( my $obj = $self->next_contig()) {
if ($obj->isa('Bio::Assembly::Singlet')) { $assembly->add_singlet($obj);
} else { $assembly->add_contig($obj);
}
}
}
return $assembly;} |
sub next_contig
{ my $self = shift;
if (!defined $self->_current_contig_seg_idx) {
$self->_current_contig_seg_idx(0);
}
else {
$self->_current_contig_seg_idx( 1+$self->_current_contig_seg_idx );
}
unless ($self->_current_refseq_id) {
croak("No current refseq id set");
}
my $contig_segs = $self->_segset($self->_current_refseq_id);
unless ($contig_segs && @$contig_segs) {
croak("No contig segset for current id '".$self->_current_refseq_id."'")
}
my $contig_seg = $$contig_segs[$self->_current_contig_seg_idx];
return if (!defined $contig_seg); my $seqio = $contig_seg->features(-iterator => 1);
my $contigobj = $self->_store_contig($contig_seg);
my $numseq = 0;
while ( my $read = $seqio->next_seq ) {
if ($self->{_assigned}->{$read->name}) {
next;
}
$self->{_assigned}->{$read->name}=1;
$self->_store_read($read, $contigobj);
$numseq++;
}
if ($numseq == 1) { $contigobj = $self->_store_singlet($contigobj);
}
return $contigobj;} |
sub write_assembly
{ my $self = shift;
$self->throw( "This module is currently read-only" );} |
sub _store_contig
{ my ($self, $contig_seg) = @_;
my $contigobj = Bio::Assembly::Contig->new(
-id => 'sam_assy['.$self->_basename.':'.$self->_current_refseq_id.']/'.$contig_seg->start.'-'.$contig_seg->end,
-source => $progname,
-strand => 1
);
my $consobj = $self->_calc_consensus($contig_seg);
my $consensus_seq = Bio::LocatableSeq->new(
-id => $contigobj->id,
-seq => $consobj->seq,
-start => 1,
);
$contigobj->set_consensus_sequence($consensus_seq);
my $consensus_qual = Bio::Seq::PrimaryQual->new(
-id => $contigobj->id,
-qual => $consobj->qual,
-start => 1,
);
$contigobj->set_consensus_quality($consensus_qual);
return $contigobj;} |
sub _store_read
{ my $self = shift;
my ($read, $contigobj) = @_;
my $readseq = Bio::LocatableSeq->new(
-display_id => $read->name,
-primary_id => $read->name,
-seq => $read->dna,
-start => 1,
-strand => $read->strand,
-alphabet => 'dna'
);
my $qual = Bio::Seq::PrimaryQual->new(
-id => $read->name."_qual",
-qual => [$read->qscore]
);
my @pair_info;
if ($read->proper_pair) { @pair_info = (
mate_start => $read->mate_start,
mate_len => $read->mate_len,
mate_strand => $read->mstrand,
insert_size => $read->isize
);
}
my $alncoord = Bio::SeqFeature::Generic->new(
-primary => $read->name,
-start => $read->start,
-end => $read->end,
-strand => $read->strand,
-qual => join(' ',$read->qscore),
-tag => { 'contig' => $contigobj->id,
'cigar' => $read->cigar_str,
@pair_info }
);
$contigobj->set_seq_coord($alncoord, $readseq);
$contigobj->set_seq_qual( $readseq, $qual );
return $readseq;} |
sub _store_singlet
{ my $self = shift;
my ($contigobj) = @_;
my ($readseq) = $contigobj->each_seq;
my $singletobj = Bio::Assembly::Singlet->new( -id => $contigobj->id,
-seqref => $readseq );
return $singletobj;} |
sub _init_sam
{ my $self = shift;
my $fasfile = shift;
my $file = $self->file;
my $sam;
$fasfile ||= $self->refdb;
$file =~ s/^[<>+]*//; my ($fname, $dir, $suf) = fileparse($file, ".sam", ".bam");
$self->_set_from_args({ '_basename' => $fname },
-methods => [qw( _basename)],
-create => 1);
if (!defined $fasfile) {
for (qw( fas fa fasta fas.gz fa.gz fasta.gz )) {
$fasfile = File::Spec->catdir($dir, $self->_basename.$_);
last if -e $fasfile;
undef $fasfile;
}
}
unless (-e $fasfile) {
croak( "Can't find associated reference fasta db" );
}
!$self->refdb && $self->refdb($fasfile);
if ($fasfile =~ /\.gz[^.]*$/) {
unless ($HAVE_IO_UNCOMPRESS) {
croak( "IO::Uncompress::Gunzip not available; can't decompress on the fly");
}
my ($tfh, $tf) = tempfile( UNLINK => 1);
my $z = IO::Uncompress::Gunzip->new($fasfile) or croak("Can't expand: $@");
while (<$z>) { print $tfh $_ }
close $tfh;
$fasfile = $tf;
}
if ($file =~ /\.gz[^.]*$/) {
unless ($HAVE_IO_UNCOMPRESS) {
croak( "IO::Uncompress::Gunzip not available; can't decompress on the fly");
}
my ($tfh, $tf) = tempfile( UNLINK => 1);
my $z = IO::Uncompress::Gunzip->new($file) or croak("Can't expand: $@");
while (<$z>) {
print $tfh $_;
}
close $tfh;
$file = $tf;
}
if (-T $file) {
my $bam = $file;
$bam =~ s/\.sam/\.bam/;
croak( "'$file' looks like a text file.\n\tTo convert to the required .bam (binary SAM) format, run\n\t\$ samtools view -Sb $file > $bam\n");
}
$sam = Bio::DB::Sam->new( -bam => $file,
-fasta => $fasfile,
-autoindex => 1,
-expand_flags => 1);
unless (defined $sam) {
croak( "Couldn't create the Bio::DB::Sam object" );
}
$self->{sam} = $sam;
for ($sam->seq_ids) {
my $seg = $sam->segment(-seq_id=>$_, -start=>1, -end=>$sam->length($_));
${$self->{_segset}}{$_} = [$self->_get_contig_segs_from_coverage($seg)];
}
return 1;} |
sub _get_contig_segs_from_coverage
{ my $self = shift;
my $segment = shift;
unless ($self->sam) {
croak("Sam object not yet initialized (call _init_sam)");
}
unless ( ref($segment) =~ /Bio::DB::Sam::Segment/ ) {
croak("Requires Bio::DB::Sam::Segment object at arg 1");
}
my ($cov, @covdata, @rngs, @segs);
($cov) = $segment->features('coverage');
unless ($cov) {
croak("No coverage data!");
}
@covdata = $cov->coverage;
my $in_contig;
my ($lf_end,$rt_end);
for (0..$#covdata) {
if ($covdata[$_]) {
if ($in_contig) {
$rt_end = $_+1;
next;
}
else {
$in_contig = 1;
if (defined $lf_end && defined $rt_end) {
push @rngs, [$lf_end, $rt_end];
}
$lf_end = $_+1;
}
}
else {
$in_contig = 0;
}
}
push @rngs, [$lf_end, $rt_end] if (defined $lf_end and
defined $rt_end and
$lf_end <= $rt_end);
unless (@rngs) {
carp ("No coverage for this segment!");
return;
}
for (@rngs) {
push @segs, $self->sam->segment(-seq_id=>$segment->seq_id,
-start=>$$_[0],
-end=>$$_[1]);
}
return @segs;} |
sub _calc_consensus_quality
{ my $self = shift;
my $seg = shift;
my @quals;
my $region = $seg->seq_id.':'.$seg->start.'..'.$seg->end;
my $qual_averager = sub {
my ($seqid, $pos, $p) = @_;
return unless ($seg->start <= $pos and $pos <= $seg->end);
my $acc = 0;
my $n = 0;
for my $pileup (@$p) {
my $b = $pileup->alignment;
$acc += $b->qscore->[$pileup->qpos];
$n++;
}
push @quals, int($acc/$n); };
$self->sam->pileup($region, $qual_averager);
return @quals; } |
sub _calc_consensus
{ my $self = shift;
my $seg = shift;
my @quals;
my $conseq ='';
my $region = $seg->seq_id.':'.$seg->start.'-'.$seg->end;
my $weighted_consensus = sub {
my ($seqid, $pos, $p) = @_;
return unless ($seg->start <= $pos and $pos <= $seg->end);
my %wt_tbl;
my %n;
for my $pileup (@$p) {
my $b = $pileup->alignment;
my $res = substr($b->qseq,$pileup->qpos,1);
$wt_tbl{$res} += $b->qscore->[$pileup->qpos] || 0;
$n{$res} ||= 0;
$n{$res}++;
}
my $c = (sort { $wt_tbl{$b}<=>$wt_tbl{$a} } keys %wt_tbl)[0];
$conseq .= $c;
push @quals, int($wt_tbl{$c}/$n{$c}); };
$self->sam->pileup($region, $weighted_consensus);
return Bio::Seq::Quality->new(
-qual => join(' ', @quals ),
-seq => $conseq,
-id => $region
); } |
sub refdb
{ my $self = shift;
return $self->{'refdb'} = shift if @_;
return $self->{'refdb'};} |
sub _segset
{ my $self = shift;
return $self->{'_segset'} unless @_;
return ${$self->{'_segset'}}{shift()};} |
sub _current_refseq_id
{ my $self = shift;
return $self->{'_current_refseq_id'} = shift if @_;
return $self->{'_current_refseq_id'};} |
sub _current_contig_seg_idx
{ my $self = shift;
return $self->{'_current_contig_seg_idx'} = shift if @_;
return $self->{'_current_contig_seg_idx'};} |
sub sam
{ my $self = shift;
return $self->{'sam'};} |
sub DESTROY
{ my $self = shift;
undef $self->{'sam'};
delete $self->{_segset}->{$_} foreach (keys %{$self->{_segset}});
}
1;} |
General documentation
* Required files
A binary SAM (.bam) alignment and a reference sequence database in
FASTA format are required. Various required indexes (.fai, .bai)
will be created as necessary (via
Bio::DB::Sam).
* Compressed files
...can be specified directly , if
IO::Uncompress::Gunzip is
installed. Get it from your local CPAN mirror.
* BAM vs. SAM
The input alignment should be in (possibly gzipped) binary SAM
(.bam) format. If it isn't, you will get a message explaining how
to convert it, viz.:
$ samtools view -Sb mysam.sam > mysam.bam
The bam file must also be sorted on coordinates: do
$ samtools sort mysam.unsorted.bam > mysam.bam
* Contigs
Contigs are calculated by this module, using the 'coverage' feature of
the
Bio::DB::Sam object. A contig represents a contiguous portion
of a reference sequence having non-zero coverage at each base.
The bwa assembler (
http://bio-bwa.sourceforge.net/) can assign read
sequences to multiple reference sequence locations. The present
implementation currently assigns such reads only to the first contig
in which they appear.
* Consensus sequences
Consensus sequence and quality objects are calculated by this module,
using the pileup callback feature of Bio::DB::Sam. The consensus
is (currently) simply the residue at a position that has the maximum
sum of quality values. The consensus quality is the integer portion of
the simple average of quality values for the consensus residue.
* SeqFeatures
Read sequences stored in contigs are accompanied by the following
features:
contig : name of associated contig
cigar : CIGAR string for this read
If the read is paired with a successfully mapped mate, these features
will also be available:
mate_start : coordinate of to which the mate was aligned
mate_len : length of mate read
mate_strand : strand of mate (-1 or 1)
insert_size : size of insert spanned by the mate pair
These features are obtained as follows:
@ids = $contig->get_seq_ids;
$an_id = $id[0]; # or whatever
$seq = $contig->get_seq_by_name($an_id);
# Bio::LocatableSeq's aren't SeqFeature containers, so...
$feat = $contig->get_seq_feat_by_tag(
$seq, "_aligned_coord:".$s->id
);
($cigar) = $feat->get_tag_values('cigar');
# etc.
* Supporting both text SAM (TAM) and binary SAM (BAM)
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to
the Bioperl mailing list. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Please direct usage questions or support issues to the mailing list:
bioperl-l@bioperl.orgrather than to the module maintainer directly. Many experienced and
reponsive experts will be able look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via
the web:
https://redmine.open-bio.org/projects/bioperl/
| AUTHOR - Mark A. Jensen | Top |
Email maj -at- fortinbras -dot- us
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _
| Bio::Assembly::IO compliance | Top |