Introduction

The Bio.SeqLoc modules in seqloc are designed to represent positions and locations (ranges of positions) on sequences, particularly nucleotide sequences. My original motivation for writing these modules was handling the locations of genes in eukaryotic genomes.

Strands

Handling forward and reverse-complement locations and sequences is a very common task in bioinformatics. The Bio.SeqLoc.Strand module handles strandedness in seqloc. It consists of a simple Strand enumerated data type and a Stranded typeclass for any object that has a strandedness and can therefore be meaningfully reverse complemented. String-like objects such as String itself and the various ByteString types instantiate Stranded by reversing the string and complementing nucleotide characters.

Prelude> :set prompt "ghci> "
ghci> :set -XOverloadedStrings
ghci> import Bio.SeqLoc.Strand
ghci> revCompl "GAttACA"
"TGTaaTC"
ghci> stranded Plus "GATTaca"
"GATTaca"
ghci> stranded Minus "GATTaca"
"tgtAATC"
ghci> stranded (revCompl Plus) "GATTaca"
"tgtAATC"

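To make the String behavior above concrete, here is a minimal sketch of case-preserving reverse complementation. It is not the library's implementation: the names complChar, revComplStr and strandedStr are hypothetical, the Strand type is a local stand-in, and the sketch only complements A, C, G and T (how seqloc itself treats other characters is not covered here).

```haskell
import Data.Char (isUpper, toLower, toUpper)

-- Local stand-in for the library's Strand type (assumption).
data Strand = Plus | Minus deriving (Eq, Show)

-- Complement a single nucleotide, preserving its case.
-- Characters other than a/c/g/t pass through unchanged.
complChar :: Char -> Char
complChar c
  | isUpper c = toUpper (go (toLower c))
  | otherwise = go c
  where
    go 'a' = 't'
    go 't' = 'a'
    go 'c' = 'g'
    go 'g' = 'c'
    go x   = x

-- Reverse complement: complement every character, then reverse.
revComplStr :: String -> String
revComplStr = reverse . map complChar

-- Orient a sequence on the given strand: Plus leaves it alone,
-- Minus reverse complements it.
strandedStr :: Strand -> String -> String
strandedStr Plus  = id
strandedStr Minus = revComplStr
```

With these definitions, revComplStr "GAttACA" evaluates to "TGTaaTC" and strandedStr Minus "GATTaca" to "tgtAATC", matching the session above.
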
Positions and Offsets

ghci> import Bio.SeqLoc.Position

In the seqloc package, an Offset is a 0-based index into a sequence and a Position is an Offset plus a Strand indicating the strand on which the position occurs in the sequence.

Displaying Positions and Locations

ghci> import Bio.SeqLoc.LocRepr
ghci> :m +Data.ByteString.Char8

The LocRepr typeclass provides an interface for representing position and location data types in a format that is easy both to read and to parse. There are two basic functions: repr, which produces a string representation, and unrepr, an extremely lightweight Data.Attoparsec.Zepto parser for that string representation. There are also helper functions that wrap the parser and handle errors in different ways.

ghci> repr (Pos 99 Plus)
"99(+)"
ghci> (unreprEither  "99(-)") :: Either String Pos
Right (Pos {offset = Offset {unOffset = 99}, strand = Minus})
ghci> (unreprErr  "99(+)") :: Pos
Pos {offset = Offset {unOffset = 99}, strand = Plus}

While locations and positions have Show instances, their LocRepr instances have advantages for human and computer legibility in many contexts.
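
The round trip between a position and its text form can be sketched in a few lines. This is only an illustration of the "99(+)" format shown above, not the library's Zepto parser: SimplePos, reprPos and unreprPos are hypothetical names, and the sketch accepts non-negative offsets only.

```haskell
import Data.Char (isDigit)

-- Local stand-in for the library's Strand type (assumption).
data Strand = Plus | Minus deriving (Eq, Show)

-- Hypothetical simplified position: a 0-based offset plus a strand.
data SimplePos = SimplePos Int Strand deriving (Eq, Show)

-- Render a position as offset followed by "(+)" or "(-)".
reprPos :: SimplePos -> String
reprPos (SimplePos off str) = show off ++ "(" ++ strandChar str ++ ")"
  where
    strandChar Plus  = "+"
    strandChar Minus = "-"

-- Parse the same format back, reporting failures in Left.
unreprPos :: String -> Either String SimplePos
unreprPos s =
  case span isDigit s of
    ([], _)     -> Left ("no offset in " ++ show s)
    (ds, "(+)") -> Right (SimplePos (read ds) Plus)
    (ds, "(-)") -> Right (SimplePos (read ds) Minus)
    (_, rest)   -> Left ("bad strand suffix " ++ show rest)
```

Returning Either mirrors the unreprEither helper; a variant that calls error on Left would correspond to unreprErr.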

Contiguous Locations

ghci> import Bio.SeqLoc.Location as Loc

The ContigLoc type represents a contiguous sequence location, such as the forward strand from nucleotides 100 to 150, or the reverse complement strand from nucleotides 1000 to 800. These locations can be created by specifying their bounds and strand.

ghci> repr $ Loc.fromBoundsStrand 100 150 Plus
"100to150(+)"
ghci> repr $ Loc.fromBoundsStrand 800 1000 Minus
"800to1000(-)"

They can also be specified with a starting position, which will be the beginning of the location in its strand, and a length.

ghci> repr $ Loc.fromPosLen (unreprErr "100(+)") 51
"100to150(+)"
ghci> repr $ Loc.fromPosLen (unreprErr "1000(-)") 201
"800to1000(-)"

Finally, they can be specified from their starting and ending position, in which case the strand is deduced from the order of the two positions.

ghci> repr $ Loc.fromStartEnd 100 150
"100to150(+)"
ghci> repr $ Loc.fromStartEnd 1000 800
"800to1000(-)"

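The three constructors differ only in how they arrive at the same bounds and strand. The sketch below spells out that arithmetic on a hypothetical SimpleLoc record with inclusive bounds; the primed function names are illustrative stand-ins rather than the library's code, and treating start == end as a forward-strand location is an assumption.

```haskell
-- Local stand-in for the library's Strand type (assumption).
data Strand = Plus | Minus deriving (Eq, Show)

-- Hypothetical simplified contiguous location: inclusive 0-based bounds
-- (lower bound first) and a strand.
data SimpleLoc = SimpleLoc { lo :: Int, hi :: Int, locStrand :: Strand }
  deriving (Eq, Show)

-- Bounds and strand are given directly.
fromBoundsStrand' :: Int -> Int -> Strand -> SimpleLoc
fromBoundsStrand' = SimpleLoc

-- A start position and a length: on the minus strand the location
-- extends towards lower offsets from its start.
fromPosLen' :: (Int, Strand) -> Int -> SimpleLoc
fromPosLen' (start, Plus)  len = SimpleLoc start (start + len - 1) Plus
fromPosLen' (start, Minus) len = SimpleLoc (start - len + 1) start Minus

-- Start and end offsets: the strand follows from their order.
fromStartEnd' :: Int -> Int -> SimpleLoc
fromStartEnd' start end
  | start <= end = SimpleLoc start end Plus
  | otherwise    = SimpleLoc end start Minus
```

Under these definitions, fromPosLen' (1000, Minus) 201 and fromStartEnd' 1000 800 both yield bounds 800 and 1000 on the minus strand, the same location as "800to1000(-)" above.
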
The ContigLoc type is an instance of the Location typeclass, which provides numerous useful functions. It's important to remember that the starting position of a (-) strand location has a higher offset than the ending position.

ghci> let l = Loc.fromStartEnd 1000 800
ghci> let l' = revCompl l
ghci> Loc.bounds l
(Offset {unOffset = 800},Offset {unOffset = 1000})
ghci> repr $ Loc.startPos l
"1000(-)"
ghci> repr $ Loc.endPos l
"800(-)"
ghci> Loc.bounds l'
(Offset {unOffset = 800},Offset {unOffset = 1000})
ghci> repr $ Loc.startPos l'
"800(+)"
ghci> repr $ Loc.endPos l'
"1000(+)"

The Location typeclass also allows us to convert between a position in absolute coordinates and a position relative to a location. The posInto function converts an absolute position to a location-relative position. It may fail, with Nothing, if the position is outside the location.

ghci> let p = Pos 850 Plus
ghci> maybe "n/a" repr $ Loc.posInto p l
"150(-)"
ghci> maybe "n/a" repr $ Loc.posInto p l'
"50(+)"
ghci> let p2 = Pos 750 Plus
ghci> maybe "n/a" repr $ Loc.posInto p2 l
"n/a"

The posOutof function pulls a location-relative position back out of the location.

ghci> let q = Pos 120 Plus
ghci> maybe "n/a" repr $ Loc.posOutof q l
"880(-)"
ghci> maybe "n/a" repr $ Loc.posOutof q l'
"920(+)"
ghci> let q2 = Pos 220 Plus
ghci> maybe "n/a" repr $ Loc.posOutof q2 l'
"n/a"

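The coordinate arithmetic behind posInto and posOutof for a contiguous location is short enough to write out. The following is a sketch under simplifying assumptions, not the library's code: positions are plain (offset, strand) pairs, SimpleLoc holds inclusive bounds, and posIntoSketch and posOutofSketch are hypothetical names.

```haskell
-- Local stand-in for the library's Strand type (assumption).
data Strand = Plus | Minus deriving (Eq, Show)

flipStrand :: Strand -> Strand
flipStrand Plus  = Minus
flipStrand Minus = Plus

-- Hypothetical contiguous location: inclusive lower bound, upper bound, strand.
data SimpleLoc = SimpleLoc Int Int Strand

-- Absolute position to location-relative position.
-- On a minus-strand location, offsets count down from the upper bound
-- and the strand of the position is flipped.
posIntoSketch :: (Int, Strand) -> SimpleLoc -> Maybe (Int, Strand)
posIntoSketch (off, str) (SimpleLoc lo hi locStr)
  | off < lo || off > hi = Nothing
  | locStr == Plus       = Just (off - lo, str)
  | otherwise            = Just (hi - off, flipStrand str)

-- Location-relative position back to absolute coordinates.
posOutofSketch :: (Int, Strand) -> SimpleLoc -> Maybe (Int, Strand)
posOutofSketch (rel, str) (SimpleLoc lo hi locStr)
  | rel < 0 || rel > hi - lo = Nothing
  | locStr == Plus           = Just (lo + rel, str)
  | otherwise                = Just (hi - rel, flipStrand str)
```

For the minus-strand location with bounds 800 and 1000, posIntoSketch (850, Plus) gives Just (150, Minus) and posOutofSketch (120, Plus) gives Just (880, Minus), in line with "150(-)" and "880(-)" above.
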
Entire locations can be mapped back in a similar way. Here we take a sub-location covering relative offsets 100 through 150 within the enclosing location of 1000 to 800, map the sub-location back to its absolute coordinates, and then find its relative coordinates within the reverse-complement location l'.

ghci> let k = Loc.fromStartEnd 100 150
ghci> maybe "n/a" repr $ Loc.clocOutof k l
"850to900(-)"
ghci> maybe "n/a" repr $ Loc.clocInto (unreprErr "850to900(-)") l'
"50to100(-)"

Sequence Data

ghci> import Bio.SeqLoc.SeqLike as SeqLike

The SeqLike typeclass in the Bio.SeqLoc.SeqLike module has a simple interface to allow the extraction of subsequences based on locations. There are instances for String and for lazy and strict ByteString types. Recall that offsets are all 0-based indices and that fromStartEnd creates a location that includes both endpoints.

ghci> Loc.seqData "GATTACA" (Loc.fromStartEnd 2 4)
Just "TTA"
ghci> Loc.seqData "GATTACA" (Loc.fromStartEnd 2 8)
Nothing
ghci> Loc.seqDataPad "GATTACA" (Loc.fromStartEnd 2 8)
"TTACANN"
ghci> Loc.seqDataPad "GATTACA" (Loc.fromStartEnd (-2) 4)
"NNGATTA"
ghci> Loc.seqDataPad "GATTACA" (Loc.fromStartEnd 6 0)
"TGTAATC"

The instances for String and lazy ByteString avoid evaluating the full sequence whenever possible, but the use of functions such as length will force its evaluation.
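
As an illustration of strict versus padded extraction on a plain String, here is a small forward-strand sketch. The names baseAt, window and windowPad are hypothetical and this is not how the SeqLike instances are written; a minus-strand location would additionally reverse complement the extracted window.

```haskell
-- Base at a 0-based offset, or 'N' when the offset falls off either end.
baseAt :: String -> Int -> Char
baseAt sq i
  | i < 0     = 'N'
  | otherwise =
      case drop i sq of
        (c:_) -> c
        []    -> 'N'

-- Padded extraction of the inclusive window lo..hi: always returns
-- hi - lo + 1 characters, using 'N' outside the sequence.
windowPad :: String -> Int -> Int -> String
windowPad sq lo hi = [ baseAt sq i | i <- [lo .. hi] ]

-- Strict extraction: Nothing unless the whole window lies on the sequence.
-- Only take and drop touch the sequence, so just the prefix up to hi
-- is ever evaluated.
window :: String -> Int -> Int -> Maybe String
window sq lo hi
  | lo < 0 || hi < lo         = Nothing
  | length got == hi - lo + 1 = Just got
  | otherwise                 = Nothing
  where
    got = take (hi - lo + 1) (drop lo sq)
```

Because window never asks for the length of the full sequence, it also works on an unbounded lazy input: window (cycle "GATTACA") 2 4 evaluates to Just "TTA".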

Spliced Locations

ghci> import Bio.SeqLoc.SpliceLocation as SpLoc

The SpliceLoc type in the Bio.SeqLoc.SpliceLocation module provides spliced locations, designed to model the structure of eukaryotic genes as a series of individual ContigLoc locations lying in order on the same strand.

ghci> maybe "n/a" repr $ SpLoc.fromContigs [ Loc.fromStartEnd 100 150, Loc.fromStartEnd 200 250 ]
"100to150(+);200to250(+)"
ghci> maybe "n/a" repr $ SpLoc.fromContigs [ Loc.fromStartEnd 100 150, Loc.fromStartEnd 250 200 ]
"n/a"
ghci> maybe "n/a" repr $ SpLoc.fromContigs [ Loc.fromStartEnd 100 150, Loc.fromStartEnd 50 80 ]
"n/a"

SpliceLoc implements the Location interface as well. When pulling a contiguous location out of a spliced location, the result may also be spliced. When pushing a contiguous location into a spliced location, it must fit entirely within a single segment of the spliced location.

ghci> let (Just s) = SpLoc.fromContigs [ Loc.fromStartEnd 100 150, Loc.fromStartEnd 200 250 ]
ghci> maybe "n/a" repr $ Loc.clocOutof (Loc.fromStartEnd 25 75) s
"125to150(+);200to224(+)"
ghci> maybe "n/a" repr $ Loc.clocInto (Loc.fromStartEnd 210 240) s
"61to91(+)"
ghci> maybe "n/a" repr $ Loc.clocInto (Loc.fromStartEnd 190 240) s
"n/a"

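The offsets in these results follow from walking the segments in order and keeping a running total of their lengths. A sketch of that bookkeeping for forward-strand segments, using hypothetical names (Segment, offsetOutof, offsetInto) and plain Int offsets rather than the library's types:

```haskell
-- A forward-strand segment as inclusive (lower, upper) bounds.
type Segment = (Int, Int)

segLen :: Segment -> Int
segLen (lo, hi) = hi - lo + 1

-- Relative offset within the spliced location to absolute offset:
-- skip whole segments until the offset falls inside one.
offsetOutof :: Int -> [Segment] -> Maybe Int
offsetOutof rel _ | rel < 0 = Nothing
offsetOutof _ [] = Nothing
offsetOutof rel (seg@(lo, _) : rest)
  | rel < segLen seg = Just (lo + rel)
  | otherwise        = offsetOutof (rel - segLen seg) rest

-- Absolute offset to relative offset: accumulate the lengths of the
-- segments before the one that contains the offset.
-- Offsets that fall between segments map to Nothing.
offsetInto :: Int -> [Segment] -> Maybe Int
offsetInto ab = go 0
  where
    go _ [] = Nothing
    go acc (seg@(lo, hi) : rest)
      | ab >= lo && ab <= hi = Just (acc + ab - lo)
      | otherwise            = go (acc + segLen seg) rest
```

With the segments [(100, 150), (200, 250)], relative offset 75 maps out to 224 and absolute offset 210 maps in to 61, while 190 lies in the gap and maps to Nothing, which is why the location from 190 to 240 cannot be pushed into s.
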
The module also provides specialized functions for finding the coordinates of spliced locations relative to an enclosing spliced location. These will properly merge locations whose relative coordinates are adjacent:

ghci> let (Just u) = Loc.clocOutof (Loc.fromStartEnd 20 81) s
ghci> repr u
"120to150(+);200to230(+)"
ghci> let (Just t) = SpLoc.fromContigs [ Loc.fromStartEnd 120 150, Loc.fromStartEnd 200 230 ]
ghci> maybe "n/a" repr $ SpLoc.locInto t s
"20to81(+)"
ghci> let (Just t') = SpLoc.fromContigs [ Loc.fromStartEnd 120 150, Loc.fromStartEnd 210 230 ]
ghci> maybe "n/a" repr $ SpLoc.locInto t' s
"20to50(+);61to81(+)"

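The merging step can be pictured as collapsing neighbouring intervals whose relative coordinates touch. A minimal sketch on inclusive (lower, upper) pairs, with mergeAdjacent as a hypothetical name rather than a function exported by the module:

```haskell
-- Merge consecutive inclusive intervals when the next one starts
-- immediately after the previous one ends.
mergeAdjacent :: [(Int, Int)] -> [(Int, Int)]
mergeAdjacent ((a, b) : (c, d) : rest)
  | c == b + 1 = mergeAdjacent ((a, d) : rest)
  | otherwise  = (a, b) : mergeAdjacent ((c, d) : rest)
mergeAdjacent xs = xs
```

Applied to [(20, 50), (51, 81)] this yields [(20, 81)], whereas [(20, 50), (61, 81)] is left as two intervals, mirroring the two locInto results above.
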
Named Sequences

ghci> import Bio.SeqLoc.OnSeq
ghci> import qualified Data.ByteString.Char8 as BS

Genome annotation data files typically express the location of a gene as a (possibly spliced) location on one of several chromosomes. The OnSeq type in the Bio.SeqLoc.OnSeq module allows position and location types to be tagged with sequence names. The module provides useful type synonyms for the named location data types.

ghci> let z = OnSeq (toSeqLabel "chr1") (Loc.fromStartEnd 10000 20000)
ghci> repr z
"chr1@10000to20000(+)"
ghci> repr $ revCompl z
"chr1@10000to20000(-)"