Welcome to cgr_view’s documentation!
cgr
A module for creating, saving and drawing k-mer matrices and Chaos Game Representations (CGRs) of nucleotide sequences
Prerequisites
- Jellyfish
An external program for counting k-mers. Must be accessible on the path. You can install from conda as follows:
conda install -c bioconda jellyfish
Quickstart
Input fasta file, get cgr
- one cgr for each entry in the fasta file
cgr.from_fasta("my_seqs.fa", outfile = "my_cgrs", k = 7)
- just one cgr with all entries in the fasta file (eg for genomes and contigs)
cgr.from_fasta("my_genome.fa", outfile = "genome_cgr", k = 7, as_single = True)
Workflow:
- make kmer count db in Jellyfish from fasta -> generate cgr from db.
- optionally merge cgrs into single cgr as separate channels
- stack all composed cgrs into an array of cgrs
- save as numpy binary (.npy) files
Usage:
Import module
import cgr
Make kmer count db
cgr.run_jellyfish("test_data/NC_012920.fasta", 11, "11mer.jf")
cgr.run_jellyfish("test_data/NC_012920.fasta", 10, "10_mer.jf")
Load CGRs from kmer count db
cgr1 = cgr.cgr_matrix("/Users/macleand/Desktop/athal-5-mers.jf")
cgr2 = cgr.cgr_matrix("test_data/five_mer.jf")
Draw a cgr and save to file
- just one cgr, can choose colour (value of ‘h’) and which channel to put cgr in
cgr.draw_cgr(cgr1, h = 0.64, v = 1.0, out = "my_cgr.png", resize = 1000, main = "s" )
- two cgrs, first in tuple goes in ‘h’, second goes in ‘s’. Can set ‘v’
cgr.draw_cgr( (cgr1, cgr1), v = 1.0, out = "two_cgrs.png")
- three cgrs: ‘h’, ‘s’ and ‘v’ are assigned in tuple order
cgr.draw_cgr( (cgr1, cgr1, cgr1) )
Save a single cgr into a text file
cgr.save_as_csv(cgr1, file = "out.csv")
Join n cgrs into one, extending the number of channels …
merged_cgr = cgr.join_cgr( (cgr1, cgr2, ... ) )
Write to file (numpy binary)
cgr.save_cgr(merged_cgr, outfile = "my_cgr")
-
cgr.
blocky_scale
(im: numpy.ndarray, nR: int, nC: int) → numpy.ndarray¶ Upscales an array in preparation for drawing. By default the array is a square sqrt(4 ** k) (that is, 2 ** k) cells wide and high. For many values of k this is too small to view well on a monitor. This function performs a scale operation that enlarges the image by simply increasing the number of pixels in each square.
Param: im numpy.ndarray – the image to be scaled Param: nR int – the number of height pixels to be in the final image Param: nC int – the number of width pixels to be in the final image Returns: numpy.ndarray – upscaled image
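As a rough illustration of the block upscaling described above, the sketch below repeats each cell of a small grid so that it covers a larger square of pixels. It uses plain nested lists rather than numpy, `block_upscale` is a hypothetical name and not the module's implementation, and it assumes the target size is an exact multiple of the source size.

```python
def block_upscale(grid, n_rows, n_cols):
    """Nearest-neighbour upscaling: each source cell becomes a solid block."""
    src_rows, src_cols = len(grid), len(grid[0])
    # Assumption: the target size is an exact multiple of the source size.
    if n_rows % src_rows or n_cols % src_cols:
        raise ValueError("target size must be a multiple of the source size")
    row_factor = n_rows // src_rows
    col_factor = n_cols // src_cols
    scaled = []
    for row in grid:
        # Stretch the row horizontally by repeating each value.
        wide_row = [value for value in row for _ in range(col_factor)]
        # Stretch vertically by repeating the widened row.
        scaled.extend(list(wide_row) for _ in range(row_factor))
    return scaled


small = [[1, 2],
         [3, 4]]
big = block_upscale(small, 4, 4)
```

Each original cell ends up as a 2 x 2 block, so the picture grows without any interpolation between neighbouring k-mer counts.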
-
cgr.
cgr_matrix
(jellyfish: str) → scipy.sparse.dok.dok_matrix¶ Main function, creates the cgr matrix, a sparse matrix of type scipy.sparse.dok_matrix
Runs the cgr process on a jellyfish file and returns a scipy.sparse.dok_matrix object of the CGR with dtype int32. Only observed kmers are represented; absent coordinates mean 0 counts for the kmer at that coordinate.
Param: jellyfish str – jellyfish DB file Returns: scipy.sparse.dok_matrix – sparse matrix of kmer counts
-
cgr.
draw
(rgb: numpy.ndarray) → None¶ renders RGB array on the screen.
Param: rgb numpy.ndarray – RGB channel image
-
cgr.
draw_cgr
(cgr_matrices: scipy.sparse.dok.dok_matrix, h: float = 0.8, s: float = 0.5, v: float = 1.0, main: str = 's', show: bool = True, write: bool = True, out: str = 'cgr.png', resize: bool = False) → None¶ Draws cgrs to a file. Allows the user to set which of up to 3 provided cgr matrices goes into which of the H, S or V image channels. Typically, for one cgr, set h to specify the image colour and place the cgr in s so that the colour changes according to the counts in the cgr. Set v to 1.0 for maximum brightness.
Param: cgr_matrices scipy.sparse.dok_matrix or tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Tuple provides order for HSV channels of image. Param: h float – (0..1) value for h channel if not used for cgr data Param: s float – (0..1) value for s channel if not used for cgr data Param: v float – (0..1) value for v channel if not used for cgr data Param: main str – the channel to place the cgr matrix in if a single cgr matrix is passed Param: show bool – render CGR picture to screen Param: write – write CGR picture to file Param: out str – filename to write to Param: resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height Returns: None
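To show why placing counts in the s channel with a fixed h and v changes the colour intensity, here is a minimal sketch using the standard library `colorsys` module. It assumes the count has already been scaled to 0..1; `count_to_rgb` is a hypothetical helper and not part of cgr.

```python
import colorsys


def count_to_rgb(scaled_count, h=0.8, v=1.0):
    """Map a count scaled to 0..1 onto the S channel of an HSV colour."""
    # The fixed hue picks the colour, the fixed value keeps full brightness,
    # and the count controls how saturated the pixel is.
    return colorsys.hsv_to_rgb(h, scaled_count, v)


absent = count_to_rgb(0.0)           # zero saturation gives white at v = 1.0
frequent = count_to_rgb(1.0, h=0.0)  # full saturation at hue 0 gives pure red
```

Under these assumptions, rare k-mers appear pale and frequent k-mers appear strongly coloured.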
-
cgr.
draw_single_cgr
(cgr_matrix, h=0.8, s=0.5, v=1.0, main='s', show=True, write=True, out='cgr.png', resize=False)¶ draws a single cgr image, selecting channels and resizing as appropriate
Param: cgr_matrix scipy.sparse.dok_matrix to be drawn. Param: h float – (0..1) value for h channel if not used for cgr data Param: s float – (0..1) value for s channel if not used for cgr data Param: v float – (0..1) value for v channel if not used for cgr data Param: main str – the channel to place the cgr matrix in Param: show bool – render CGR picture to screen Param: write – write CGR picture to file Param: out str – filename to write to Param: resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height Returns: None
-
cgr.
draw_three_cgrs
(cgr_matrices, show=True, write=True, out='cgr.png', resize=False)¶ Draws a tuple of 3 cgr matrices as an image
Param: cgr_matrices tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Tuple provides order for HSV channels of image Param: show bool – render CGR picture to screen Param: write – write CGR picture to file Param: out str – filename to write to Param: resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height Returns: None
-
cgr.
draw_two_cgrs
(cgr_matrices, v=1.0, show=True, write=True, out='cgr.png', resize=False)¶ Draws two cgr matrices into a single image. The first matrix of the tuple becomes the h channel, the second becomes the s channel.
Param: cgr_matrices tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Param: v float – (0..1) value for v channel Param: show bool – render CGR picture to screen Param: write – write CGR picture to file Param: out str – filename to write to Param: resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height Returns: None
-
cgr.
estimate_genome_size
(fasta: str) → int¶ Guesses genome size from fasta file size, assumes 1 byte ~= 1 base
Param: fasta str – a fasta file Returns: int – approximate genome size in nucleotides
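The 1 byte ~= 1 base heuristic can be sketched with the standard library alone. `estimate_size_from_file` is a hypothetical name, and the tiny FASTA content below is made up for the demonstration.

```python
import os
import tempfile


def estimate_size_from_file(path):
    """Approximate sequence length as the file size in bytes."""
    # Header lines and newlines are counted too, so this overestimates slightly.
    return os.path.getsize(path)


# Write a tiny FASTA file in binary mode so the byte count is predictable.
with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as handle:
    handle.write(b">seq1\nACGTACGT\n")
    fasta_path = handle.name

approx_size = estimate_size_from_file(fasta_path)
os.remove(fasta_path)
```

Here the file holds 8 bases but is 15 bytes long, which shows that the estimate is only approximate for short sequences and becomes tighter as sequence lines dominate the file.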
-
cgr.
from_fasta
(fasta_file: str, outfile: str = 'my_cgrs', as_single: bool = False, k: int = 7) → None¶ Factory function to load in a FASTA file and generate a binary .npy of CGRs
Parameters: - fasta_file – str FASTA file to load
- outfile – str outfile to save
- as_single – bool If True, treats all entries as a single sequence and produces one CGR. If False, treats each entry individually and produces many CGRs
- k – int length of kmer to use
Returns: None
-
cgr.
get_coord
(kmer: str) → List[int]¶ given a kmer gets the coordinates of the box position in the cgr grid, returns as list [x,y] of coordinates
Param: kmer str – a string of nucleotides Returns: coords [x,y] – the x,y positions of the nucleotides in the cgr
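One way to picture the mapping from a kmer to a box position is the sketch below. The corner layout, the choice that the first base selects the largest quadrant, and the name `kmer_to_coord` are all assumptions made for illustration; the module may use a different layout.

```python
# Assumption: this corner layout is illustrative only; the module may
# assign nucleotides to different corners.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_coord(kmer):
    """Return [x, y] of the cell for kmer in a 2**k by 2**k grid."""
    x = y = 0
    for base in kmer:
        bit_x, bit_y = CORNERS[base]
        # Each base halves the current square and selects one quadrant.
        x = 2 * x + bit_x
        y = 2 * y + bit_y
    return [x, y]


coord_aa = kmer_to_coord("AA")
coord_gg = kmer_to_coord("GG")
coord_at = kmer_to_coord("AT")
```

Every kmer of length k lands in a distinct cell, so kmers sharing a prefix end up in the same quadrant under this layout.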
-
cgr.
get_grid_size
(k: int) → int¶ returns the grid size (total number of elements) for a cgr of k length kmers
Param: k int – the value of k to be used Returns: int – the total number of elements in the grid
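The arithmetic behind the grid size can be sketched as follows, assuming one cell per possible kmer over the four nucleotides. The function names are hypothetical.

```python
def grid_size(k):
    """Total number of cells: one per possible k-mer over A, C, G, T."""
    return 4 ** k


def grid_side(k):
    """Width (and height) of the square grid, since 4**k == (2**k) ** 2."""
    return 2 ** k


cells = grid_size(7)
side = grid_side(7)
```

For k = 7 this gives 16384 cells arranged as a 128 x 128 square, which is why small values of k produce images that are too small to view without resizing.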
-
cgr.
get_k
(jellyfish: str) → int¶ asks the jellyfish file what value was used for k
Param: jellyfish str – jellyfish DB file Returns: int – length of k used
-
cgr.
get_kmer_list
(jellyfish: str) → Generator[List[T], str, None]¶ runs jellyfish dump on a Jellyfish DB. Captures output as a generator stream. Each item returned is a list [kmer: str, count: str]
Param: jellyfish str – a Jellyfish DB file Returns: Generator – a list of [kmer string, times_kmer_seen]
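A generator of this kind might look like the sketch below. It parses lines in the two-column text layout produced by `jellyfish dump -c` but reads them from a list rather than from a subprocess, so it runs without Jellyfish installed. `parse_dump_lines` is a hypothetical name and the sample data is made up.

```python
def parse_dump_lines(lines):
    """Yield [kmer, count] pairs from lines of 'KMER COUNT' text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        kmer, count = line.split()
        # The count is left as a string, matching the description above.
        yield [kmer, count]


# Made-up sample in the two-column layout of `jellyfish dump -c`.
sample = ["AAAAA 12", "ACGTA 3", ""]
pairs = list(parse_dump_lines(sample))
```

Because it is a generator, items are produced one at a time, which matters when the dump contains millions of kmers.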
-
cgr.
get_max_count
(jellyfish) → int¶ estimates the count of the most represented kmer in the jellyfish file by using the last bucket of the
Param: jellyfish – jellyfish DB file Returns: int – estimated count of the most represented kmer
-
cgr.
is_cgr_matrix
(obj) → bool¶ returns true if obj is a scipy.sparse.dok.dok_matrix object
-
cgr.
join_cgr
(cgrs: tuple) → numpy.ndarray¶ Takes tuple of cgrs of shape (n,n) and returns one stacked array of size (n,n, len(cgrs) )
Param: cgrs tuple – tuple of cgrs to be joined Returns: numpy.ndarray
-
cgr.
load_npy
(file: str) → numpy.ndarray¶ loads numpy .npy file as ndarray. Useful for restoring collections of cgrs but resulting array is not compatible directly with drawing methods here.
Param: file str – numpy .npy file to load Returns: numpy.ndarray
-
cgr.
make_blanks_like
(a: scipy.sparse.dok.dok_matrix, h: float = 1.0, s: float = 1.0, v: float = 1.0) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶ returns tuple of numpy.ndarrays with default values of h,s and v of shape of a
Param: a scipy.sparse.dok_matrix – a cgr matrix to make blanks like Param: h float – the values with which to fill the first numpy.ndarray Param: s float – the values with which to fill the second numpy.ndarray Param: v float – the values with which to fill the third numpy.ndarray Returns: Tuple of numpy.ndarray
-
cgr.
many_seq_record_to_many_cgr
(seq_record: Bio.SeqIO.FastaIO, k: int) → scipy.sparse.dok.dok_matrix¶ Parameters: - seq_record – Bio.SeqIO FASTA record
- k – int size of k to use
Returns: scipy.sparse.dok_matrix
-
cgr.
many_seq_record_to_one_cgr
(fa_file: str, k: int) → scipy.sparse.dok.dok_matrix¶ Reads many sequence records in a FASTA file into a single CGR matrix, treating all sequence records as if they were one sequence, e.g. the chromosomes of a genome.
Param: fa_file str – FASTA file name Param: k int – length of k to use Returns: scipy.sparse.dok_matrix
-
cgr.
resize_rgb_out
(rgb: numpy.ndarray, resize: int) → numpy.ndarray¶ given an rgb image in one pixel per kmer size, increases size so that the resulting image is resize * resize pixels
Param: rgb numpy.ndarray – an RGB image array Param: resize – pixel width (and therefore height) of resulting image Returns: numpy.ndarray – resized image with shape (resize, resize)
-
cgr.
run_jellyfish
(fasta: str, k: int, out: str) → int¶ runs Jellyfish on fasta file using k kmer size, produces Jellyfish db file as side effect.
Param: fasta str – a fasta file Param: k int – size of kmers to use Param: out str – file in which to save kmer db Returns: int – return code of Jellyfish subprocess
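As a sketch of how such a wrapper could be put together, the code below assembles a `jellyfish count` command line and, in a separate function, runs it and returns the subprocess return code. The flags `-m`, `-s` and `-o` are standard `jellyfish count` options; the `hash_size` default and both function names are assumptions, and the module may pass different options.

```python
import subprocess


def build_count_command(fasta, k, out, hash_size="100M"):
    """Assemble a `jellyfish count` command line as a list of arguments."""
    # hash_size is a hypothetical default; choose it to suit the input size.
    return ["jellyfish", "count", "-m", str(k), "-s", hash_size, "-o", out, fasta]


def run_count(fasta, k, out):
    """Run the command and return the subprocess return code."""
    completed = subprocess.run(build_count_command(fasta, k, out))
    return completed.returncode


command = build_count_command("test_data/NC_012920.fasta", 11, "11mer.jf")
```

Calling `run_count` requires Jellyfish on the path; a return code of 0 conventionally indicates success.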
-
cgr.
save_as_csv
(cgr_matrix: scipy.sparse.dok.dok_matrix, file: str = 'cgr_matrix.csv', delimiter: str = ', ', fmt: str = '%d')¶ Writes simple 1 channel cgr matrix to CSV file.
See also numpy.savetxt
Param: cgr_matrix scipy.sparse.dok_matrix – cgr_matrix to save Param: file str – filename to write to Param: delimiter str – column separator character Param: fmt str – text format string Returns: None
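To illustrate what writing a sparse single-channel matrix as CSV involves, the sketch below expands a dictionary of keys into dense rows, filling unobserved cells with 0. It uses the standard library `csv` module and a plain dict in place of a dok_matrix; `sparse_to_csv` is a hypothetical name and the counts are made up.

```python
import csv
import io


def sparse_to_csv(sparse, side, handle):
    """Write a dict-of-keys grid as dense rows, filling absent cells with 0."""
    writer = csv.writer(handle, lineterminator="\n")
    for row in range(side):
        writer.writerow(sparse.get((row, col), 0) for col in range(side))


# A 2 x 2 grid with two observed cells; the counts are made up.
sparse_counts = {(0, 0): 5, (1, 1): 2}
buffer = io.StringIO()
sparse_to_csv(sparse_counts, 2, buffer)
csv_text = buffer.getvalue()
```

The dense text form grows as 4 ** k cells, so for larger k the .npy format is likely to be more practical.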
-
cgr.
save_cgr
(cgr_obj: numpy.ndarray, outfile: str = 'cgr') → None¶ Saves cgr_obj as a numpy .npy file. cgr_obj is a numpy.ndarray of one or more dimensions. It is saved as an ndarray, not a dok_matrix, so it can be loaded in regular numpy as a collection of cgrs.
Parameters: - cgr_obj – numpy.ndarray constructed cgr_object to save
- outfile – str file
Returns: None
-
cgr.
scale_cgr
(cgr_matrix: scipy.sparse.dok.dok_matrix) → scipy.sparse.dok.dok_matrix¶ returns scaled version of cgr_matrix in range 0..1
Param: cgr_matrix scipy.sparse.dok_matrix – matrix to scale Returns: scaled scipy.sparse.dok_matrix
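A minimal sketch of scaling counts into the range 0..1 follows. It assumes scaling by the maximum count, which may differ from the module's method, and it uses a plain dict keyed by coordinate in place of a dok_matrix. `scale_counts` is a hypothetical name.

```python
def scale_counts(sparse):
    """Scale a dict-of-keys count grid to 0..1 by dividing by the maximum."""
    if not sparse:
        return {}
    largest = max(sparse.values())
    # Absent keys stay absent, so they still read as 0 after scaling.
    return {coord: count / largest for coord, count in sparse.items()}


scaled = scale_counts({(0, 0): 5, (1, 1): 10})
```

Scaled values of this kind are what an HSV channel expects, since each channel takes values between 0 and 1.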
-
cgr.
stack_cgrs
(cgr_matrices: Tuple) → numpy.ndarray¶ stacks a tuple of N numpy.ndarrays of shape (w,h) and returns a single ndarray of shape (w,h,N)
Parameters: cgr_matrices – tuple of cgr_matrices Returns: numpy.ndarray
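The shape change from N grids of (w,h) to one grid of (w,h,N) can be sketched with nested lists, as below. With numpy arrays the equivalent operation is `numpy.stack` along the last axis. `stack_grids` is a hypothetical name and not the module's implementation.

```python
def stack_grids(grids):
    """Combine N grids of shape (w, h) into one nested list of shape (w, h, N)."""
    first = grids[0]
    rows, cols = len(first), len(first[0])
    if any(len(g) != rows or len(g[0]) != cols for g in grids):
        raise ValueError("all grids must share the same shape")
    # Cell [i][j] of the result lists the values from every grid in order.
    return [[[g[i][j] for g in grids] for j in range(cols)] for i in range(rows)]


grid_a = [[1, 2], [3, 4]]
grid_b = [[10, 20], [30, 40]]
stacked = stack_grids((grid_a, grid_b))
```

Each cell of the result holds one value per input grid, which is what allows several cgrs to be treated as channels of one image.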
-
cgr.
write_out
(rgb: numpy.ndarray, out: str, resize: int) → None¶ writes RGB array as image
Parameters: - rgb – numpy.ndarray – RGB channel image
- out – str file to write to
- resize – bool or int. If False will not resize, if int will resize image up to that size
Returns: None