Welcome to cgr_view’s documentation!

cgr

A module for creating, saving and drawing k-mer matrices and Chaos Game Representations (CGRs) of nucleotide sequences

Prerequisites

  • Jellyfish

An external program for counting k-mers. Must be accessible on the path. You can install from conda as follows:

conda install -c bioconda jellyfish

Quickstart

  • Input fasta file, get cgr

    • one cgr for each entry in the fasta file
    cgr.from_fasta("my_seqs.fa", outfile = "my_cgrs", k = 7)
    
    • just one cgr with all entries in the fasta file (eg for genomes and contigs)
    cgr.from_fasta("my_genome.fa", outfile = "genome_cgr", k = 7, as_single = True)
    

Workflow:

  1. make kmer count db in Jellyfish from fasta -> generate cgr from db.
  2. optionally merge cgrs into single cgr as separate channels
  3. stack all composed cgrs into an array of cgrs
  4. save as numpy binary (.npy) files

Usage:

  1. Import module

    import cgr
    
  2. Make kmer count db

    cgr.run_jellyfish("test_data/NC_012920.fasta", 11, "11mer.jf")
    cgr.run_jellyfish("test_data/NC_012920.fasta", 10, "10_mer.jf")
    
  3. Load CGRs from kmer count db

    cgr1 = cgr.cgr_matrix("/Users/macleand/Desktop/athal-5-mers.jf")
    cgr2 = cgr.cgr_matrix("test_data/five_mer.jf")
    
  4. Draw a cgr and save to file

    • just one cgr, can choose colour (value of ‘h’) and which channel to put cgr in
    cgr.draw_cgr(cgr1, h = 0.64, v = 1.0, out = "my_cgr.png", resize = 1000, main = "s" )
    
    • two cgrs, first in tuple goes in ‘h’, second goes in ‘s’. Can set ‘v’
    cgr.draw_cgr( (cgr1, cgr1), v = 1.0, out = "two_cgrs.png")
    
    • three cgrs: ‘h’, ‘s’ and ‘v’ are assigned in tuple order
    cgr.draw_cgr( (cgr1, cgr1, cgr1) )
    
  5. Save a single cgr into a text file

    cgr.save_as_csv(cgr1, file = "out.csv")
    
  6. Join n cgrs into one, extending the number of channels …

    merged_cgr = cgr.join_cgr( (cgr1, cgr2, ... ) )
    
  7. Write to file (numpy binary)

    cgr.save_cgr(merged_cgr, outfile = "my_cgr")
    
  8. Input fasta file, get cgr
    • one cgr for each entry in the fasta file
    cgr.from_fasta("my_seqs.fa", outfile = "my_cgrs", k = 7)
    
    • just one cgr with all entries in the fasta file (eg for genomes and contigs)
    cgr.from_fasta("my_genome.fa", outfile = "genome_cgr", k = 7, as_single = True)
    
cgr.blocky_scale(im: numpy.ndarray, nR: int, nC: int) → numpy.ndarray

Upscales an array in preparation for drawing. By default the array is a square sqrt(4 ** k) (that is, 2 ** k) cells wide and high. For many values of k this is too small to view well on a monitor. This function performs a scale operation that enlarges the image by repeating each cell as a block of identical pixels.

Param:im numpy.ndarray – the image to be scaled
Param:nR int – the number of height pixels to be in the final image
Param:nC int – the number of width pixels to be in the final image
Returns:numpy.ndarray – upscaled image
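
The block-style upscaling described above can be sketched with plain NumPy indexing. This is a minimal illustration, not the module's own code: the function name blocky_upscale and the index arithmetic are assumptions chosen to show the idea of repeating each cell.

```python
import numpy as np


def blocky_upscale(im: np.ndarray, n_rows: int, n_cols: int) -> np.ndarray:
    """Nearest-neighbour upscale: each source cell becomes a block of pixels.

    Hypothetical helper, not the cgr.blocky_scale implementation.
    """
    src_rows, src_cols = im.shape[:2]
    # For every output pixel, work out which source row/column it falls in.
    row_idx = (np.arange(n_rows) * src_rows) // n_rows
    col_idx = (np.arange(n_cols) * src_cols) // n_cols
    return im[row_idx][:, col_idx]


small = np.array([[1, 2],
                  [3, 4]])
big = blocky_upscale(small, 4, 4)
print(big)
```

Because only indices are repeated, no interpolation takes place and the counts in each cell are preserved exactly.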
cgr.cgr_matrix(jellyfish: str) → scipy.sparse.dok.dok_matrix

Main function, creates the cgr matrix, a sparse matrix of type scipy.sparse.dok_matrix

Runs the cgr process on a jellyfish file and returns a scipy.sparse.dok_matrix object of the CGR with dtype int32. Only observed kmers are represented; absent coordinates mean a count of 0 for the kmer at that coordinate.

Param:jellyfish str – jellyfish DB file
Returns:scipy.sparse.dok_matrix – sparse matrix of kmer counts
cgr.draw(rgb: numpy.ndarray) → None

renders RGB array on the screen.

Param:rgb numpy.ndarray – RGB channel image
cgr.draw_cgr(cgr_matrices: scipy.sparse.dok.dok_matrix, h: float = 0.8, s: float = 0.5, v: float = 1.0, main: str = 's', show: bool = True, write: bool = True, out: str = 'cgr.png', resize: bool = False) → None

Draws cgrs to a file. Allows the user to set which of up to 3 provided cgr matrices goes into which of the H, S or V image channels. Typically, for a single cgr, set h to choose the image colour and place the cgr in the s channel (main = 's') so that the colour varies with the counts in the cgr. Set v to 1.0 for maximum brightness.

Param:cgr_matrices scipy.sparse.dok_matrix or tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Tuple provides order for HSV channels of image.
Param:h float – (0..1) value for h channel if not used for cgr data
Param:s float – (0..1) value for s channel if not used for cgr data
Param:v float – (0..1) value for v channel if not used for cgr data
Param:main str – the channel to place the cgr matrix in if a single cgr matrix is passed
Param:show bool – render CGR picture to screen
Param:write bool – write CGR picture to file
Param:out str – filename to write to
Param:resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height
Returns:None
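
As a rough illustration of the single-cgr case, the sketch below puts scaled counts into the S channel while H and V stay constant, then converts each cell to RGB with the standard library's colorsys. The function name single_cgr_to_rgb and the list-of-lists input are assumptions for the example; the module itself works on scipy sparse matrices and may convert colours differently.

```python
import colorsys


def single_cgr_to_rgb(scaled, h=0.8, v=1.0):
    """Place scaled counts (0..1) in the S channel; H and V are constants.

    Hypothetical helper, not part of the cgr module.
    """
    return [[colorsys.hsv_to_rgb(h, s, v) for s in row] for row in scaled]


# Hypothetical 2 x 2 grid of counts already scaled to the range 0..1.
scaled_counts = [[0.0, 1.0],
                 [0.5, 0.25]]
rgb = single_cgr_to_rgb(scaled_counts, h=0.0, v=1.0)
for row in rgb:
    print(row)
```

With this arrangement a cell with no counts has zero saturation and renders white, while the most frequent kmer renders in the fully saturated hue.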
cgr.draw_single_cgr(cgr_matrix, h=0.8, s=0.5, v=1.0, main='s', show=True, write=True, out='cgr.png', resize=False)

draws a single cgr image, selecting channels and resizing as appropriate

Param:cgr_matrix scipy.sparse.dok_matrix to be drawn.
Param:h float – (0..1) value for h channel if not used for cgr data
Param:s float – (0..1) value for s channel if not used for cgr data
Param:v float – (0..1) value for v channel if not used for cgr data
Param:main str – the channel to place the cgr matrix in
Param:show bool – render CGR picture to screen
Param:write bool – write CGR picture to file
Param:out str – filename to write to
Param:resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height
Returns:None
cgr.draw_three_cgrs(cgr_matrices, show=True, write=True, out='cgr.png', resize=False)

Draws a tuple of 3 cgr matrices as an image

Param:cgr_matrices tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn. Tuple provides order for HSV channels of image
Param:show bool – render CGR picture to screen
Param:write bool – write CGR picture to file
Param:out str – filename to write to
Param:resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height
Returns:None
cgr.draw_two_cgrs(cgr_matrices, v=1.0, show=True, write=True, out='cgr.png', resize=False)

Draws two cgr matrices into a single image. The first matrix of the tuple becomes the h channel, the second becomes the s channel; the v channel is set by the v parameter.

Param:cgr_matrices tuple of scipy.sparse.dok_matrix elements, cgrs to be drawn.
Param:v float – (0..1) value for v channel
Param:show bool – render CGR picture to screen
Param:write bool – write CGR picture to file
Param:out str – filename to write to
Param:resize bool or int – if False no image resizing is done, if an int image is rescaled to resize pixels width and height
Returns:None
cgr.estimate_genome_size(fasta: str) → int

Guesses genome size from fasta file size, assumes 1 byte ~= 1 base

Param:fasta str – a fasta file
Returns:int – approximate genome size in nucleotides
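
The 1 byte ~= 1 base approximation amounts to reading the file size. A minimal sketch, assuming the estimate is simply the size on disk (the helper name approx_genome_size and the throwaway file are illustrative only):

```python
import os
import tempfile


def approx_genome_size(fasta: str) -> int:
    """Approximate the number of bases as the file size in bytes.

    Hypothetical helper, not the cgr.estimate_genome_size implementation.
    """
    return os.path.getsize(fasta)


# Write a tiny throwaway FASTA file in binary mode so the byte count is exact.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as handle:
    handle.write(b">seq1\nACGTACGT\n")
    demo_path = handle.name

demo_size = approx_genome_size(demo_path)
os.remove(demo_path)
print(demo_size)
```

Header lines and newlines count towards the size, so the estimate is slightly high: here 15 bytes for 8 bases. For real genomes the overhead is proportionally small.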
cgr.from_fasta(fasta_file: str, outfile: str = 'my_cgrs', as_single: bool = False, k: int = 7) → None

Factory function to load in a FASTA file and generate a binary .npy of CGRs

Parameters:
  • fasta_file – str FASTA file to load
  • outfile – str outfile to save
  • as_single – bool If True, treats all entries as a single sequence and returns one CGR. If False, treats all entries individually and returns many CGRs
  • k – int length of kmer to use
Returns:

None

cgr.get_coord(kmer: str) → List[int]

given a kmer gets the coordinates of the box position in the cgr grid, returns as list [x,y] of coordinates

Param:kmer str – a string of nucleotides
Returns:coords [x,y] – the x,y positions of the nucleotides in the cgr
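
One common way to map a kmer to a grid cell is to let each nucleotide contribute one bit to x and one bit to y. The sketch below shows that idea only. The corner layout (A, C, G, T) and the choice of the first base as the most significant bit are assumptions, so the coordinates may differ from those returned by cgr.get_coord.

```python
# Assumed corner layout as (x_bit, y_bit); the module may use another one.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_coord(kmer: str) -> list:
    """Return [x, y] for a kmer on a 2**k by 2**k grid (hypothetical helper)."""
    x = y = 0
    for base in kmer.upper():
        x_bit, y_bit = CORNERS[base]
        # Each base halves the remaining square, adding one bit per axis.
        x = (x << 1) | x_bit
        y = (y << 1) | y_bit
    return [x, y]


for example in ("A", "GT", "ACGT"):
    print(example, kmer_to_coord(example))
```

Every kmer of length k lands in a distinct cell, so the grid can hold a count for each possible kmer.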
cgr.get_grid_size(k: int) → int

returns the grid size (total number of elements) for a cgr of kmers of length k

Param:k int – the value of k to be used
Returns:int – the total number of elements in the grid
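
For orientation, the arithmetic behind CGR grid sizes is small enough to check by hand: there are 4 ** k possible kmers, arranged on a square with 2 ** k cells per side. The helper grid_dimensions below is a hypothetical name used only to show both numbers; it makes no claim about which of them cgr.get_grid_size returns.

```python
def grid_dimensions(k: int):
    """Return (side, cells) for a CGR grid of kmers of length k."""
    cells = 4 ** k   # one cell per possible kmer
    side = 2 ** k    # square root of the number of cells
    return side, cells


for k in (5, 7, 11):
    side, cells = grid_dimensions(k)
    print(f"k={k}: {side} x {side} grid, {cells} cells")
```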
cgr.get_k(jellyfish: str) → int

asks the jellyfish file what value was used for k

Param:jellyfish str – jellyfish DB file
Returns:int – length of k used
cgr.get_kmer_list(jellyfish: str) → Generator[List[T], str, None]

runs jellyfish dump on a Jellyfish DB. Captures output as a generator stream. Each item returned is a list [kmer: str, count: str]

Param:jellyfish str – a Jellyfish DB file
Returns:Generator – a list of [kmer string, times_kmer_seen]
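
The generator behaviour can be sketched without running Jellyfish by parsing text lines of the form "KMER COUNT", which is what jellyfish dump prints in its column format (-c). Whether the module invokes the command with that flag is an assumption, and parse_dump_lines is a hypothetical name.

```python
from typing import Iterable, Iterator, List


def parse_dump_lines(lines: Iterable[str]) -> Iterator[List[str]]:
    """Lazily yield [kmer, count] pairs from 'KMER COUNT' text lines."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield line.split()


# Hypothetical dump output for a 3-mer database.
sample_output = ["AAA 12", "ACG 3", "TTT 1"]
pairs = list(parse_dump_lines(sample_output))
print(pairs)
```

As in the description above, the count stays a string and needs converting with int() before arithmetic.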
cgr.get_max_count(jellyfish) → int

estimates the count of the most represented kmer in the jellyfish file by using the last bucket of the

Param:jellyfish – jellyfish DB file
Returns:int – estimated count of the most represented kmer

cgr.is_cgr_matrix(obj) → bool

returns true if obj is a scipy.sparse.dok.dok_matrix object

cgr.join_cgr(cgrs: tuple) → numpy.ndarray

Takes tuple of cgrs of shape (n,n) and returns one stacked array of size (n,n, len(cgrs) )

Param:cgrs tuple – tuple of cgrs to be joined
Returns:numpy.ndarray
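
The shape change can be illustrated with numpy.stack on small dense arrays. This is a sketch of the stacking idea rather than the module's code; the arrays cgr_a and cgr_b are made-up stand-ins, and a scipy sparse matrix would first need converting with its toarray() method.

```python
import numpy as np

# Two hypothetical dense 4 x 4 count matrices standing in for CGRs.
cgr_a = np.arange(16).reshape(4, 4)
cgr_b = np.ones((4, 4), dtype=int)

# Stack along a new last axis so that each CGR becomes one channel.
merged = np.stack((cgr_a, cgr_b), axis=-1)
print(merged.shape)
```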
cgr.load_npy(file: str) → numpy.ndarray

loads a numpy .npy file as an ndarray. Useful for restoring collections of cgrs, but the resulting array is not directly compatible with the drawing methods here.

Param:file str – numpy .npy file to load
Returns:numpy.ndarray
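
A save and restore round trip can be sketched directly with numpy.save and numpy.load, which is presumably close to what the save and load helpers wrap (an assumption). The array contents and the temporary path are illustrative.

```python
import os
import tempfile

import numpy as np

# Hypothetical stacked array: a 4 x 4 grid with 2 channels.
stacked = np.zeros((4, 4, 2), dtype=np.int32)
stacked[1, 2, 0] = 7

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_cgr")
    np.save(path, stacked)             # NumPy appends ".npy" to the name
    restored = np.load(path + ".npy")  # read fully into memory

print(restored.shape, restored.dtype)
```

The restored object is a plain ndarray, so shape and dtype survive the round trip but the sparse matrix type does not.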

cgr.make_blanks_like(a: scipy.sparse.dok.dok_matrix, h: float = 1.0, s: float = 1.0, v: float = 1.0) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

returns tuple of numpy.ndarrays with default values of h,s and v of shape of a

Param:a scipy.sparse.dok_matrix – a cgr matrix to make blanks like
Param:h float – the values with which to fill the first numpy.ndarray
Param:s float – the values with which to fill the second numpy.ndarray
Param:v float – the values with which to fill the third numpy.ndarray
Returns:Tuple of numpy.ndarray
cgr.many_seq_record_to_many_cgr(seq_record: Bio.SeqIO.FastaIO, k: int) → scipy.sparse.dok.dok_matrix
Parameters:
  • seq_record – Bio.SeqIO FASTA record
  • k – int size of k to use
Returns:

scipy.sparse.dok_matrix

cgr.many_seq_record_to_one_cgr(fa_file: str, k: int) → scipy.sparse.dok.dok_matrix

Reads many sequence records in a FASTA file into a single CGR matrix, treating all sequence records as if they are one sequence, e.g. a genome sequence split into chromosomes.

Param:fa_file str – FASTA file name
Param:k int – length of k to use
Returns:scipy.sparse.dok_matrix
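
Pooling records into one set of counts can be pictured with a pure-Python kmer counter. This is a stand-in for illustration only: pooled_kmer_counts is a hypothetical name, and the choice not to count kmers that span the boundary between two records is an assumption.

```python
from collections import Counter


def pooled_kmer_counts(sequences, k):
    """Count kmers over all sequences together (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along each record separately.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Two made-up records, e.g. two chromosomes of one genome.
records = ["ACGT", "CGTA"]
pooled = pooled_kmer_counts(records, 3)
print(pooled)
```

The kmer CGT occurs once in each record and is counted twice in the pooled result, which is the effect of treating all records as a single sequence.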

cgr.resize_rgb_out(rgb: numpy.ndarray, resize: int) → numpy.ndarray

given an rgb image in one pixel per kmer size, increases size so that the resulting image is resize * resize pixels

Param:rgb numpy.ndarray – an RGB image array
Param:resize int – pixel width (and therefore height) of resulting image
Returns:numpy.ndarray – resized image with shape (resize, resize)
cgr.run_jellyfish(fasta: str, k: int, out: str) → int

runs Jellyfish on fasta file using k kmer size, produces Jellyfish db file as side effect.

Param:fasta str – a fasta file
Param:k int – size of kmers to use
Param:out str – file in which to save kmer db
Returns:int – return code of Jellyfish subprocess
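
For readers unfamiliar with Jellyfish, a count run looks roughly like the command assembled below. The flags -m (kmer length), -s (initial hash size) and -o (output file) belong to jellyfish count, but the exact options the module passes, the default size of 100M and the helper names are assumptions.

```python
import subprocess


def jellyfish_count_command(fasta: str, k: int, out: str, size: str = "100M"):
    """Build an argument list for 'jellyfish count' (hypothetical helper)."""
    return ["jellyfish", "count", "-m", str(k), "-s", size, "-o", out, fasta]


def run_count(fasta: str, k: int, out: str) -> int:
    """Run the command and return its exit code; needs jellyfish on the PATH."""
    return subprocess.run(jellyfish_count_command(fasta, k, out)).returncode


cmd = jellyfish_count_command("test_data/NC_012920.fasta", 11, "11mer.jf")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.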
cgr.save_as_csv(cgr_matrix: scipy.sparse.dok.dok_matrix, file: str = 'cgr_matrix.csv', delimiter: str = ', ', fmt: str = '%d')

Writes simple 1 channel cgr matrix to CSV file.

See also numpy.savetxt

Param:cgr_matrix scipy.sparse.dok_matrix – cgr_matrix to save
Param:file str – filename to write to
Param:delimiter str – column separator character
Param:fmt str – text format string
Returns:None
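
The delimiter and fmt defaults map directly onto numpy.savetxt. A small sketch of the resulting text, written to an in-memory buffer instead of a file; the dense array is a made-up example, and a sparse cgr matrix would need converting to a dense array first.

```python
import io

import numpy as np

# Hypothetical 2 x 2 count matrix in dense form.
dense = np.array([[1, 0],
                  [0, 3]])

buffer = io.StringIO()
np.savetxt(buffer, dense, delimiter=", ", fmt="%d")
csv_text = buffer.getvalue()
print(csv_text)
```

Each matrix row becomes one line, with zero counts written out explicitly.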
cgr.save_cgr(cgr_obj: numpy.ndarray, outfile: str = 'cgr') → None

Saves cgr_obj as a numpy .npy file. cgr_obj is a numpy.ndarray of one or more dimensions. It is saved as an ndarray, not a dok_matrix, so it can be loaded in regular numpy as a collection of cgrs.

Parameters:
  • cgr_obj – numpy.ndarray constructed cgr_object to save
  • outfile – str file
Returns:

None

cgr.scale_cgr(cgr_matrix: scipy.sparse.dok.dok_matrix) → scipy.sparse.dok.dok_matrix

returns scaled version of cgr_matrix in range 0..1

Param:cgr_matrix scipy.sparse.dok_matrix – matrix to scale
Returns:scaled scipy.sparse.dok_matrix
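
One simple way to bring counts into the range 0..1 is to divide by the largest count. Whether the module scales this way or uses another scheme (for example min-max scaling) is an assumption, and the sketch uses a dense array and the hypothetical name scale_to_unit.

```python
import numpy as np


def scale_to_unit(counts: np.ndarray) -> np.ndarray:
    """Divide by the largest count so that all values fall in 0..1."""
    peak = counts.max()
    if peak == 0:
        return counts.astype(float)  # nothing observed; avoid dividing by zero
    return counts / peak


raw = np.array([[0, 2],
                [4, 8]])
scaled = scale_to_unit(raw)
print(scaled)
```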
cgr.stack_cgrs(cgr_matrices: Tuple) → numpy.ndarray

Stacks a tuple of N cgrs, each a numpy.ndarray of shape (w,h), and returns a single ndarray of shape (w,h,N).

Parameters:cgr_matrices – tuple of cgr_matrices
Returns:numpy.ndarray
cgr.write_out(rgb: numpy.ndarray, out: str, resize: int) → None

writes RGB array as image

Parameters:
  • rgb – numpy.ndarray – RGB channel image
  • out – str file to write to
  • resize – bool or int. If False will not resize, if int will resize image up to that size
Returns:

None