CCGG

What is the Graphical Genome

The Graphical Genome is an implementation of a pangenome that uses a directed graph to represent sequence data from disparate animals in one representation.

Graphical Genome basic usage

The Graphical Genome, when initialized, returns an object with four instance data dictionaries, nodes, edges, incoming, outgoing.

GraphicalGenome.nodes : Dictionary of nodes. Referenced with string node names

Each Node currently contains the following keys:
- A-H = (str) : Each key A-H contains the relative position the original founder sequence that the node 45-mer can be found
- repeatclass = list(str) : Contains the list of all repeat classes for the given 45-mer
- seq = (str) : Sequence GraphicalGenome.edges : Dictionary of edges. Referenced with string edge names
Each Edge in the dictionary currently contains the following keys:
- src = (str) : Source node for the edge
- dst = (str) : Destination node for the edge
- strains = list(str) : List of strains that lie on the edge
- seq = (str) : Sequence GraphicalGenome.outgoing : Dictionary of incoming edges. Referenced with string node names GraphicalGenome.incoming : Dictionary of outgoing edges. Referenced with string edge names

Source and Sink Edges

The 8 founder strain sequences start and end with a "Source" and "Sink" edge.

The following is an example of a source and sink edge on chromosome 1 for strain A. While normal edges begin with "E", source nodes begin with "S" and sink nodes begin with "K". The remainder of their name includes the relevant chromosome information and the strain for which they are a source of sink.

GraphicalGenome.edges["S01.00000001"] = {"src" : "SOURCE", "seq" : <sequence>, "dst" : <node>, "strain" : ["A"]}

GraphicalGenome.edges["K01.00000001"] = {"src" : <node>, "seq" : <sequence>, "dst" : "SINK", "strain" : ["A"]}

API

Initialization

Initalize a new graphical genome instance

Parameters:
- (str) nodefile - Absolute path to the nodefile
- (str) edgefile - Absolute path to the edgefile
Optional Parameters
- (bool) verbose=False - Prints verbose output as Graphical genome is being generated.
- (int) enamelen=12 - Defines the edge key name length.
- (int) nnamelen=12 - Defines the node key name length.
Output: new instance of Graphical Genome

import CCGG
genome = GraphicalGenome(nodefile, edgefile) # Initialize Graphical Genome instance with the specified node and edge file
genome.edges[<edgename>]["strains"] # Example of looking at the list of strains given an edgename
genome.nodes[<nodename>]["repeatclass"] # Example of returning the list of repeat classes of a given node

`writeFasta`

Write updated node and edge dictionaries in a FASTA format.

Parameters:
- (str) filename - Absolute path for the file that you would like to write. Files that do not exist will be created in the specified location with the specified name.
- (dict) file_dict - Node or Edge dictionary that you would like to write.
- (list) keylist - List of strings of dictionary keys tha tyou want to ignore during the file write. This will require you to handle these keys on your own.
Output: FASTA file created in the specified location

import CCGG
genome = GraphicalGenome(nodefile, edgefile) # Initialize Graphical Genome instance with the specified node and edge file
genome.nodes[<nodename>]["new_annotation"] = <new_data> # Add a new annotation to the specified node
genome.writeFasta("FASTAFolder/newfile.fa", newinstance.nodes) # Writes updated node dictionary to the specified FASTA file

`returnNodeEdgePath`

Returns a list of [node,edge,node,edge...] that traces out a path for a strain on a given chromosome.

Parameters:
- (str) strain - Founder or CC strain you wish to trace.
- (str) chromo - Chromosome you are iterating through.

import CCGG
genome = GraphicalGenome(nodefile, edgefile) # Initialize Graphical Genome instance with the specified node and edge file
genome.returnNodeEdgePath(<strain>, <chromo>) # Get the list of nodes and edges for a strain

`findMissingPaths`

Given a Collaborative Cross path, look for all node pairs that do not contain an edge between them that has the specified Collaborative Cross strain on them.

Parameters:
- (str) strain - Founder or CC strain you wish to trace.
- (str) chromo - Chromosome you are iterating through.
Deprecation warning:
- THIS CURRENTLY WORKS FOR EDGES WITHOUT FLOATING NODES, WHEN FLOATING NODES ARE ADDED THIS NEEDS TO BE UPDATED
Output:
- (list(tuples(node, node))) - Retuns a list of node tuples that represent the missing paths

import CCGG
genome = GraphicalGenome(nodefile, edgefile) # Initialize Graphical Genome instance with the specified node and edge file
paths = genome.findMissingPaths(<strain>, <chromo>) # Get the missing paths as a list

`reconstructSequence`

Reconstruct the sequence given a strain (ABCDEFGH or CC) and can return more than one sequence if heterozygosity is known.

Parameters:
- (str) strain - The strain that you want to trace
Optional parameters:
- (int) path - 0 for base path, 1 for first heterozygous path identified, 2 for second heterozygous path identified.
Output: The entire genomic sequence, from start to finish, for the given strain

import CCGG
genome = GraphicalGenome(nodefile, edgefile) # Initialize Graphical Genome instance with the specified node and edge file
sequence = genome.reconstructSequence(<strain>)

`nextAnchor`

Return the next monotonically incrementing anchor given a nodename in the graph

Parameters:
- (str) anchor - The current anchor node
Output:
- (str) - The next monotonically increasing anchor node

import CCGG
# Basic usage
genome = CCGG.GraphicalGenome(nodefile, edgefile)
nextAnchor = genome.nextAnchor(<anchor>)

# Iterating through the genome anchor to anchor
currentAnchor = <anchor>
while currentAnchor != "SINK":
    # Add any analysis here #
    currentAnchor = genome.nextAnchor(currentAnchor)

`prevAnchor`

Return the previously monotonically decreasing anchor given a nodename in the graph

Parameters:
- (str) anchor - The current anchor node
Output:
- (str) - The next monotonically decreasing anchor node

import CCGG
# Basic usage
genome = CCGG.GraphicalGenome(nodefile, edgefile)
nextAnchor = genome.prevAnchor(<anchor>)

# Iterating through the genome backwards anchor to anchor
currentAnchor = <anchor>
while currentAnchor != "SOURCE":
    # Add any analysis here #
    currentAnchor = genome.prevAnchor(currentAnchor)

`tracePath`

Return a [node, edge, node...] path tracing a strain between two anchors

Parameters
- (str) strain - the strain that you want to trace through the genome
- (str) start - the anchor where you would like to start the trace
- (str) end - the anchor where you would like to end the trace
Output
- (list(node,edge,node,...)) - Output of alternating node, edge pairs that define a path through the graph from the start node to the end node.

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile) # Instantiate the genome object
path = genome.tracePath(<strain>, <start>, <end>)

`entityPath`

Return the path through the entire genome for a given strain as a node,edge,node... list.

Optional Parameters:
- (str) strain - The strain you would like to trace through the genome. (ABCDEFGH or CC)
  - Defaults to "B"
Output:
- The [node, edge, node...] path from the beginning to the end of the genome that traces the specified strain.

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile) # Instantiate the path
path = genome.entityPath("A") # Returns the node, edge, node... list that defines the path through the entire genome for the "A" strain

`findEdge`

Given a node and a strain, return the incoming or outgoing edge that contains the strain. In the case of CC strains this will also look for cases of heterozygosity and potentially return more than one edge that contains each heterozygous strain.

Parameters:
- (str) node - The node who's edge you want to examine
- (str) strain - the specified strain you wish to find on the edge
Optional Parameters:
- (str) origin - If origin == "src" then return outgoing edges with the specified strain on them. If origin == "dst" then return incoming edges with the specified strain on them.
  - Defaults to "src".
  - Must be an option either in "src" or "dst".
Output:
- Return an edge or multiple edges that contains the specified strain.

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile)
edges = genome.findEdge(<node>, <strain>, src="dst") # return the incoming edge/edges that contain the desired strain

`anchorPosList`

populates global field: self.sortednodes. It will return a dictionary of anchors as keys with 8 [founder] values that show their.

`boundingAnchor`

Returns the bouding anchors of the given strain and coordinate positions.

Parameters:
- (str) strain - Founder or CC path that you want to follow
- (int) start - Linear coordinate representing desired left bound
- (int) end - Linear coordinate representing desired right bound
Output:
- Returns the two bounding anchors that most closely contain the start and end linear coordinates provided.

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile)
leftBoundingAnchor, rightBoundingAnchor = genome.boudingAnchor("B", <start>, <end>)

`findNumOfFloatingNodes`

Returns the number of FloatingNodes between anchor pairs

Parameters:
- (str) start - The starting anchor
- (str) end - The ending anchor
Output Returns the number of floating nodes between two anchors

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile)
numOfFloatingNodes = genome.findNumOfFloatingNodes(<startAnchor>, <endAnchor>)

`findsequence`

Given a [node, edge, node, edge ....] list defining a path we construct the sequence on the provided path

Parameters:
- (list[str]) pathlist - The list of [node, edge, node...] names that define a path through the genome
Optional Parameters:
- (bool) countinganchor - Flag that states whether to count the first and last element in the pathlist
Output:
- Returns the sequence for the input pathlist

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile)
pathlist = [<node1>, <edge1>, <node2>, <edge2>, <node3>, ...]
sequence = genome.findsequence(pathlist)

returnNodeEdgePath

Given a strain, return an alternating list of edges and nodes that traces the strain through the graphical genome.

Parameters:
- (str) strain - Strain you wish to trace
- (str) chromo - Chromosome you are tracing the strain through
Output:
- list(str) - List of edges and nodes starting with a "sink" edge and ending with a "sink" edge

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile)
tracepath = genome.returnNodeEdgePath(<strain>, <chromo>)

DemoteAnchor

Given an anchor node, convert it into a sequence of floating node -> edge -> floating node

Parameters:
- (str) anchor - The anchor being demoted
Output:
- Modifies internal dictionary. Does not return a value

import CCGG
genome = CCGG.GraphicalGenome(nodefile, edgefile)
genome.DemoteAnchor(<anchor>)

If you have any question or comment, please contact at ziwei75@live.unc.edu or hangsu@email.unc.edu.

CCGG

Collaborative Cross Graphical Genome

What is the Graphical Genome

Graphical Genome basic usage

Source and Sink Edges

API

Initialization

writeFasta

returnNodeEdgePath

findMissingPaths

reconstructSequence

nextAnchor

prevAnchor

tracePath

entityPath

findEdge

anchorPosList

boundingAnchor

findNumOfFloatingNodes

findsequence

returnNodeEdgePath

DemoteAnchor

`writeFasta`

`returnNodeEdgePath`

`findMissingPaths`

`reconstructSequence`

`nextAnchor`

`prevAnchor`

`tracePath`

`entityPath`

`findEdge`

`anchorPosList`

`boundingAnchor`

`findNumOfFloatingNodes`

`findsequence`