Table types
Type names consist of two parts. The first part indicates how far the type
can count and the second part whether it is a table or a graph.
Possible choices for the first part:
* Node, uses 1bit counter
* SmallCount, uses a 4bit counter
* Count, uses a 8bit counter
Possible choices for the second part:
* Table, keep track of kmers
* Graph, navigate, tag, etc the de Bruijn graph formed by the k-mers
C++ class name:
Python methods:
- k = ksize() - return the k-mer size (a positive integer).
- hashval = hash(dna_kmer) - return the result of hashing
dna_kmer
, which will be a non-negative integer. len(dna_kmer)
must be exactly the k-mer size. Which hash function is used is dependent on the table type (@document).
- dna_kmer = reverse_hash(hashval) - return a DNA string that will hash to
hashval
. If there are multiple such strings, return only one. May be unimplemented for particular table types in which case a ValueError will be returned.
- sizelist = hashsizes() - return the list of table sizes used in construction.
- n = n_unique_kmers() - retrieve an estimate for the number of unique k-mers inserted into the table. Note, this may be order dependent.
- n = n_occupied() - retrieve the fraction of bins occupied in the table.
- add(dna_kmer_or_hashval) - increment the count associated with either a DNA k-mer or a hashval. Depending on max count for the tabletype and bigcount settings, the count may top out at 1, 15, 255, or 65535. (@CTB add method for retrieving max_count)
- count (synonym for add)
- get(dna_kmer or hashval) - retrieve the count associated with a DNA k-mer or a hashval.
- list_of_strings = get_kmers(seq) - return the list of k-mer strings in the given sequence.
- list_of_hashes = get_kmer_hashes(seq) - return the list of the hashed k-mers in the given sequence.
- hashset = get_kmer_hashes_as_hashset(seq) - return the hashset of hashed k-mers in the given sequence.
- list_of_counts = get_kmer_counts(seq) - return the list of the counts of the k-mers in the given sequence.
- save(filename) - save the data to a file on disk.
- load(filename) - load the data from a file on disk.
- min_count = get_min_count(seq) - return the minimum count for k-mers in seq
- med_count, avg_count, stddev_count = get_median_count(seq) - return the median, average, and stddev of the counts for k-mers in the sequence.
- max_count = get_max_count(seq) - return the maximum count for k-mers in the sequence.
- num_kmers = consume(seq) - count all the k-mers in the given DNA string. @CTB should be consume string
- consume_fasta(filename) - count all the k-mers in a (DNA) FASTA/FASTQ file.
- consume_fasta_with_reads_parser(khmer.ReadParser object) - count all the k-mers in (DNA) sequences returned from a ReadParser object. This can be used to ensure various forms of pairing are present, etc.
- (trim_seq, trim_pos) = trim_on_abundance(seq, abund) - trim the sequence at the first k-mer with an abundance strictly below (<) the provided abundance.
- (trim_seq, trim_pos) = trim_below_abundance(seq, abund) - trim the sequence at the first k-mer with an abundance strictly above (>) the provided abundance.
- list_of_posns = find_spectral_error_positions(seq, abund) - find potential locations of errors in input sequence, where k-mers are below given abundance.
- set_use_bigcount(bool) - some table types (Counttable and Countgraph) support counting past their max value, using a (memory intensive) C++ stl::map. That turns on this "bigcount" behavior. This will raise a ValueError when called on a table type that does not support it.
- get_use_bigcount - return the bigcount value (False by default).
- dist = abundance_distribution(filename, tracking_obj) - generate an abundance distribution for the k-mers in the given file, using the tracking_obj to avoid double-counting identical k-mers.
- dist = abundance_distribution_with_reads_parser(readparser_obj, tracking_obj) - generate an abundance distribution for the k-mers loaded from the given readparser, using the tracking_obj to avoid double-counting identical k-mers.