The p-spectrum string kernel.
PSpectrumStringKernel (const std::vector< std::vector< std::string > > &datasets, const size_t p)
    Initialize the PSpectrumStringKernel with the given string datasets.
const std::vector< std::vector< std::map< std::string, int > > > & Counts () const
    Access the lists of substrings.
std::vector< std::vector< std::map< std::string, int > > > & Counts ()
    Modify the lists of substrings.
template<typename VecType > double Evaluate (const VecType &a, const VecType &b) const
    Evaluate the kernel for the string indices given.
size_t P () const
    Access the value of p.
size_t & P ()
    Modify the value of p.
std::string ToString () const
std::vector< std::vector< std::map< std::string, int > > > counts
    Mappings of the datasets to counts of substrings.
const std::vector< std::vector< std::string > > & datasets
    The datasets.
size_t p
    The value of p to use in calculation.
Given a length p, the p-spectrum kernel counts the contiguous substrings of length p that two strings have in common. The kernel takes every substring of length p in one string and counts how many times it appears in the other string.
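The counting described above can be sketched as follows. This is a minimal standalone illustration, not the mlpack implementation: the function names CountSubstrings and PSpectrum are hypothetical, and the sketch assumes the match count is the sum, over shared substrings, of the product of their occurrence counts in each string.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: count every contiguous substring of length p in s.
std::map<std::string, int> CountSubstrings(const std::string& s,
                                           const std::size_t p)
{
  std::map<std::string, int> counts;
  if (p == 0 || s.size() < p)
    return counts; // No substrings of length p exist.

  for (std::size_t i = 0; i + p <= s.size(); ++i)
    ++counts[s.substr(i, p)];

  return counts;
}

// Hypothetical helper: for each length-p substring of a, add the number of
// times it occurs in a multiplied by the number of times it occurs in b.
double PSpectrum(const std::string& a,
                 const std::string& b,
                 const std::size_t p)
{
  const std::map<std::string, int> aCounts = CountSubstrings(a, p);
  const std::map<std::string, int> bCounts = CountSubstrings(b, p);

  double result = 0.0;
  for (const auto& entry : aCounts)
  {
    const auto match = bCounts.find(entry.first);
    if (match != bCounts.end())
      result += static_cast<double>(entry.second) * match->second;
  }

  return result;
}
```

For example, with p = 2 the string "abab" contains "ab" twice and "ba" once, while "ab" contains "ab" once, so PSpectrum("abab", "ab", 2) evaluates to 2.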
The string kernel, when created, must be passed a reference to a series of string datasets (std::vector<std::vector<std::string> >&). This is because MLPACK only supports datasets which are Armadillo matrices -- and a dataset of variable-length strings cannot be easily cast into an Armadillo matrix.
Therefore, once the PSpectrumStringKernel is created with a reference to the string datasets, a 'fake' Armadillo data matrix must be created, whose columns simply hold the indices of the strings they represent. This 'fake' matrix has two rows and n columns (where n is the number of strings in the dataset). The first row holds the index of the dataset (remember, the kernel can have multiple datasets), and the second row holds the index of the string within that dataset. A fake matrix containing only strings from dataset 0 might look like this:
[[0 0 0 0 0 0 0 0 0]
 [0 1 2 3 4 5 6 7 8]]
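The layout shown above could be produced along these lines. This is only a sketch: the helper name MakeIndexMatrix is hypothetical, and a nested std::vector stands in for the Armadillo matrix so that the example needs nothing beyond the standard library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: build a 2 x n index table for a single dataset.
// Row 0 repeats the dataset index; row 1 enumerates the string indices.
// In mlpack the result would be an Armadillo matrix; a nested vector is
// used here purely for illustration.
std::vector<std::vector<std::size_t>> MakeIndexMatrix(
    const std::size_t datasetIndex,
    const std::size_t numStrings)
{
  std::vector<std::vector<std::size_t>> fake(
      2, std::vector<std::size_t>(numStrings, datasetIndex));

  for (std::size_t i = 0; i < numStrings; ++i)
    fake[1][i] = i;

  return fake;
}
```

Calling MakeIndexMatrix(0, 9) gives the two rows of the example: nine zeros in the first row and the indices 0 through 8 in the second.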
This fake matrix is then given to the machine learning method, which will eventually call PSpectrumStringKernel::Evaluate(a, b), where a and b are two columns of the fake matrix. The string kernel will then map these fake columns back to the strings they represent, and then correctly evaluate the kernel.
Unfortunately, not every machine learning method will work with this kernel. Only methods that never operate on the explicit representation of points can use it. So, for instance, one cannot build a kd-tree on strings, because the BinarySpaceTree<> class will split the data according to the fake data matrix -- resulting in a meaningless tree. This kernel was originally written for the FastMKS method, so it will work with that at the very least.
Definition at line 74 of file pspectrum_string_kernel.hpp.
Initialize the PSpectrumStringKernel with the given string datasets. For more information on this, see the general class documentation.
Parameters:
datasets Sets of string data.
p The length of substrings to search.
Access the lists of substrings.
Definition at line 102 of file pspectrum_string_kernel.hpp.
References counts.
Modify the lists of substrings.
Definition at line 105 of file pspectrum_string_kernel.hpp.
References counts.
Evaluate the kernel for the string indices given. As mentioned in the class documentation, a and b should be 2-element vectors, where the first element contains the index of the dataset and the second element contains the index of the string. Therefore, if [2 3] is passed for a, the string used will be datasets[2][3] (datasets is of type std::vector<std::vector<std::string> >&).
Parameters:
a Index of string and dataset for first string.
b Index of string and dataset for second string.
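The index lookup described above can be sketched with a small stand-in class. This is an illustration under stated assumptions, not the mlpack source: the class name IndexedSpectrumKernel is hypothetical, the per-string substring counts are assumed to be precomputed in the constructor (mirroring the documented counts member), and the kernel value is assumed to be the sum of products of matching substring counts.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the kernel; not the mlpack class itself.
class IndexedSpectrumKernel
{
 public:
  IndexedSpectrumKernel(const std::vector<std::vector<std::string>>& datasets,
                        const std::size_t p)
      : counts(datasets.size())
  {
    // Precompute, for every string, how often each length-p substring occurs.
    for (std::size_t d = 0; d < datasets.size(); ++d)
    {
      counts[d].resize(datasets[d].size());
      for (std::size_t s = 0; s < datasets[d].size(); ++s)
      {
        const std::string& str = datasets[d][s];
        for (std::size_t i = 0; p > 0 && i + p <= str.size(); ++i)
          ++counts[d][s][str.substr(i, p)];
      }
    }
  }

  // a and b are 2-element index vectors: [dataset index, string index].
  template<typename VecType>
  double Evaluate(const VecType& a, const VecType& b) const
  {
    // Map the index pairs back to the substring counts of the two strings.
    const auto& aMap = counts[static_cast<std::size_t>(a[0])]
                             [static_cast<std::size_t>(a[1])];
    const auto& bMap = counts[static_cast<std::size_t>(b[0])]
                             [static_cast<std::size_t>(b[1])];

    double eval = 0.0;
    for (const auto& entry : aMap)
    {
      const auto match = bMap.find(entry.first);
      if (match != bMap.end())
        eval += static_cast<double>(entry.second) * match->second;
    }

    return eval;
  }

 private:
  // counts[d][s] maps each length-p substring of datasets[d][s] to its count.
  std::vector<std::vector<std::map<std::string, int>>> counts;
};
```

With datasets {{"abab", "ab"}, {"xyz"}} and p = 2, passing [0 0] for a and [0 1] for b compares "abab" with "ab", while [1 0] selects "xyz". The sketch performs no bounds checking, so the index pairs must refer to strings that exist.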
Access the value of p.
Definition at line 109 of file pspectrum_string_kernel.hpp.
References p.
Modify the value of p.
Definition at line 111 of file pspectrum_string_kernel.hpp.
References p.
Definition at line 116 of file pspectrum_string_kernel.hpp.
References datasets, and mlpack::util::Indent().
Mappings of the datasets to counts of substrings. Such a huge structure is not wonderful...
Definition at line 133 of file pspectrum_string_kernel.hpp.
Referenced by Counts().
The datasets.
Definition at line 129 of file pspectrum_string_kernel.hpp.
Referenced by ToString().
The value of p to use in calculation.
Definition at line 136 of file pspectrum_string_kernel.hpp.
Referenced by P().
Generated automatically by Doxygen for MLPACK from the source code.