Abstract
In this paper we define a new similarity measure: LCSk, aiming at finding the maximal number of k length substrings matching in both input strings while preserving their order of appearance, for which the traditional LCS is a special case, where k=1. We examine this generalization in both theory and practice. We first describe its basic solution and give an experimental evidence in real data for its ability to differentiate between sequences that are considered similar according to the LCS measure. We then examine extensions of the LCSk definition to LCS in at least k-length substrings (LCS≥k) and 2-dimensional LCSk and also define complementary EDk and ED≥k distances.
Original language | English |
---|---|
Pages (from-to) | 11-26 |
Number of pages | 16 |
Journal | Theoretical Computer Science |
Volume | 638 |
DOIs | |
State | Published - 25 Jul 2016 |
Externally published | Yes |
Keywords
- Dynamic programming
- Edit distance
- Longest common subsequence
- Similarity of strings