___           ___
     /  /\         /  /\
    /  /::\       /  /:/
   /  /:/\:\     /  /:/
  /  /::\ \:\   /  /::\____
 /__/:/\:\_\:| /__/:/\:::::\
 \  \:\ \:\/:/ \__\/~|:|~~~~              ___           ___           ___
  \  \:\_\::/     |  |:|    ___          /  /\         /  /\         /  /\
   \  \:\/:/      |  |:|   /__/\        /  /::\       /  /::\       /  /::\
    \__\::/       |__|:|   \  \:\      /  /:/\:\     /  /:/\:\     /  /:/\:\
        ~~         \__\|    \__\:\    /  /::\ \:\   /  /::\ \:\   /  /::\ \:\
                            /  /::\  /__/:/\:\_\:\ /__/:/\:\ \:\ /__/:/\:\ \:\
                           /  /:/\:\ \__\/~|::\/:/ \  \:\ \:\_\/ \  \:\ \:\_\/
                          /  /:/__\/    |  |:|::/   \  \:\ \:\    \  \:\ \:\
                         /__/:/         |  |:|\/     \  \:\_\/     \  \:\_\/
                         \__\/          |__|:|~       \  \:\        \  \:\
                                         \__\|         \__\/         \__\/


-- '( A B O U T ) --------------------------------------------------------------

This program implements a derivative of BK-Tree data structure described in
"Some Approaches to Best-Match File Searching" paper of W. A. Burkhard and
R. M. Keller. For more information about the paper, see

@article{362025,
 author = {W. A. Burkhard and R. M. Keller},
 title = {Some approaches to best-match file searching},
 journal = {Commun. ACM},
 volume = {16},
 number = {4},
 year = {1973},
 issn = {0001-0782},
 pages = {230--236},
 doi = {http://doi.acm.org/10.1145/362003.362025},
 publisher = {ACM},
 address = {New York, NY, USA},
}

In the implementation, I have used below structure to store values in the nodes:

struct node {
  distance: Metric distance between current node and its parent.
  value   : Value stored in current node.
  nodes   : Nodes collected under this node.
}

During every search phase, instead of walking through nodes via

  j = {j, j+1, j-1, j+2, j-2, ...}
    = {0, (-1)^i+1 * ceil(i/2)}, i = 1, 2, 3, ...

as described in the original paper, program sorts nodes according their relative
distance to value being searched:

  distance = d(searched-value, current-node-value)
  sort(nodes, lambda(node) { abs(distance - distance-of(node)) }

There is no restriction on the type of the value which will be stored in the
tree, as long as you supply appropriate metric function.


-- '( P E R F O R M A N C E ) --------------------------------------------------

Here is the results of a detailed test performed using BK-TREE package.

In every test, 100 random words are searched in the randomly created word
databases. Words stored in the database are varying from 5 characters upto 10
characters.

  DB Size | Threshold  | Scanned Node | Found Node
  (words) | (distance) |  Percentage  | Percentage
 ---------+------------+--------------+------------
   10,000 |          1 |        0.11  |      0.01
          |          2 |        0.11  |      0.01
          |          3 |        0.11  |      0.01
          |          4 |        0.16  |      0.03
          |          5 |        0.37  |      0.11
          |          6 |        7.6   |      6.58
          |          7 |       24.46  |     23.43
          |          8 |       51.36  |     49.09
 ---------+------------+--------------+------------
   50,000 |          1 |       0.0022 |      0.002
          |          2 |       0.0023 |      0.0021
          |          3 |       0.0251 |      0.003
          |          4 |       0.0408 |      0.0127
          |          5 |       0.3943 |      0.3196
          |          6 |       2.543  |      2.3869
          |          7 |       7.6876 |      7.3874
          |          8 |      23.6635 |     22.9339
 ---------+------------+--------------+-------------
  100,000 |          1 |       0.0012 |      0.0010
          |          2 |       0.0012 |      0.0011
          |          3 |       0.0013 |      0.0017
          |          4 |       0.0231 |      0.0085
          |          5 |       0.3383 |      0.2998
          |          6 |       1.7957 |      1.7079
          |          7 |       6.3571 |      6.1654
          |          8 |      18.5599 |     18.0996
 ---------+------------+--------------+-------------
  500,000 |          1 |       0.0027 |      0.0002
          |          2 |       0.0029 |      0.0002
          |          3 |       0.0039 |      0.0011
          |          4 |       0.0012 |      0.0081
          |          5 |       0.3444 |      0.3213
          |          6 |       0.4244 |      0.4201
          |          7 |      13.4834 |     13.3619
          |          8 |      30.3728 |     30.1665

How this table should be interpreted? The lower the difference between the third
and fourth columns, the less redundant node visit performed. And the stability
of this difference (which means no fluctuations in the difference) indicates the
stability of the convergence.


-- '( E X A M P L E ) ----------------------------------------------------------

Here is an example about how to used supplied interface.

(defpackage :bk-tree-test (:use :cl :bk-tree))

(in-package :bk-tree-test)

(defvar *words* nil)

(defvar *tree* (make-instance 'bk-tree))

;; Build *WORDS* list.
(with-open-file (in "/home/vy/lisp/english-words.txt")
  (loop for line = (read-line in nil nil)
        while line
        do (push
            (string-trim '(#\space #\tab #\cr #\lf) line)
            *words*)))

;; Check *WORDS*.
(if (endp *words*)
    (error "*WORDS* is empty!"))

;; Fill the *TREE*.
(mapc
 (lambda (word)
   (handler-case (insert-value word *tree*)
     (duplicate-value (ctx)
       (format t "Duplicated: ~a~%" (value-of ctx)))))
 *words*)

;; Let's see that green tree.
(print-tree *tree*)

;; Test BK-Tree.
(time
 (mapc
  (lambda (result)
    (format t "~a ~a~%" (distance-of result) (value-of result)))
  (search-value "kernel" *tree* :threshold 2)))

;; Test brute levenshtein.
(time
 (loop with target-word = "kernel"
       with results = (sort
                       (mapcar
                        (lambda (word)
                          (cons (levenshtein target-word word) word))
                        *words*)
                       #'<
                       :key #'car)
       repeat 50      
       for (distance . value) in results
       while (<= distance 2)
       do (format t "~a ~a~%" distance value)))


-- '( C A U T I O N S ) --------------------------------------------------------

For performance reasons, LEVENSHTEIN function coming with the package has some
limitations both on the input string and penalty costs.

  (deftype levenshtein-cost ()
    "Available penalty costs."
    '(integer 0 7))

  (deftype levenshtein-input-length ()
    "Maximum distance a comparison can result."
    `(integer 0 ,(- most-positive-fixnum 7)))

Just in case, configure these variables suitable to your needs.


-- '( L I C E N S E ) ----------------------------------------------------------

Copyright (c) 2007-2009, Volkan YAZICI <volkan.yazici@gmail.com>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright notice, this
  list of conditions and the following disclaimer in the documentation and/or
  other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.