Name DSLIP Description Info
------------------ ----- ------------------------------------------- --------
String
::KeyboardDistance Rdpfp Approximate String Comparison KRBURTON
Approximate string comparison.
NAME
String::KeyboardDistance - String Comparison Algorithm
SYNOPSIS
use String::KeyboardDistance qw(:all);
my $s1 = 'Apple';
my $s2 = 'Wople';
# compute a match probability
my $pr = qwerty_keyboard_distance_match('Apple','Wople');
# find the keyboard distance between two strings
my $dst = qwerty_keyboard_distance('IBM','HAL');
# find the keyboard distance between two characters
$dst = qwerty_char_distance('a','v');
print "maximum distance: $qwerty_max_distance\n";
DESCRIPTION
This module implmements a version of keyboard distance for fuzzy string
matching. Keyboard distance is a measure of the physical distance
between two keys on a keyboard. For example, 'g' has a distance of 1
from the keys 'r', 't', 'y', 'f', 'h', 'v', 'b', and 'n'. Immediate
diagonals (like ''r, 'y', 'v', and 'n') are considered to have a
distance of 1 instead of 1.414 to help to prevent horizontal/vertical
bias.
A match probability between two strings is computed from the total
distances between corresponding characters divided by the length of the
longer string multiplied by the maximum distance between the two
furthest keys on the keyboard.
The functions in this module use a simple grid of keys. For the qwerty
mapping, the grid is similar to the following:
| 0 1 2 3 4 5 6 7 8 9 10 11 12 13
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
0 | ` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | - | = | |
1 | | q | w | e | r | t | y | u | i | o | p | [ | ] | \ |
2 | | a | s | d | f | g | h | j | k | l | ; | ' | | |
3 | | z | x | c | v | b | n | m | , | . | / | | | |
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
The grids for both qwerty and dvorak are based on PC style keyboards.
Shifted characters have the same coordinates (a and A, 6 and ^).
EXPORT
This module exports no symbols by default. The following functions are
available for export through EXPORT_OK:
build_qwerty_map
max_qwerty_distance
qwerty_char_distance
qwerty_keyboard_distance
qwerty_keyboard_distance_match
build_dvorak_map
max_dvorak_distance
dvorak_char_distance
dvorak_keyboard_distance
dvorak_keyboard_distance_match
grid_distance
The following varialbes are availalbe for export through EXPORT_OK:
@qwerty_grid
$qwerty_map
$qwerty_max_distance
@dvorak_grid
$dvorak_map
$dvorak_max_distance
Additionaly, this module supports the following export tags:
:Functions - import all functions
:Variables - import all variables
:all - import both functions and variables
API REFERENCE
build_qwerty_map
param: array reference to receive the map [optional]
return: the map (array reference)
This function builds a map of each character to its corresponding
location on the keyboard. The location is derived by looking at the
keyboard as a simple grid in which the location of keys on the keyboard
are considereed to be the same as their shifted values. The following
keys are considered to have the same location:
1 and !
r and r
/ and ?
The map is an array, where the index is the value returned by chr() for
the character at that location. The value is an array ref describing the
location of the character on the keyboard. The first element is the y
position, the second is the x, and the third represents the shift value
(0 for non-shifted, 1 for shifted). All non-keyable characters,
including tabs and spaces, will have undef values in the map, meaning
they have no point on the grid. These non-key characterss should be
considered to be the maximum distance away from any other character
except themselves.
The map is constructed at run-time from the @qwerty_grid package global,
and cached in the $qwerty_map package global.
build_dvorak_map
param: array reference to receive the map [optional]
return: the map (array reference)
This function is identical to build_qwerty_map, with the exception that
it builds a map for dvorak keyboards, and the map is constructed from
the @dvorak_grid package global, and cached in the $dvorak_map package
global.
max_qwerty_distance
return: maximum distance
This function dynamicly computes the maximum distance between keys in
the qwerty map. The maximum key distance is stored in the
$qwerty_max_distance package global.
max_dvorak_distance
return: maximum distance
This function dynamicly computes the maximum distance between keys in
the dvorak map. The maximum key distance is stored in the
$dvorak_max_distance package global.
qwerty_char_distance
param: char 1
param: char 2
return: distance
This function computes the distance between the two characters passed on
a qwerty keyboard.
dvorak_char_distance
param: char 1
param: char 2
return: distance
This function computes the distance between the two characters passed on
a dvorak keyboard.
grid_distance
param: x1 - point 1's x coordinate
param: y1 - point 1's y coordinate
param: x2 - point 2's x coordinate
param: y2 - point 2's y coordinate
return: distance between points
This function returns the distance between two points. If the two points
have an x distance of 1, and a y distance of 1, then they are considered
to be a distance of 1 apart. This is meant to help prevent
horizontal/vertical bias in the distancing function. Otherwise we use
the following formula:
sqrt( (x1 - x2)**2 + (y1 - y2)**2 );
qwerty_keyboard_distance
param: string1
param: string2
return: distance
Returns the sum of the distances between corresponding characters in the
two strings. If one string is longer than the other the remaining
characters are counted as having the same value as the maximum distance.
qwerty_keyboard_distance_match
param: string1
param: string2
return: probability of match
The probability of a match is:
Pr = 1 - ( D / (L * M) )
Where D is the distance between the two strings, L is the length of the
longer string, and M is the maximum character distance.
dvorak_keyboard_distance
param: string1
param: string2
return: distance
Returns the sum of the distances between corresponding characters in the
two strings. If one string is longer than the other the remaining
characters are counted as having the same value as the maximum distance.
dvorak_keyboard_distance_match
param: string1
param: string2
return: probability of match
The probability of a match is computed in the same way as for
qwerty_keyboard_distance_match().
BUGS
The grids are specific to US style PC keyboards. It is possible to
substitue and use your own keyboard maps, but it's not well documented.
Also, even US keyboards differ, for instance, in where they have the
backslash and pipe characters. The grids in this module assume that it's
in the last position of the second row.
The space and tab keys are ignored entierly. Whitespace is irrelevant
for the most common use of this module (for the author at least), which
is matching words or names.
Numbers are distanced from their locations at the top of the keyboard.
It should be an option to perform a keyboard distance using only the
numeric keypad.
If one string is a substring of the other (abstained vs stained), the
matching could be improved by either starting at the substring offset,
or by prepending spaces onto the front of the shorter string so it
matched the substring offset.
Currently the maps, and the max lengths are computed dynamicly. The
module would load faster if these were hard coded data.
It should be possible to speed up the qwerty_char_distance and
dvorak_char_distance functions by caching (for example, by using the
Memoize module).
AUTHOR
Kyle R. Burton
HMS 625 Ridge Pike Building E Suite 400 Conshohocken, PA 19428
SEE ALSO
perl(1). String::Approx. Text::DoubleMetaphone. String::Similarity.