One of the most popular functions in text search systems is the function of comparing two lines [1]. Judging by the posts and publications, many programmers prefer the Levenshtein algorithm, which ensures the implementation of fuzzy search, i.e. search when not strictly specifying the search phrase [2,3,7]. The presence of the corresponding

The author of this article used the named subroutines (and a number of others) in the formation of the semantic kernel of multi-page sites. The problem of automated selection of unique or quasi-unique names in the semantics of sites (online stores) is known. The main drawback of the

With the latter problem, a binary measure of similarity, proposed by Paul Jaccard, copes well. At the same time, an essential shortcoming of the Jakkard algorithm is the criticality to errors in writing words in the compared texts: so if in one of the lines the word "corresponding" will have an error - "coresponding", the result of calculating the Jacquard coefficient is, on average,

The

The algorithm considered below provides insensitivity (invariance) to word rearrangement in compared phrases and is based on calculating the strict inclusion and the intersection between three finite sets. One set of

A = {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}, S_{1}= {s_{1}, s_{2}, s_{3}... s_{n}}, S_{2}= {s_{1}, s_{2}, s_{3}... s_{m}},

wherein,

The normalized metric in the positive range of numbers from 0 to 1 takes the form: The

The

For clarity in the figure below, using the Euler-Venn diagram, we show the desired relation of three sets:

In a PHP program, the power of the sets

$str = preg_replace('/[^a-z]/', '', $str); $char1 = count_chars($str, 1); foreach ($char1 as $key => $val) { $ch = chr($key); $hschar1{$ch} = $val; } $str = preg_replace('/[^a-z]/', '', $str); $char2 = count_chars($str, 1); foreach ($char2 as $key => $val) { $ch = chr($key); $hschar2{$ch} = $val; },

$streng = str_replace ($r, $l, strtolower ($str, 'UTF-8')),where

The

The set of Latin characters

$alfb = "abcdefghijklmnopqrstuvwxyz"; $alfbarr = preg_split("//",$alfb); $sim = 0; foreach($alfbarr as $aw){ if($aw!=""){ $val = $hschar1{$aw}; if($val){ $vv=$val; }else{ $vv=0; } $val2 = $hschar2{$aw}; if($val2){ $vv2=$val2; }else{ $vv2=0;} $sim += abs($vv - $vv2); } } $percent = 100*(1 - $sim/$lengtotal),where

$str1 = "Comparison of two strings with the help of an invariant metric" and $str2 = "Comparison of two strings using a metric that is invariant to the permutation of words"is presented in Table 1:

src = "images / invar_table1.gif" border = 0 alt = "Histogram of power values of sets of two lines" title = "Histogram of power values of sets of two lines"> |

The upper table graph refers to the line

Array ( [lengtotal] => 111 [Error.count] => 17 [Similarity] => 84.68 ),где

$alfb = "abcdefghijklmnopqrstuvwxyz0123456789".

$str1 = "Search algorithms and text analysis"; & $str2 = "Algorithm analysis text and search";

Array ( [lengtotal] => 55 [Error.count] => 1 [Similarity] => 98.18 )The result shows the registration of one error and almost 100% similarity of compared rows, despite the permutations. It is interesting to compare this result with the calculation of the Jacquard coefficient, which also ensures invariance to permutations of words. The similarity estimate for Jakkard gives a significantly understated result: 0.55-0.63 (accuracy of 0.05, 400 iterations). The

The

The "three-sets" method provides the invariability to word rearrangements and demonstrates resistance to errors in writing texts. An experimental comparison with the Levenshtein metric, the Oliver algorithm, and the Jacquard similarity coefficient has shown the

The

- String comparison functions. PHP Reference.
- Distance of Levenshtein.
- Application of fuzzy search algorithms in PHP
- The function levenshtein (). A guide to PHP.
- Similar_text () function. A guide to PHP.
- Calculation of the metric is invariant to a permutation of words. The program in PHP.
- Fuzzy search in the text and in the dictionary
- MinHash — we identify similar sets
- Н.В. Неелова, Сычугов А. А. The calculation of fuzzy takes by the Jacquard formula taking into account synonymous substitutions and stop words [russian]. Тула: Известия ТулГУ, 2009. To article
- Оценка алгоритмов сходства строковых переменных. Зеркальное регулярное выражение