// Copyright 2013, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Hannah Bast.

// Class that realizes error-tolerant prefix search.
class ErrorTolerantSearch {

  // PUBLIC MEMBERS.

  // Read input strings from file, where each line consists of one input string
  // followed by a TAB followed by a score. The input strings must contain
  // neither TABs nor newlines. The scores are used for ranking, see method
  // findMatches below.
  void readStringsFromFile(String fileName);

  // Build the q-gram index from the input strings read with
  // readStringsFromFile. Remember to transform each q-gram to lowercase.
  void buildQgramIndex(int q);

  // For the given query prefix, find all input strings to which the prefix edit
  // distance is at most the given delta. Rank the matches by the scores read in
  // readStringsFromFile. Return the top-k matches in that order.
  // As explained in the lecture, proceed as follows. First use the q-gram index
  // to compute a set of candidate matches, using method computeUnion below. Then
  // for each candidate match compute the exact PED to see whether it is really a
  // match.
  Array<String> findMatches(String prefix, int delta, int k);

  // PRIVATE MEMBERS.

  // Compute the prefix edit distance (PED) from the given query prefix to the
  // given string and check whether it is <= delta.
  // Note that the prefix edit distance is not symmetric.
  bool checkPrefixEditDistance(String prefix, String string, int delta);

  // Compute the union of the given inverted lists from the q-gram index. In the
  // result, along with each input string id, also store the number of q-grams
  // (from the input lists) containing that string.
  // IMPLEMENTATION NOTE 1: the union can be computed in much the same way as the
  // intersection. You can use a sequence of pairwise unions, as done for Exercise
  // Sheet 1 with intersections. Or you can compute the union of all lists at
  // once using a priority queue. The second one will be faster.
  // IMPLEMENTATION NOTE 2: in C++ one has to take care how to pass this
  // argument. While a reference to something complex is good in general,
  // the one calling this method (findMatches) shouldn't copy index lists, either.
  // A proper parameter would be (const vector<const vector<int>*>& list).
  Array<Pair<int, int>> computeUnion(Array<Array<int>> lists);

  // The strings and their scores.
  Array<String> strings;
  Array<int> scores;

  // The q used for the index, stored because it is needed at query time.
  int q;

  // The inverted lists of the q-gram index. For each q-gram that occurs in one
  // of the input strings, contains the list of ids of all input strings
  // containing that q-gram.
  Map<String, Array<int>> qgramIndex;
}
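
// ILLUSTRATIVE SKETCH (not part of the original skeleton). The two private
// methods above could look roughly as follows in C++. This is a minimal sketch
// under stated assumptions, not a reference solution: string ids are assumed to
// be ints, every inverted list is assumed to be sorted in ascending order, the
// list pointers are assumed to be non-null, delta is assumed to be >= 0, and
// the edit distance is assumed to be plain Levenshtein distance (insert,
// delete, replace). The parameter type of computeUnion follows the suggestion
// from IMPLEMENTATION NOTE 2, with the element type int being an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Returns true iff PED(prefix, str) <= delta, where PED is the minimum edit
// distance from prefix to any prefix of str. Assumes delta >= 0.
bool checkPrefixEditDistance(const std::string& prefix, const std::string& str,
                             int delta) {
  const std::size_t n = prefix.size();
  // A prefix of str within distance delta has at most n + delta characters,
  // so longer prefixes of str need not be considered.
  const std::size_t m = std::min(str.size(), n + static_cast<std::size_t>(delta));
  // prev[j] = edit distance between the first i - 1 characters of prefix and
  // the first j characters of str; cur[j] is the same for i characters.
  std::vector<int> prev(m + 1), cur(m + 1);
  for (std::size_t j = 0; j <= m; ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= n; ++i) {
    cur[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= m; ++j) {
      const int replace = prev[j - 1] + (prefix[i - 1] == str[j - 1] ? 0 : 1);
      cur[j] = std::min({replace, prev[j] + 1, cur[j - 1] + 1});
    }
    std::swap(prev, cur);
  }
  // The PED (restricted to the first m characters) is the minimum of the last row.
  return *std::min_element(prev.begin(), prev.end()) <= delta;
}

// Computes the union of all lists at once using a priority queue (min-heap).
// Returns (id, count) pairs sorted by id, where count is the number of
// occurrences of id over all input lists. Lists are passed by pointer so that
// the caller does not have to copy them.
std::vector<std::pair<int, int>> computeUnion(
    const std::vector<const std::vector<int>*>& lists) {
  // Heap entry: (current id, index of the list it comes from).
  typedef std::pair<int, std::size_t> Entry;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<std::size_t> pos(lists.size(), 0);  // Read position per list.
  for (std::size_t l = 0; l < lists.size(); ++l) {
    if (!lists[l]->empty()) heap.push(Entry((*lists[l])[0], l));
  }
  std::vector<std::pair<int, int>> result;
  while (!heap.empty()) {
    const Entry top = heap.top();
    heap.pop();
    if (!result.empty() && result.back().first == top.first) {
      ++result.back().second;  // Same id as before: one more occurrence.
    } else {
      result.push_back(std::make_pair(top.first, 1));
    }
    // Advance in the list the popped id came from.
    const std::size_t l = top.second;
    if (++pos[l] < lists[l]->size()) heap.push(Entry((*lists[l])[pos[l]], l));
  }
  return result;
}
```

// In findMatches, the counts returned by computeUnion would be used to filter
// the candidates before checkPrefixEditDistance is called on each of them. The
// exact threshold on the count is the one explained in the lecture and is
// deliberately not reproduced here.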