// Copyright 2013, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Hannah Bast.

// NOTE: This is a code design suggestion in pseudo-code. It is not supposed to
// be compilable in any language. You have to translate it to Java or C++
// yourself. The purpose of this file is to suggest a basic design and settle
// questions you might have on what exactly your code is supposed to do.

// K-means clustering of a set of documents.
//
// IMPLEMENTATION NOTE: The representation of the documents is key for an
// efficient and simple implementation. In the methods below, two
// representations are used:
// DENSE = a document is represented as an Array of size m, where m is the
// total number of distinct terms in the given text collection. Each entry is a
// non-negative score for that term in that document. If the term does not
// occur in that document, the entry is 0.
// SPARSE = a document is represented as a Map of a given size M, where
// M << m. Here, only the M terms with the largest non-zero scores are stored.
// If a document contains M' < M distinct terms, the number of entries is M'.
class KmeansClustering {

  // PUBLIC MEMBERS

  // Create document vectors (in SPARSE representation) from a given inverted
  // index with BM25 scores.
  //
  // Also fill the terms Array, whose indices correspond to the term ids used
  // in the maps for the sparse documents.
  //
  // IMPLEMENTATION NOTE: For the class InvertedIndex, copy your implementation
  // from Exercise Sheet 2, or the corresponding master solution. Remove all
  // unnecessary stuff from the class which you do not need for this exercise
  // sheet.
  void createDocuments(InvertedIndex invertedIndex);

  // Compute k clusters for the documents from createDocuments using k-means.
  //
  // IMPLEMENTATION NOTES:
  // 1. Before you start, normalize the documents using normalizeDocument below
  // (assuming that you did not already do that in createDocuments).
  // 2. As initial centroids, pick a random subset of size k from the
  // documents. A simple way to pick a random subset is to initialize an array
  // of size n with the entries 1, ..., n, where n is the number of documents.
  // Then, for i = 1, ..., k pick a random position j from [i .. n] and swap
  // the entries at positions i and j (see the example sketch after
  // normalizeDocument below).
  // 3. In step A of each iteration (re-assign to centroids), compute the n * k
  // distances between the n documents and the k centroids using the method
  // distance below.
  // 4. In step B of each iteration (re-compute centroids), compute the new
  // centroids by a single iteration over all n documents. First compute the
  // centroids in DENSE representation using O(n * m) operations. Then use
  // truncateDocument to get them in SPARSE representation again. Then
  // normalize them again.
  // 5. Choose a proper way to terminate your algorithm. The goal is to get a
  // reasonably low RSS as well as a not too long running time (both should be
  // reported on the Wiki for Exercise 3).
  void cluster(int k, int M);

  // Write the centroids to a file.
  void writeCentroidsToFile(String fileName);

  // PRIVATE MEMBERS

  // Normalize a given document in SPARSE representation such that its L2-norm
  // (= the square root of the sum of the squares of the entries) is 1.
  //
  // IMPLEMENTATION NOTE: In C++, make sure that you do not pass the argument
  // by value here, since the purpose of the method is to modify it.
  void normalizeDocument(Map document);
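
  // EXAMPLE (illustration only, not part of the required interface): a minimal
  // Java sketch of two of the building blocks described above, namely the
  // random pick of the initial centroids (implementation note 2 of cluster)
  // and normalizeDocument. It assumes that a document in SPARSE representation
  // is a Map<Integer, Double> from term id to score, that the usual java.util
  // imports (Map, Random) are in place, and that indices are 0-based. The name
  // pickRandomSubset is made up for illustration.

  // Pick a random subset of size k from {0, ..., n - 1} with a partial
  // Fisher-Yates shuffle: for each i < k, swap position i with a random
  // position j in [i, n).
  int[] pickRandomSubset(int k, int n, Random random) {
    int[] positions = new int[n];
    for (int i = 0; i < n; i++) { positions[i] = i; }
    for (int i = 0; i < k; i++) {
      int j = i + random.nextInt(n - i);
      int tmp = positions[i];
      positions[i] = positions[j];
      positions[j] = tmp;
    }
    int[] subset = new int[k];
    System.arraycopy(positions, 0, subset, 0, k);
    return subset;
  }

  // Divide every score by the L2-norm of the document, so that afterwards the
  // sum of the squared scores is 1. The map is modified in place.
  void normalizeDocument(Map<Integer, Double> document) {
    double sumOfSquares = 0.0;
    for (double score : document.values()) { sumOfSquares += score * score; }
    if (sumOfSquares == 0.0) { return; }  // Empty document, nothing to do.
    double norm = Math.sqrt(sumOfSquares);
    for (Map.Entry<Integer, Double> entry : document.entrySet()) {
      entry.setValue(entry.getValue() / norm);
    }
  }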
  // Truncate a given document in DENSE representation to the M terms with the
  // highest scores, and return the truncated document in SPARSE
  // representation. If the document contains only M' << M distinct terms, the
  // truncated document has only M' entries (see the example sketch after the
  // class).
  Map truncateDocument(Array document, int M);

  // Compute the distance between two documents in SPARSE representation,
  // assuming that they have been normalized (with normalizeDocument) before.
  // The distance is then simply 1 - x * y, where x * y is the dot product.
  //
  // IMPLEMENTATION NOTE: To compute the dot product between two documents in
  // SPARSE representation, just iterate over one of the maps, and sum up the
  // products of the entries for all keys (term ids) which are also in the
  // other map (see the example sketch after the class).
  float distance(Map x, Map y);

  // Get the strings corresponding to the k terms with the highest scores in
  // the document.
  Array getTopKTerms(Map document, int k);

  // The documents of the collection in SPARSE representation.
  Array<Map> documents;

  // The k centroids in SPARSE representation.
  Array<Map> centroids;

  // The terms corresponding to the term ids.
  Array terms;
}
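
// EXAMPLE (illustration only): Java sketches of distance and truncateDocument
// as described above, shown after the class only for readability; in a real
// implementation they would be members of KmeansClustering. As above, a
// document in SPARSE representation is assumed to be a Map<Integer, Double>
// from term id to score, a document in DENSE representation a double[] indexed
// by term id, and the usual java.util imports (Map, HashMap, PriorityQueue)
// are assumed to be in place.

// Compute 1 - <x, y> for two normalized documents by iterating over the
// smaller of the two maps and looking up each of its term ids in the other.
float distance(Map<Integer, Double> x, Map<Integer, Double> y) {
  Map<Integer, Double> smaller = x.size() <= y.size() ? x : y;
  Map<Integer, Double> larger = x.size() <= y.size() ? y : x;
  double dotProduct = 0.0;
  for (Map.Entry<Integer, Double> entry : smaller.entrySet()) {
    Double otherScore = larger.get(entry.getKey());
    if (otherScore != null) { dotProduct += entry.getValue() * otherScore; }
  }
  return (float) (1.0 - dotProduct);
}

// Keep the M terms with the highest scores using a min-heap of size at most
// M + 1; terms with score 0 are never included, so a document with only
// M' < M distinct terms yields a map with M' entries.
Map<Integer, Double> truncateDocument(double[] document, int M) {
  PriorityQueue<Integer> heap =
      new PriorityQueue<>((a, b) -> Double.compare(document[a], document[b]));
  for (int termId = 0; termId < document.length; termId++) {
    if (document[termId] <= 0.0) { continue; }
    heap.add(termId);
    if (heap.size() > M) { heap.poll(); }  // Evict the smallest kept score.
  }
  Map<Integer, Double> truncated = new HashMap<>();
  for (int termId : heap) { truncated.put(termId, document[termId]); }
  return truncated;
}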