Affiliations: [a] Department of Information Technology Management, University at Albany, SUNY, USA. E-mail: hong@albany.edu | [b] Department of Management Science and Information Systems, Rutgers University, USA. E-mail: jsvaidya@business.rutgers.edu | [c] Department of Operations and Management Information Systems, Santa Clara University, USA. E-mail: hlu@scu.edu | [d] Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada. E-mail: l_wenmin@ciise.concordia.ca
Abstract: To be more user-friendly, commercial search engines commonly provide suggestions or recommendations for each query a user poses. Clustering semantically similar queries is an essential prerequisite for these applications to work well. However, clustering queries effectively is challenging, since queries are usually short, incomplete, and ambiguous. Prevalent clustering methods such as K-Means and DBSCAN cannot guarantee good performance in such a high-dimensional setting. Hierarchical agglomerative clustering over users’ click-through query logs gives good results but is computationally expensive. This paper identifies a novel feature for clustering search queries based on a key insight: a query’s top-ranked search results can themselves be used to quantify query similarity. Building on this feature, we propose a new similarity metric for comparing diverse queries, which enables us to develop two efficient and accurate query clustering algorithms. We conduct comprehensive experiments comparing the accuracy of our approach against known baselines along two dimensions: 1) quantifying the cohesion and separation of the clustered queries, and 2) validating the results with real-world Internet users. The experimental results demonstrate that our two algorithms, together with the similarity metric, generate more accurate results in significantly less time.
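The key insight can be illustrated with a minimal sketch. The abstract does not define the proposed similarity metric, so the code below substitutes a plain Jaccard overlap between top-k result lists as a stand-in. The function name `top_k_overlap`, the cutoff `k`, and the example queries and result URLs are all hypothetical and chosen only for illustration.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Jaccard overlap between the top-k search results of two queries.

    This is an assumed stand-in, not the metric proposed in the paper.
    """
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 0.0  # no results for either query: treat as dissimilar
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical top-ranked result URLs for three queries.
results = {
    "cheap flights": ["kayak.example", "expedia.example", "skyscanner.example"],
    "low cost airfare": ["expedia.example", "kayak.example", "cheapoair.example"],
    "python tutorial": ["python.example", "realpython.example", "w3schools.example"],
}

sim_related = top_k_overlap(results["cheap flights"], results["low cost airfare"])
sim_unrelated = top_k_overlap(results["cheap flights"], results["python tutorial"])
print(sim_related, sim_unrelated)  # 0.5 0.0
```

The two travel queries share no words, yet their result lists overlap, so they score as similar. This is the property that makes search results attractive as a feature for short, ambiguous queries.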