Affiliations: [a] National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland, USA. E-mail: ayush.singhal@nih.gov | [b] Department of Computer Science and Engineering, University of Minnesota, Twin Cities, Minneapolis, Minnesota, USA. E-mail: srivasta@cs.umn.edu
Abstract: Scientific datasets play a crucial role in data-driven research. While several repositories curate public datasets, many more datasets and their usage are “hidden” in research publications. Hence, discovering a relevant dataset for a research topic requires in-depth investigation of several publications, tracking of dataset usage, and an exhaustive literature search. A search engine that directly handles the research dataset discovery problem would therefore be extremely useful for the scientific community. In this work, we define an important paradigm of dataset search known as “dataset discovery in application context”. Unlike look-up search, where the user looks up a dataset by name in a repository, application-context-based search corresponds to search without information about the name of the dataset. Such searches arise when the user is looking for the best-fit dataset for their research problem. We show that in this paradigm of search, conventional methods that index the short description text of a dataset do not work, because the description contains little text about the dataset's applications. To alleviate this problem we propose two models of search, namely, (1) a user-profile-based search and (2) a keyword-based search. We show that in both models, dataset discovery is done in the application context by leveraging information from open-source web resources such as scholarly article repositories and academic search engines. The performance of the proposed models was tested with simulated test queries (user profiles) as well as with real-world user studies.
Keywords: Search engine, text mining, context generation, dataset search