Self-Indexed Grammar-Based Compression

Claude, Francisco; Navarro, Gonzalo

doi:10.3233/FI-2011-565

Self-Indexed Grammar-Based Compression

Article type: Research Article

Authors: Claude, Francisco | Navarro, Gonzalo

Affiliations: David R. Cheriton School of Computer Science, University of Waterloo, Canada. fclaude@cs.uwaterloo.ca | Department of Computer Science, University of Chile, Chile. gnavarro@dcc.uchile.cl

Note: [] Address for correspondence: David R. Cheriton School of Computer Science, University of Waterloo. 200 University Avenue West, Waterloo, Ontario, Canada. N2L 3G1. Funded in part by NSERC Canada, Go-Bell Scholarships program and David R. Cheriton Graduate Scholarships program.

Note: [] Funded in part by Millennium Institute on Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile.

Abstract: Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log2 n bits. For that same SLP, our self-index takes O(n log n) + n log2 u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log2 n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.

Keywords: Grammar-based Compression, Straight-Line Programs, Self-Indexes, Compressed Text Databases, Highly Repetitive Sequences, Pattern Matching, Data Structures

DOI: 10.3233/FI-2011-565

Journal: Fundamenta Informaticae, vol. 111, no. 3, pp. 313-337, 2011

Published: 2011

Price: EUR 27.50

North America

IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA

Tel: +1 703 830 6300
Fax: +1 703 830 2300
sales@iospress.com

For editorial issues, like the status of your submitted paper or proposals, write to editorial@iospress.nl

Europe

IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands

Tel: +31 20 688 3355
Fax: +31 20 687 0091
info@iospress.nl

For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office info@iospress.nl

Asia

Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China

Free service line: 400 661 8717
Fax: +86 10 8446 7947
china@iospress.cn

For editorial issues, like the status of your submitted paper or proposals, write to editorial@iospress.nl

如果您在出版方面需要帮助或有任何建, 件至: editorial@iospress.nl

Share this:

North America

Europe

Asia