Satoshi Yoshida* and Takuya Kida* *Hokkaido University Graduate school of Information Science and Technology Division of Computer Science


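Searching on Compressed Data: a search program operates directly on the compressed data (for example 01110101110111 0100101001).

Classification of codes by the length of input-text blocks and of compressed-text codewords:
FF code (Fixed length to Fixed length code): fixed-length input blocks, fixed-length codewords
FV code (Fixed length to Variable length code): fixed-length input blocks, variable-length codewords; example: Huffman code
VF code (Variable length to Fixed length code): variable-length input blocks, fixed-length codewords; example: Tunstall code
VV code (Variable length to Variable length code): variable-length input blocks, variable-length codewords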

Tunstall code [Tunstall, 1967]
AIVF code [Yamamoto and Yokoo, 2001]: utilizing unused codewords
AIVF code using multiple parse trees [Yamamoto and Yokoo, 2001] (YY code): utilizing context between blocks
Improves compression ratio considerably

Multiple parse trees (k-1 parse trees): huge time and memory relative to the number of kinds of characters (k) in the input text.
Proposed method: VMA tree.

We can reduce the total number of nodes in comparison to YY code by using the VMA tree.
In experiments, we can also reduce the total number of nodes and the compression time considerably.
We found an upper bound and a lower bound of the number of nodes in the VMA tree.

Let Σ be a finite alphabet.
Elements in the alphabet are sorted in descending order of their probabilities.
We assume the information source is memoryless.
We simply say "encoding" to mean encoding into a sequence over {0, 1}.
Example: bbbbc, Σ = {a, b, c}

Input text: bbbbc
Each branch is labeled by a symbol in Σ = {a, b, c}.
Each leaf node and each incomplete internal node has a codeword.
Incomplete internal node: an internal node which doesn't have k children.
The input text is parsed into blocks, and we output the codeword corresponding to each block.
Compressed data sequence: 001 101 101 101 101 111
[Figure: parse tree with branches labeled a, b, c; the 3-bit codewords 000 to 111 are assigned to the leaves and incomplete internal nodes]
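The parsing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the block list `EXAMPLE_BLOCKS`, the codeword assignment (the i-th block gets the binary form of i), and the function names are all assumptions made for the example, and the tree is not the one drawn on the slide. The sketch also assumes the root has a branch for every symbol of Σ and that the input contains only symbols of Σ.

```python
class Node:
    def __init__(self):
        self.children = {}  # branch label -> child Node
        self.code = None    # codeword as a bit string, None if the node has none


# Hypothetical blocks over {a, b, c}; 8 blocks fit 3-bit codewords.
# "a", "aa" and "ab" end at incomplete internal nodes, the rest at leaves.
EXAMPLE_BLOCKS = ["a", "b", "c", "aa", "ab", "aaa", "aab", "aba"]


def build_tree(blocks, width):
    """Build a parse tree; the i-th block receives the codeword bin(i)."""
    root = Node()
    for i, block in enumerate(blocks):
        node = root
        for symbol in block:
            node = node.children.setdefault(symbol, Node())
        node.code = format(i, "0{}b".format(width))
    return root


def encode(root, text):
    """Parse text into blocks and return the list of their codewords."""
    out, node = [], root
    for symbol in text:
        if symbol not in node.children:
            # No branch for this symbol: the current block ends here,
            # so emit its codeword and restart from the root.
            out.append(node.code)
            node = root
        node = node.children[symbol]
    if node is not root:
        out.append(node.code)  # flush the last block
    return out


def decode(blocks, codes):
    """Map each codeword back to its block and concatenate."""
    return "".join(blocks[int(code, 2)] for code in codes)
```

With this hypothetical tree, `encode(build_tree(EXAMPLE_BLOCKS, 3), "ac")` returns `['000', '010']`: the node reached by "a" has no branch labeled c, so its own codeword is emitted, which is exactly the role of an incomplete internal node.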

We switch multiple parse trees according to the context.
[Figure: multiple parse trees with branches labeled a, b, c and 3-bit codewords on their nodes]

Problem: time and space consuming. We construct k-1 parse trees.
We can reduce the total number of nodes by sharing nodes.
We need to mark each node n in the VMA tree in order to tell which trees the node belongs to.
We can realize that by holding the least i such that n belongs to T_i.
[Figure: multiple parse trees T_0, T_1, T_2, ..., T_{k-2} integrated into one VMA tree]
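The node mark can be turned into a membership test along the following lines. This is a sketch under assumptions not spelled out on the slide: the root of T_j is taken to have branches only for the symbols with index j+1 to k (1-based, in descending probability order), and a node that belongs to T_i is taken to belong to every later tree that still contains its root branch. The function name and parameters are hypothetical.

```python
def belongs_to(least_tree, root_symbol, j, k):
    """Return True if a VMA-tree node is part of parse tree T_j.

    least_tree  -- mark stored in the node: the least i with the node in T_i
    root_symbol -- 1-based index m of the root branch above the node
    j           -- index of the parse tree being queried, 0 <= j <= k - 2
    k           -- alphabet size
    """
    # Assumption: the root branch with index m exists in T_0 .. T_{m-1},
    # and there are only k - 1 trees, T_0 .. T_{k-2}.
    last_tree = min(root_symbol - 1, k - 2)
    return least_tree <= j <= last_tree
```

Storing a single integer per node keeps the shared tree compact: one comparison against the current tree index decides whether a branch may be followed while parsing with T_j.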

From this theorem, we can easily tell which trees a node belongs to.
Theorem: Let $S_j^{(i)}$ be the subtree of $T_i$ that consists of all the nodes under the node corresponding with $a_j$, which is a direct child of the root. Then $S_{i+j}^{(i+1)}$ completely covers $S_{i+j}^{(i)}$. We denote this relation by $S_{i+j}^{(i)} \preceq S_{i+j}^{(i+1)}$.
[Figure: $T_i$ with root subtrees $S_{i+1}^{(i)}, S_{i+2}^{(i)}, \ldots, S_k^{(i)}$ and $T_{i+1}$ with root subtrees $S_{i+2}^{(i+1)}, \ldots, S_k^{(i+1)}$]

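We call the integrated parse tree the VMA tree.
From the theorem, we have, for each root subtree, the chain $S_j^{(0)} \preceq S_j^{(1)} \preceq \cdots$ over all the trees that contain it.
The VMA tree $T_V$ therefore has the root subtrees
$S_1 = S_1^{(0)}$, $S_2 = S_2^{(1)}$, $S_3 = S_3^{(2)}$, ..., $S_{k-2} = S_{k-2}^{(k-3)}$, $S_{k-1} = S_{k-1}^{(k-2)}$, $S_k = S_k^{(k-2)}$.
[Figure: multiple parse trees $T_0, T_1, \ldots, T_{k-2}$ integrated into the VMA tree]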

Theorem: The number of reduced nodes is not less than [...].

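Number of reduced root subtrees in each tree: $T_0$: $k-1$, $T_1$: $k-2$, ..., $T_{k-3}$: 2; the subtrees of $T_{k-2}$ remain in $T_V$.
Each reduced subtree has at least one node, so the summation of these counts bounds the number of reduced nodes from below.
[Figure: trees $T_0, T_1, \ldots, T_{k-3}, T_{k-2}$ with their root subtrees, and the VMA tree $T_V$]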
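As a worked check of the summation, assuming the per-tree counts of reduced subtrees are $k-1, k-2, \ldots, 2$ for $T_0, \ldots, T_{k-3}$ (a reading of the slide, not a confirmed statement of the theorem), the total number of reduced subtrees is:

```latex
\[
\sum_{i=0}^{k-3} (k-1-i) \;=\; (k-1) + (k-2) + \cdots + 2
\;=\; \frac{k(k-1)}{2} - 1
\;=\; \frac{(k+1)(k-2)}{2}.
\]
```

For $k = 3$ this gives 2, matching the two subtrees of $T_0$ that are covered by subtrees of $T_1$. Whether the bound stated in the theorem contains further terms cannot be confirmed from the slide.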

Theorem: The number of nodes in the VMA tree is not less than [...].

[...] is the largest subtree, and it remains in the VMA tree.
The VMA tree is smallest when the symbol probabilities satisfy Pr [...] Pr [...] Pr [...].

[Figure: VMA tree $T_V$ with the number of codewords and the number of nodes listed for each root subtree $S_1, S_2, \ldots$]
The summation is not less than [...].

Comparison:
Total number of nodes (Exp1)
Compression times (Exp2)
Total number of nodes, with the upper bound and lower bound of nodes in the VMA tree, on random sequences (Exp3)

Algorithms:
YY coding (YY)
Encoding using VMA tree (VMA)

Environment:
CPU: Intel Pentium 4 Processor 3.0GHz Hyper Threading
Memory: 2GB
OS: Debian GNU/Linux 5.0
Language: C++
Compiler: g++ 4.3

Codeword length: 12 bits

The Canterbury Corpus

file           size (in bytes)   k     content
alice29.txt    152,089           74    English text
asyoulik.txt   125,179           68    Shakespeare
cp.html        24,603            86    HTML source
fields.c       11,150            90    C source
grammar.lsp    3,721             76    LISP source
kennedy.xls    1,029,744         256   Excel Spreadsheet
lcet10.txt     424,754           84    Technical writing
plrabn12.txt   481,861           81    Poetry
ptt5           513,216           159   CCITT test set
sum            38,240            255   SPARC Executable
xargs.1        4,227             74    GNU manual page

http://corpus.canterbury.ac.nz/descriptions
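The k column can be reproduced with a short Python helper, assuming k counts the distinct byte values that occur in a file (consistent with the maximum of 256 in the table, but an assumption nonetheless). The function name is hypothetical.

```python
def alphabet_size(data: bytes) -> int:
    """Number of distinct byte values in data (the k of the table, by assumption)."""
    return len(set(data))
```

For a corpus file this would be called as `alphabet_size(open(path, "rb").read())`, where `path` points to a local copy of the file.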

[Figure (Exp1): total amount of nodes (x 10^3) for VMA and YY on each corpus file, k: 74 68 86 90 76 256 84 81 159 255 74; annotations: "7 times" and "32 times"]

[Figure (Exp2): compression time (sec) for VMA and YY on each corpus file, k: 74 68 86 90 76 256 84 81 159 255 74; annotations: "3 times" and "18 times"]

[Figure (Exp3): two panels, "the number of nodes on uniform distribution" and "the number of nodes on Zipf distribution"; x-axis: alphabet size (0 to 256); y-axis: total number of nodes; curves: VMA, YY, upper bound, lower bound]

We showed that we can reduce the total number of nodes in comparison to YY code by using the VMA tree.
We also showed that we can reduce the total number of nodes and the compression times considerably in experiments.
We found an upper bound and a lower bound of the number of nodes in the VMA tree.

Future work:
Applying this idea to STVF coding [Kida, DCC2009]
Finding tighter bounds