Satoshi Yoshida* and Takuya Kida* *Hokkaido University Graduate school of Information Science and Technology Division of Computer Science



Searching on Compressed Data

A program searches the compressed data (01110101110111 0100101001 ...) directly.

Classification of codes by the block lengths of the input text and of the compressed text:

  FF code (Fixed length to Fixed length code)
  FV code (Fixed length to Variable length code), e.g. Huffman code
  VF code (Variable length to Fixed length code), e.g. Tunstall code
  VV code (Variable length to Variable length code)

Tunstall code [Tunstall, 1967]
AIVF code [Yamamoto and Yokoo, 2001]: utilizes unused codewords
AIVF code using multiple parse trees (YY code) [Yamamoto and Yokoo, 2001]: improves the compression ratio considerably by utilizing the context between blocks
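As background for the VF codes listed here, the following is a minimal sketch of the classical Tunstall construction, assuming a memoryless source given as a symbol-to-probability table. The function names `tunstall_blocks` and `assign_codewords` and the example probabilities are illustrative assumptions, not taken from the slides.

```python
import heapq

def tunstall_blocks(probs, codeword_bits):
    """Return the leaves (blocks) of a Tunstall parse tree.

    probs: dict mapping each single-character symbol to its probability.
    codeword_bits: length of every codeword in bits.
    """
    k = len(probs)
    limit = 2 ** codeword_bits
    if k > limit:
        raise ValueError("not enough codewords for the alphabet")
    # Max-heap on probability, stored as (-probability, block).
    heap = [(-p, s) for s, p in probs.items()]
    heapq.heapify(heap)
    # Expanding a leaf removes it and adds k children: a net gain of k - 1.
    while len(heap) + k - 1 <= limit:
        neg_p, block = heapq.heappop(heap)  # most probable leaf
        for s, p in probs.items():
            heapq.heappush(heap, (neg_p * p, block + s))
    return sorted(block for _, block in heap)

def assign_codewords(blocks, codeword_bits):
    """Give each block a fixed-length binary codeword (hypothetical ordering)."""
    return {b: format(i, "0{}b".format(codeword_bits)) for i, b in enumerate(blocks)}
```

With three symbols and 3-bit codewords, only 7 of the 8 codewords end up assigned to leaves in this sketch; the AIVF code is described above as utilizing such unused codewords.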

YY code uses multiple parse trees (k − 1 parse trees), which costs a huge amount of time and memory relative to the number of kinds of characters (k) in the input text.

Proposed method: VMA tree

We can reduce the total number of nodes in comparison to YY code by using the VMA tree.
Experiments also show that the total number of nodes and the compression time are reduced considerably.
We found an upper bound and a lower bound on the number of nodes in the VMA tree.

Let Σ be a finite alphabet.
The elements of the alphabet are sorted in descending order of their probabilities.
We assume that the information source is memoryless.
By "encoding" we simply mean encoding into a sequence over {0, 1}.

Example: Σ = {a, b, c}
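The ordering convention above can be sketched as follows, assuming the symbol probabilities are estimated from the text itself by counting. The helper name `sorted_alphabet` is hypothetical.

```python
from collections import Counter

def sorted_alphabet(text):
    """Return [(symbol, probability)] in descending order of probability.

    Probabilities are empirical frequencies, which matches a memoryless
    source model: each symbol is treated independently of its neighbours.
    """
    counts = Counter(text)
    total = len(text)
    # Ties are broken by symbol so that the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sym, cnt / total) for sym, cnt in ranked]
```

The length of the returned list is k, the number of kinds of characters in the input text.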

Each branch is labeled by a symbol in Σ = {a, b, c}.
Each leaf node and each incomplete internal node has a codeword.
Incomplete internal node: an internal node which doesn't have k children.
The input text is parsed into blocks, and we output the codeword corresponding to each block.

[Figure: parse tree over Σ = {a, b, c} with the 3-bit codewords 000 to 111 assigned to its leaves and incomplete internal nodes]

input text: bbbbc
compressed data sequence: 001 101 101 101 101 111
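The parsing step can be sketched as below. The codebook is a hypothetical parse tree over {a, b, c} with 3-bit codewords, chosen only for illustration; it is not the tree from the slide figure. Every non-root node in it is either a leaf or an incomplete internal node, so every node carries a codeword and greedy descent always stops at a node that can be output.

```python
# Hypothetical parse tree: each key is the path from the root to a node.
CODEBOOK = {
    "a": "000",    # incomplete internal node (children a, b)
    "aa": "001",   # incomplete internal node (children a, b)
    "aaa": "010",  # leaf
    "aab": "011",  # leaf
    "ab": "100",   # leaf
    "b": "101",    # incomplete internal node (child a)
    "ba": "110",   # leaf
    "c": "111",    # leaf
}

def encode(text, codebook):
    """Parse text into blocks by descending the tree as far as possible."""
    out = []
    i = 0
    while i < len(text):
        block = text[i]
        if block not in codebook:
            raise ValueError("symbol not in the parse tree: " + block)
        j = i + 1
        # Descend while the next symbol labels a child of the current node.
        while j < len(text) and block + text[j] in codebook:
            block += text[j]
            j += 1
        out.append(codebook[block])
        i = j
    return out

def decode(codewords, codebook):
    """Map each fixed-length codeword back to its block."""
    inverse = {code: block for block, code in codebook.items()}
    return "".join(inverse[c] for c in codewords)
```

For example, `encode("aabbac", CODEBOOK)` parses the text into the blocks aab, ba, c.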

We switch among multiple parse trees according to the context.

[Figure: multiple parse trees over {a, b, c}, each with 3-bit codewords assigned to its nodes]

Problem: time and space consuming
We construct k − 1 parse trees.

We can reduce the total number of nodes by sharing nodes.
We need to mark each node n in the VMA tree in order to tell which trees the node belongs to.
We can realize that by holding the least i such that n belongs to $T_i$.

[Figure: the multiple parse trees $T_0, T_1, T_2, \ldots, T_{k-2}$ are integrated into a single VMA tree]

From this theorem, we can easily tell which trees a node belongs to.

Theorem
Let $S_j^{(i)}$ be the subtree of $T_i$ that consists of all the nodes under the node corresponding to $a_j$, which is a direct child of the root. Then $S_{i+j}^{(i+1)}$ completely covers $S_{i+j}^{(i)}$.
We denote this relation by $S_{i+j}^{(i)} \preceq S_{i+j}^{(i+1)}$.

[Figure: $T_i$ with subtrees $S_{i+1}^{(i)}, S_{i+2}^{(i)}, \ldots, S_k^{(i)}$ next to $T_{i+1}$ with subtrees $S_{i+2}^{(i+1)}, \ldots, S_k^{(i+1)}$]
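The marking idea can be sketched as follows, assuming each parse tree is represented as a prefix-closed set of blocks (root-to-node paths) and that the covering property of the theorem holds. The two small trees and the names `multiplex` and `in_tree` are hypothetical and serve only to illustrate the bookkeeping.

```python
def multiplex(trees):
    """Merge the block sets T_0, T_1, ... into one marked node set.

    Returns a dict: node -> least i such that the node belongs to T_i.
    """
    mark = {}
    for i, tree in enumerate(trees):
        for block in tree:
            mark.setdefault(block, i)  # keeps the least i seen so far
    return mark

def in_tree(mark, alphabet, block, i):
    """True if block is a node of T_i, assuming the covering property.

    The root of T_i only has children for symbols of index >= i (0-based),
    and a node marked m is assumed to appear in every later tree whose
    root still has the corresponding child.
    """
    if block not in mark:
        return False
    return mark[block] <= i and alphabet.index(block[0]) >= i

ALPHABET = "abc"
T0 = {"a", "aa", "ab", "b", "ba", "c"}      # root has children a, b, c
T1 = {"b", "ba", "bb", "baa", "c", "ca"}    # root has children b, c
MARK = multiplex([T0, T1])
```

In this toy case the two separate trees hold 12 nodes in total, while the merged structure holds 9, and `in_tree` answers membership for either tree from the single mark per node.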

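We call the integrated parse tree the VMA tree.

From the theorem, where $\preceq$ denotes the covering relation, we have:

$$S_2^{(0)} \preceq S_2^{(1)}, \qquad S_3^{(0)} \preceq S_3^{(1)} \preceq S_3^{(2)}, \qquad \ldots, \qquad S_k^{(0)} \preceq \cdots \preceq S_k^{(k-2)}.$$

The VMA tree $T_V$ consists of the subtrees

$$S_1 = S_1^{(0)},\quad S_2 = S_2^{(1)},\quad S_3 = S_3^{(2)},\quad \ldots,\quad S_{k-2} = S_{k-2}^{(k-3)},\quad S_{k-1} = S_{k-1}^{(k-2)},\quad S_k = S_k^{(k-2)}.$$

[Figure: the multiple parse trees $T_0, T_1, \ldots, T_{k-2}$ and the resulting VMA tree]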

Theorem
The number of reduced nodes is not less than [expression not recoverable].

Each reduced subtree has at least one node.

[Figure: the reduced subtrees of $T_0, T_1, \ldots, T_{k-3}, T_{k-2}$ compared with the subtrees $S_1, S_2, \ldots$ kept in $T_V$, and the summation of their counts]

Theorem
The number of nodes in the VMA tree is not less than [expression not recoverable].

The largest subtree remains in the VMA tree.
The VMA tree is smallest when [condition on the symbol probabilities Pr(·) not recoverable].

[Figure: number of codewords and number of nodes for each subtree $S_1, S_2, \ldots$ of $T_V$; the summation is not less than the bound in the theorem]

Comparison
  Total number of nodes (Exp1)
  Compression times (Exp2)
  Total number of nodes, and upper bound and lower bound of nodes in the VMA tree, on random sequences (Exp3)

Algorithms
  YY coding (YY)
  Encoding using VMA tree (VMA)

Environments
  CPU: Intel Pentium 4 Processor 3.0GHz Hyper Threading
  Memory: 2GB
  OS: Debian GNU/Linux 5.0
  Language: C++
  Compiler: g++ 4.3
  Codeword length: 12 bits

The Canterbury Corpus

  file           size (in bytes)    k    content
  alice29.txt            152,089   74    English text
  asyoulik.txt           125,179   68    Shakespeare
  cp.html                 24,603   86    HTML source
  fields.c                11,150   90    C source
  grammar.lsp              3,721   76    LISP source
  kennedy.xls          1,029,744  256    Excel Spreadsheet
  lcet10.txt             424,754   84    Technical writing
  plrabn12.txt           481,861   81    Poetry
  ptt5                   513,216  159    CCITT test set
  sum                     38,240  255    SPARC Executable
  xargs.1                  4,227   74    GNU manual page

http://corpus.canterbury.ac.nz/descriptions

[Figure (Exp1): total amount of nodes (×10^3) for VMA and YY on each file, k = 74, 68, 86, 90, 76, 256, 84, 81, 159, 255, 74; chart annotations: "32 times" and "7 times"]

[Figure (Exp2): compression time (sec) for VMA and YY on each file, k = 74, 68, 86, 90, 76, 256, 84, 81, 159, 255, 74; chart annotations: "18 times" and "3 times"]

[Figure (Exp3): two panels, "the number of nodes on uniform distribution" and "the number of nodes on Zipf distribution"; x-axis: alphabet size (0 to 256); y-axis: total number of nodes; series: VMA, YY, upper bound, lower bound]

We showed that we can reduce the total number of nodes in comparison to YY code by using the VMA tree.
We also showed that we can reduce the total number of nodes and the compression times considerably in experiments.
We found an upper bound and a lower bound on the number of nodes in the VMA tree.

Applying this idea to STVF coding [Kida, DCC2009]
Finding tighter bounds