Project Type | Bachelor Project |

Project Name | Efficient code for (de)compression |

Department of Project | Chair for Algorithms and Data Structures |

Chair Professor | Prof. Dr. Hannah Bast |

Duration | 21.10.2015 - 23.1.2016 |

Supervisor | Björn Buchhold |

Participant | Zhiwei Zhang |

Department | Computer Science |

Univesity | University of Freiburg |

In this project I implemented the Elias-Gamma algorithm, the Elias-Delta algorithm, the Golomb algorithm, the Variable-Bytes algorithm and the mainly optimized decompression function of the Simple8b algorithm and at the same time compared them to find out the "best appropriate algorithm(s)" for the search engine Broccoli.

Our semantic search engine Broccoli and its successor (under development) is closely related to compressed lists a lot. Current implementations have high speeds but are still far from optimal and have never been compared with other available compression schemes. Goal of the project is to tune the current (C++) implementation for speed, to evaluate the literature for promising alternatives and to compare them.

Write ${\color{blue}\lfloor\log_{2}{x}\rfloor}$ zeros, then ${\color{blue}x}$ in binary

like:- Prefix-free, because the number of initial zeros tells us exactly how many bits of the code come afterwards.
Code for ${\color{blue}x}$ has a length of exactly ${\color{blue}2\cdot\lfloor\log_{2}{x}\rfloor+1}$ bits.

$\\{\color{blue}1}\rightarrow{\color{green}1}\\{\color{blue}2}\rightarrow{\color{red}0}{\color{green}10}\\{\color{blue}3}\rightarrow{\color{red}0}{\color{green}11}\\{\color{blue}4}\rightarrow{\color{red}00}{\color{green}100}\\\vdots\\{\color{blue}10}\rightarrow{\color{red}000}{\color{green}1010}$

Write ${\color{blue}\lfloor\log_{2}{x}\rfloor+1}$ in Elias-Gamma, followed by ${\color{blue}x}$ in binary but

**without**the leading 1

like:Elias-Delta is also prefix-free and the length of the code length is ${\color{blue}\lfloor\log_{2}{x}\rfloor+2\log_{2}{\log_{2}{x}}+O(1)}$ bits.

$\\{\color{blue}1}\rightarrow{\color{green}1}\\{\color{blue}2}\rightarrow{\color{red}010}{\color{green}0}\\{\color{blue}3}\rightarrow{\color{red}010}{\color{green}1}\\{\color{blue}4}\rightarrow{\color{red}011}{\color{green}00}\\\vdots\\{\color{blue}10}\rightarrow{\color{red}00100}{\color{green}010}$

Comes with an integer parameter ${\color{blue}M}$, called

**modulus**.Write ${\color{blue}x}$ as ${\color{blue}q\cdot M+r}$, where ${\color{blue}q=\frac{x}{M}}$ and ${\color{blue}r=x \mod M}$

The code for ${\color{blue}x} is then the concatenation of:

${\color{blue}q}$ written in unary with ${\color{blue}0}$s.

a single 1 (as a delimiter).

${\color{blue}r}$ written in binary.

like:

$\\M=16\\{\color{blue}1}\rightarrow{\color{blue}\underline{1}}{\color{green}0001}\\{\color{blue}70}\rightarrow{\color{red}0000}{\color{blue}\underline{1}}{\color{green}0100}$

Idea: use

**whole bytes**, in order to avoid the (expensive) bit fiddling needed for the previous schemes.Use one bit of each byte to indicate whether this is the last byte in the current code or not.

like:

$\\x=521={\color{red}4}\cdot128+{\color{red}9}\\\text{VB-code for x: }{\color{green}\underline{1}}{\color{red}0000100}\:{\color{green}\underline{0}}{\color{red}0001001}$

- Each code is unsigned 64 bits in binary form.
- and contains two parts:
- 4 bits selector which explains how many words this code contains.
- and 60 bits content.

There are 16 different selectors.

Selector value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |

Item width | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 12 | 15 | 20 | 30 | 60 |

Group size | 240 | 120 | 60 | 30 | 20 | 15 | 12 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |

Wasted bits | 60 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |

- We use the selector-0 express 240 continuous 0s and selector-0 express 120 continuous 0s. like:

$\\\text{data list: }888\,{\color{blue}56}\,{\color{green}1}\,{\color{red}0}\,{\color{cyan}0}\,{\color{magenta}0}\\\text{Simple8b-Code: }\\ 1101111000\,{\color{blue}0000111000}\,{\color{green}0000000001}\,{\color{red}0000000000}\,{\color{cyan}0000000000}\,{\color{magenta}0000000000}\,\underline{1010}$

There are several known compression algorithms such as Elias-Gamma, Elias-Delta, Golomb and Variable-Bytes. And there is also a new compression algorithm Simple8b. My supervisor has already implemented the Simple8b algorithm, and what I need to do was to find a way to optimize the implementation and to achieve a shorter compress time as well as shorter decompress time.

First of all I wrote three kinds of functions:

- the first kind of function could produce a list of given quantity of numbers between given minimize number and maximize number. In this list all the numbers conform to uniform distribution.
- the second kind of function could also produce a list of given quantity of numbers. But in this list all the numbers conform to geometric distribution.
- the third kind of function just combines the first two functions to simulate the real data.

In our case the real data contain a lot of continuous zeros.

Then I implemented all the algorithms except Simple8b in two different ways in C++:

- one way is to use pointer as the list type.
- and another way is to use vector as the list type.

So with the comparison we could directly observe how much more efficient is the code using pointer than the code using vector.

Moreover I modified the compression function and decompression function.

For the decompress function I modified it following the ideas, suc as:

- Cut off the if-branches in for-loop.
- Announce the same data-type variable together.
- Avoid repeating of same variable annoucement.
- Make the judgement in if-branches more efficient.

After the modifying I produced three versions of the decompresssion function.

- Cut off all the if-branches, decompress the elements in encode list one by one using the same decompress form. File: Simple8bCode.h, Class: Simple8bCode
- Change the decompress form to reduce the unnecessary calculations. File: Simple8bCodeNew.h, Class: Simple8bCodeNew1
- Use 2 for-loop to realize the role of if-branches in the old decompression function. File: Simple8bCodeNew.h, Class: Simple8bCodeNew2

For the compression function which provided by my supervisor I have changed it in several aspects and produced only one version of Simple8b algorithm.

- Instead of "continue" after encoding 120 continuous 0s update the variable "itemLeft" and directly check the selectors from selector 2. "
- Combine one if-branches with one while-loop and move the "asser"-check behind, so that the times of doing the "asser"-check are reduced.
- Change the judgement of checking whether we have compressed as many as possible plaintext in a word of encode list to be more efficient. File: Simple8bCode.h, Class: Simple8bCode All the variants above are using pointer.
- Adding an if-loop to make sure that the quantity of works in a code is exactly the same number which the selector of this code represents. Using the verified decompression function. This variant uses vector. File: Simple8bCodeVector.h, Class: Simple8bCodeVector

Afterwards I run the tests of simulation on all the algorithms. There is a difference between Elias-Gamma, Elias-Delta and other algorithms. Without the possibility of compressing zero in Elias-Gamma and Elias-Delta I can only run the non-zero data using the two algorithms. Since the running time of Elias-Gamma and Elias-Delta is very long and our real data contain a lot of zeros, Elias-Gamma and Elias-Delta are not important to us. Due to the long running time, Golomb algorithm is also not important.

At last I tested only the Variable-Bytes and all the versions of Simple8b algorithms on the real data.

I tested all the algorithms in 3 aspects: **compress time**, **decompress time**, **used bytes**.

In the case of uniform distribution and the combination of uniform distribution and geometric distribution the compress time of Variable-Bytes algorithm is the shortest. In this case the Golomb algorithm has used least bytes. Significantly, the used bytes of Variable-Bytes algorithm are less than the Simple8b algorithm.

The decompress time of Simple8b algorithm is the shortest in all the cases.

In the case of geometric distribution the Simple8b algorithm has the same compress time with Variable-Bytes, which is the shortest in all the algorithm except Elias-Gamma and Elias-Delta. In this case the used bytes are the least.

I only tested the **Variable-Bytes algorithm and all the versions of Simple8b algorithm** in the same 4 aspects as in simulation.

There are six different data types: **docids.ent**, **docides.noent**, **scores.ent**, **scores.noent**, **wordids.ent**, **wordids.noent**.

In data types docids.ent, docides.noent, scores.ent, wordids.ent and wordids.noent the Variable-Bytes algorithm has the shortest compress time.

The Simple8b algorithm with the optimal decompress function has the shortest decompress time in data types docids.ent, docides.noent, wordids.ent and wordids.noent. And the Simple8b algorithm uses less bytes than Variable-Bytes algorithm in data types docids.ent, docides.noent, scores.ent, scores.noent and wordids.noent.

The optimal compression function in Simple8b algorithm (Simple8bCode.h) has the shortest compress time in the scores.noent data type. Significantly, the optimal Simple8bCode has always the shorter compression time and decomression time than the old Simple8bCode from Björn.

In the average case in all data types Variable-Bytes is more efficient than Simple8b in the aspect of compress time, and optimal Simple8b is more efficient than Variable-Bytes in the aspect of decompress time. Moreover the optimal Simple8bCode(Simple8bCode.h) has the shortest compression time and shortest decompression time among all the variants of Simple8b algorithm.

Description | Data |
---|---|

The Elias-Delta algorithm using pointer. | EliasDeltaEncodingPointer.h |

The Elias-Delta algorithm using vector. | EliasDeltaEncodingVector.h |

The Elias-Gamma algorithm using pointer. | EliasGammaEncodingPointer.h |

The Elias-Gamma algorithm using vector. | EliasGammaEncodingVector.h |

The Golomb algorithm using pointer. | GolombEncodingPointer.h |

The Golomb algorithm using vector. | GolombEncodingVector.h |

The one variant of the Simple8b algorithm using pointer. | Simple8bCode.h |

The one variant of the Simple8b algorithm using pointer and containing several object methods. | Simple8bCodeClass.h |

The other six variants of the Simple8b algorithm using pointer. | Simple8bCodeNew.h |

The version of the Simple8b algorithm using pointer from Björn Buchhold. | Simple8bCodeOld.h |

The Simple8b algorithm using vector. | Simple8bCodeVector.h |

The Variable-Bytes algorithm using pointer. | VariableByteEncodingPointer.h |

The Variable-Bytes algorithm using vectorß. | VariableByteEncodingVector.h |

All the files of the project. | code.zip |

Description | Data |
---|---|

All the test results of the project. | result.zip |

Article | Origin | Author |
---|---|---|

Index compression using 64-bit words | Softw. Pract. Exper. 2010; 40:131–147 | Vo Ngoc Anh and Alistair Moffat |

lecture-04.pdf of Information Retrieval 2015/2016 | 11-14 | Prof. Dr. Hannah Bast |