# Sachin Garg

Programming is like composing poetry, its like making music...

# 64-bit Range Coding and Arithmetic Coding

Summary: This article explains the limitations faced by Arithmetic coder and Range coder on 32-bit platforms and how using 64-bit variables can improve the compression results significantly. With 64-bit processors becoming increasingly popular, it makes sense to move on and use the available processing power.

#### Effects of Precision on Entropy Coding

Conceptually both Arithmetic coding and Range coding are best described using floating point mathematics. But when it comes to actually implementing them for practical use, it turns out that this is best accomplished using standard integer math. No floating point math is required, nor would it help to use it. What is used instead is an incremental transmission scheme, where fixed size integer state variables receive new bits in at the low end and shift them out the high end, forming a single number that can be as many bits long as are available on the computer's storage medium. (Read Mark Nelson's article on arithmetic coding for detailed introduction to incremental transmission scheme, the same background applies to range coding too.)

The size of integer variables has two independent impacts:

1. It directly effects the accuracy of calculations during entropy encoding.
2. It effects the accuracy with which the model can keep track of probabilities

The limitation faced on 32-bit platforms can be solved using 64-bit arithmetic. Most current compilers and systems allow use of 64-bit arithmetic (implemented as unsigned long long or __int64) and 64-bit processors are also becoming increasingly popular.

Here I will be trying to evaluate the effect of both these impacts. The results are compared on Calgary Corpus and Large Canterbury Corpus, using order-0 model with 4 entropy coder implementations:

• 32-bit arithmetic coder
• 64-bit arithmetic coder
• 32-bit range coder
• 64-bit range coder

Details of the implementations used are described where needed. The source code is also available at the bottom of this page.

#### Effect of Precision on Accuracy of Calculations in Entropy Coding

The size of the integer variables determines the number of bits which are participating in the calculations at a given time. The size greatly effects the precision/accuracy of calculations and effects the end result. Reduced accuracy with 32-bit variables, as expected, leads to more than necessary bits being generated by the encoder, thus adversely effecting the compression ratio achieved. Accuracy can be improved by using 64-bit variables.

The results are compared on Calgary Corpus and Large Canterbury Corpus, using order-0 model with 4 entropy coder implementations. Besides providing a comparison of precision of 32-bit vs 64-bit implementations, this also provides a comparison of Range coder vs Arithmetic coder.

 File Name Original Size Size after Compression with Order-Zero Model 32-bit Arithmetic Coder 64-bit Arithmetic Coder 32-bit Range Coder 64-bit Range Coder BIB 111261 72791 72791 72815 72813 BOOK1 768771 436914 436911 437063 437000 BOOK2 610856 364778 364776 364896 364852 GEO 102400 72405 72404 72426 72424 NEWS 377109 244492 244492 244571 244544 OBJ1 21504 16039 16039 16045 16048 OBJ2 246814 187307 187304 187357 187340 PAPER1 53161 33130 33130 33144 33144 PAPER2 82199 47541 47541 47557 47561 PAPER3 46526 27392 27391 27401 27404 PAPER4 13286 7999 7999 8004 8007 PAPER5 11954 7561 7560 7565 7569 PAPER6 38105 23839 23839 23848 23850 PIC 513216 75060 75056 75100 75079 PROGC 39611 25923 25923 25932 25932 PROGL 71646 42616 42616 42633 42628 PROGP 49379 30208 30208 30219 30221 TRANS 93695 64341 64341 64364 64361

The results on Large Canterbury Corpus:

 File Name Original Size Size after Compression with Order-Zero Model 32-bit Arithmetic Coder 64-bit Arithmetic Coder 32-bit Range Coder 64-bit Range Coder E.coli 4638690 1176867 1176867 1177665 1177304 world192.txt 2473400 1545446 1545440 1545915 1545733 bible.txt 4047392 2202010 2202002 2202798 2202477

Note: In each of the 4 encoders, the frequency counts in model were rescaled to half when Sum of frequencies reached 0x3FFF. This is done because, as will be discussed soon, the maximum range of frequencies which the model can keep, is limited by the precision of integer variables used. Here 32-bit Arithmetic Coder has the most restrictive "maximum range", hence that range is used for all encoders.

As the gains achieved by using 64-bit implementations are very small on Calgary corpus files and considering that the test files are small in size, it is worth mentioning here that 64-bit encoders have to flush 8 bytes at end, whereas the 32-bit encoders have to flush on 4 bytes. The also explains the slight inconsistency in results where using 64-bit implementation increased the compressed size.

Conclusion: It can be concluded that using 64-bit implementations do improve the precision/accuracy of calculations in entropy coder, but the gains due to this are not significant enough for small data sets. On larger files, a definite improvement can be seen. Also, effect of using 64-bit implementation is more profound on range coder.

#### Effect of Entropy Coder Precision on Model

The size of the integer variables also determines the maximum precision with which the model can represent the probabilities. For entropy coding, once the character probabilities are known, the individual symbols need to be assigned a range along a "probability line". The 'maximum range' that can be used determines the accuracy with which we can represent the probabilities. This 'maximum range' depends upon the size of the integer variables used for entropy coding, in both arithmetic and range coder.

This is a sever limitation that forces us to use approximations, frequency counts are rescaled (usually halved) whenever they are near the limit. As now probability estimation is not accurate, the number of bits generated is more than optimal.

In adaptive models, if the nature of source data changes, then this halving of frequency counts can improve compression. Because in such cases it effectively translates into "reduce weight of statistics gathered till now and give more importance to new statistics". However, when data is statistically uniform then this halving should result in reduced precision and thus less and optimal compression.

Arithmetic Coder: Major problem with the CACM arithmetic coder is the arithmetic overflow in steps involving calculations of new high and low.

If the machine supports w-bit arithmetic. (Until recently, most systems supported maximum 32-bit arithmetic). Then, to avoid overflow, the no of bits used to store cumulative frequency counts ( f ) is restricted by w. It is required that w >= 2f + 1. Thus, (as w = 32) f cannot exceed 15, and as a result, sum of frequency counts in any given context can not be more than 32767.

If we use 64-bit integers, as w = 64, we can have f = 31. Thus sum of frequency counts can be as much as 2147483648, which is a value big enough for most practical purposes. As now don't have to rescale the frequencies, we have more accurate probability estimation. This reduces the loss and, as we will soon see, gives significantly better results.

Range Coder: The problem with range coder is roughly the same as it is with arithmetic coder, but things are slightly better here.

Range coder normalizes itself 1 byte at a time, i.e. 8-bits at a time. On a w-bit machine, range coder defines two constants TOP=1<<(w-8) and BOTTOM=1<<(TOP-8). To avoid overflow, the sum of frequencies of symbols in any given context should not exceed the value of BOTTOM. Thus, the no. of bits used to store the cumulative frequency counts (f) should not be greater than w-16. With w=32, f cannot exceed 16, and as a result, sum of frequency counts cannot be more than 65535. This is still a sever enough restriction for same reasons as above (it forces us to use approximations).

This again can be fixed using 64-bit arithmetic. With w=64, we can have f as large as 48. Thus sum of frequency counts can now be as much as 281474976710655 which is cool enough.

#### Results

This time also, order-zero model is used with each of the 4 encoders. In each case, the frequency counts are halved whenever total count reached the limit for that encoder.

• 32-bit Arithmetic Coder : 0x3FFF
• 64-bit Arithmetic Coder : 0x3FFF FFFF
• 32-bit Range Coder : 0xFFFF
• 64-bit Range Coder : 0xFFFF FFFF FFFF

The results on Calgary Corpus:

 File Name Original Size Size after Compression with Order-Zero Model 32-bit Arithmetic Coder 64-bit Arithmetic Coder 32-bit Range Coder 64-bit Range Coder BIB 111261 72791 72602 72662 72622 BOOK1 768771 436914 435399 435904 435501 BOOK2 610856 364778 366284 365517 366366 GEO 102400 72405 72442 72474 72462 NEWS 377109 244492 244940 244836 244996 OBJ1 21504 16039 16122 16128 16131 OBJ2 246814 187307 193337 191771 193376 PAPER1 53161 33130 33353 33370 33369 PAPER2 82199 47541 47543 47545 47560 PAPER3 46526 27392 27373 27388 27385 PAPER4 13286 7999 7999 8004 8007 PAPER5 11954 7561 7560 7565 7569 PAPER6 38105 23839 24088 24100 24100 PIC 513216 75060 77962 76907 77981 PROGC 39611 25923 25968 25980 25979 PROGL 71646 42616 42977 43000 42990 PROGP 49379 30208 30292 30308 30305 TRANS 93695 64341 65055 64947 65076

The results on Large Canterbury Corpus:

 File Name Original Size Size after Compression with Order-Zero Model 32-bit Arithmetic Coder 64-bit Arithmetic Coder 32-bit Range Coder 64-bit Range Coder E.coli 4638690 1176867 1160066 1165708 1160511 world192.txt 2473400 1545446 1545742 1543913 1546045 bible.txt 4047392 2202010 2197536 2196442 2197987

It has already been discussed that in adaptive models, if the nature of source data changes, then the halving of frequency counts can improve compression. Because in such cases it effectively translates into "reducing weight of statistics gathered till then and giving more importance to new statistics". However, when data is statistically uniform then this halving results in reduced precision and this less and optimal compression.

This observation is confirmed by the results obtained. For statistically uniform files, a marked improvement in compression ratio is observed with 64-bit entropy coders. Such files are marked in bold.

Conclusion: By using 64-bit encoders, our choice of when and how often to rescale the frequency counts in model can now depend on the data source, instead of being dictated upon by the restrictions of the entropy coder used.

#### Summary

Here is a summary of what all we have concluded from all the results.

• 64-bit implementations do improve the precision/accuracy of calculations in entropy coder, but the gains due to this are not significant enough for small data sets. On larger files, a definite improvement can be seen.
• Effect of using 64-bit implementation is more profound on range coder.
• For statistically uniform files, a marked improvement in compression ratio is observed with 64-bit entropy coders.
• By using 64-bit encoders, our choice of when and how often to rescale the frequency counts in model can now depend on the data source, instead of being dictated upon by the restrictions of the entropy coder used.
• With 64-bit processors becoming increasingly popular, it makes sense to move on and use the available processing power, as it will not result in any overhead in processing time.

#### Source Code

Here is the source code for the 64-bit Arithmetic coder and the 64-bit range coder used for this article. The archive includes C++ classes for the entropy coders and code for order-0 model which was used for testing the coders and gathering results. The archive also includes reference 32-bit  implementations of both arithmetic coder and range coder. You can browse the source tree here.

#### References

For arithmetic coding, I have used Mark Nelson's code which was published with his 1991 DDJ article. Other interesting papers on arithmetic coding include the 1987 CACM paper and Alistair Moffat's 1998 paper, Arithmetic Coding Revisited.

For range coder, I have used Dmitry Subbotin's carry-less implementation found in Dmitry Shkarin's fast PPMd code. Range coding was introduced by Martin in his 1979 paper.