Summary: This article explains the limitations faced by the Arithmetic coder and
the Range coder on 32-bit platforms, and how using 64-bit variables can
significantly improve compression results. With 64-bit processors becoming
increasingly popular, it makes sense to move on and use the available
processing power.
Effects of Precision on Entropy Coding
Conceptually, both Arithmetic coding and Range coding are best described
using floating point mathematics. But when it comes to actually implementing
them for practical use, it turns out that this is best accomplished using
standard integer math. No floating point math is required, nor would it help
to use it. What is used instead is an incremental transmission scheme,
where fixed-size integer state variables receive new bits at the low end
and shift finished bits out the high end, together forming a single number
that can be as many bits long as the output medium allows. (Read Mark
Nelson's article on arithmetic coding for a detailed introduction to the
incremental transmission scheme; the same background applies to range coding too.)
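To make the idea concrete, here is a minimal sketch of the renormalization step of a CACM-style arithmetic encoder. The names are illustrative and the actual bit output is elided to comments; this is not the code accompanying this article.

```cpp
// Sketch of incremental transmission in a CACM-style arithmetic encoder.
// Settled bits leave `low`/`high` at the top; fresh bits are shifted in
// at the bottom, so only a fixed-width window of the number is kept.
void Renormalize(unsigned &low, unsigned &high, int &pendingBits)
{
    const unsigned HALF    = 0x80000000u;  // top bit of a 32-bit state
    const unsigned QUARTER = 0x40000000u;
    for (;;)
    {
        if (high < HALF)
        {
            // Top bit has settled to 0: emit a 0 (plus pendingBits 1s).
        }
        else if (low >= HALF)
        {
            // Top bit has settled to 1: emit a 1 (plus pendingBits 0s).
            low -= HALF; high -= HALF;
        }
        else if (low >= QUARTER && high < HALF + QUARTER)
        {
            // Underflow: defer the bit and discard the second-highest bit.
            pendingBits++; low -= QUARTER; high -= QUARTER;
        }
        else
            break;              // no more settled bits
        low  = 2 * low;         // shift a fresh 0 in at the low end of low
        high = 2 * high + 1;    // and a fresh 1 at the low end of high
    }
}
```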
The size of the integer variables has two independent impacts:
- It directly affects the accuracy of calculations during entropy encoding.
- It affects the accuracy with which the model can keep track of probabilities.
The limitations faced on 32-bit platforms can be overcome using 64-bit
arithmetic. Most current compilers and systems support 64-bit arithmetic
(implemented as unsigned long long or __int64), and 64-bit processors are
also becoming increasingly popular.
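For instance, a portable 64-bit type can be selected along these lines (the typedef name is mine, for illustration):

```cpp
// Pick the compiler's 64-bit unsigned type; MSVC spells it __int64,
// most other compilers provide unsigned long long.
#if defined(_MSC_VER)
typedef unsigned __int64 CodeValue;
#else
typedef unsigned long long CodeValue;
#endif
```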
Here I evaluate both of these impacts. The results are compared on the
Calgary Corpus and the
Large Canterbury Corpus,
using an order-0 model with 4 entropy coder implementations:
- 32-bit arithmetic coder
- 64-bit arithmetic coder
- 32-bit range coder
- 64-bit range coder
Details of the implementations used are described where needed. The source
code is also available at the bottom of this page.
Effect of Precision on Accuracy of Calculations in Entropy Coding
The size of the integer variables determines the number of bits
participating in the calculations at any given time. This size greatly
affects the precision of the calculations and hence the end result. Reduced
accuracy with 32-bit variables, as expected, leads to more bits than
necessary being generated by the encoder, thus adversely affecting the
compression ratio achieved. Accuracy can be improved by using 64-bit variables.
The results are compared on the Calgary Corpus and the Large Canterbury
Corpus, using an order-0 model with the 4 entropy coder implementations.
Besides comparing the precision of the 32-bit and 64-bit implementations,
this also provides a comparison of the Range coder against the Arithmetic coder.
The results on the Calgary Corpus (all sizes in bytes; compressed sizes are with the order-zero model):

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| BIB | 111261 | 72791 | 72791 | 72815 | 72813 |
| BOOK1 | 768771 | 436914 | 436911 | 437063 | 437000 |
| BOOK2 | 610856 | 364778 | 364776 | 364896 | 364852 |
| GEO | 102400 | 72405 | 72404 | 72426 | 72424 |
| NEWS | 377109 | 244492 | 244492 | 244571 | 244544 |
| OBJ1 | 21504 | 16039 | 16039 | 16045 | 16048 |
| OBJ2 | 246814 | 187307 | 187304 | 187357 | 187340 |
| PAPER1 | 53161 | 33130 | 33130 | 33144 | 33144 |
| PAPER2 | 82199 | 47541 | 47541 | 47557 | 47561 |
| PAPER3 | 46526 | 27392 | 27391 | 27401 | 27404 |
| PAPER4 | 13286 | 7999 | 7999 | 8004 | 8007 |
| PAPER5 | 11954 | 7561 | 7560 | 7565 | 7569 |
| PAPER6 | 38105 | 23839 | 23839 | 23848 | 23850 |
| PIC | 513216 | 75060 | 75056 | 75100 | 75079 |
| PROGC | 39611 | 25923 | 25923 | 25932 | 25932 |
| PROGL | 71646 | 42616 | 42616 | 42633 | 42628 |
| PROGP | 49379 | 30208 | 30208 | 30219 | 30221 |
| TRANS | 93695 | 64341 | 64341 | 64364 | 64361 |
The results on the Large Canterbury Corpus:

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| E.coli | 4638690 | 1176867 | 1176867 | 1177665 | 1177304 |
| world192.txt | 2473400 | 1545446 | 1545440 | 1545915 | 1545733 |
| bible.txt | 4047392 | 2202010 | 2202002 | 2202798 | 2202477 |
Note: In each of the 4 encoders, the frequency counts in the model were
halved whenever the sum of frequencies reached 0x3FFF. This is done
because, as will be discussed below, the maximum range of frequencies that
the model can keep is limited by the precision of the integer variables used.
Here the 32-bit Arithmetic Coder has the most restrictive "maximum range",
so that range is used for all encoders.
As the gains achieved by the 64-bit implementations are very small on the
Calgary corpus files, and considering that the test files are small in size,
it is worth mentioning here that the 64-bit encoders have to flush 8 bytes
at the end, whereas the 32-bit encoders only flush 4 bytes. This also
explains the slight inconsistency in the results where using a 64-bit
implementation increased the compressed size.
Conclusion: Using 64-bit implementations does improve the precision of
calculations in the entropy coder, but the gains from this are not
significant for small data sets. On larger files, a definite improvement
can be seen. Also, the effect of using a 64-bit implementation is more
pronounced on the range coder.
Effect of Entropy Coder Precision on Model
The size of the integer variables also determines the maximum precision
with which the model can represent the probabilities. For entropy coding,
once the character probabilities are known, the individual symbols need to
be assigned a range along a "probability line". The 'maximum range' that can
be used determines the accuracy with which we can represent the
probabilities. This 'maximum range' depends upon the size of the integer
variables used for entropy coding, in both the arithmetic and the range coder.
This is a severe limitation that forces us to use approximations: the
frequency counts are rescaled (usually halved) whenever they approach the
limit. As probability estimation is then no longer accurate, the
number of bits generated is more than optimal.
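A minimal sketch of such a rescaling step for an order-0 byte model is shown below; the names and the 256-symbol alphabet are assumptions for illustration, not the model code from the download.

```cpp
// Hypothetical rescaling step for an order-0 model over 256 symbols:
// when the running total reaches the coder's limit, halve every count,
// rounding up so no symbol ends with a zero (impossible) probability.
void RescaleIfNeeded(unsigned freq[256], unsigned &total, unsigned maxTotal)
{
    if (total < maxTotal)
        return;
    total = 0;
    for (int i = 0; i < 256; ++i)
    {
        freq[i] = (freq[i] + 1) / 2;  // halve, keeping each count >= 1
        total  += freq[i];
    }
}
```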
In adaptive models, if the nature of the source data changes, then this
halving of frequency counts can improve compression, because in
such cases it effectively translates into "reduce the weight of statistics
gathered so far and give more importance to new statistics". However, when
the data is statistically uniform, this halving results in reduced precision
and thus less than optimal compression.
Arithmetic Coder: The major problem with the CACM arithmetic coder is
arithmetic overflow in the steps that calculate the new high and low.
If the machine supports w-bit arithmetic (until recently, most
systems supported at most 32-bit arithmetic), then, to avoid overflow, the
number of bits used to store cumulative frequency counts (f) is restricted
by w: it is required that w >= 2f + 1. Thus, with w = 32, f
cannot exceed 15, and as a result the sum of frequency counts in any given
context cannot be more than 32767.
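The sketch below shows where the overflow arises in the interval-update step. The variable names are mine, not from the accompanying source; it is a hedged illustration of the constraint, not the coder itself.

```cpp
typedef unsigned long CodeValue;  // 32 bits on the platforms discussed here

// One encoding step of a CACM-style coder. `cumLow`/`cumHigh` are the
// symbol's cumulative frequency bounds and `total` the sum of all counts.
void EncodeSymbol(CodeValue &low, CodeValue &high,
                  CodeValue cumLow, CodeValue cumHigh, CodeValue total)
{
    CodeValue range = high - low + 1;
    // The products below need (bits of range) + (bits of counts) bits to
    // be computed exactly; fitting them in a w-bit word is what restricts
    // the count precision f (the article's w >= 2f + 1 condition).
    high = low + (range * cumHigh) / total - 1;
    low  = low + (range * cumLow)  / total;
}
```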
If we use 64-bit integers, then with w = 64 we can have f = 31. The sum of
frequency counts can then be as much as 2147483647, a value big enough for
most practical purposes. As we no longer have to rescale the frequencies as
often, we get more accurate probability estimation. This reduces the loss
and, as we will soon see, gives significantly better results.
Range Coder: The problem with the range coder is roughly the same as with
the arithmetic coder, but things are slightly better here.
The range coder normalizes itself 1 byte, i.e. 8 bits, at a time. On
a w-bit machine, the range coder defines two constants,
TOP = 1 << (w-8) and BOTTOM = 1 << (w-16). To avoid
overflow, the sum of frequencies of symbols in any given context must stay
below BOTTOM. Thus, the number of bits used to store the cumulative
frequency counts (f) should not be greater than w-16. With w = 32, f cannot
exceed 16, and as a result the sum of frequency counts cannot be more than
65535. This is still a severe enough restriction, for the same reason as
above (it forces us to use approximations).
This again can be fixed using 64-bit arithmetic. With w = 64, we can
have f as large as 48. The sum of frequency counts can then be as
much as 281474976710655, which is more than enough.
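Written out for both word sizes, the constants look roughly like this (a sketch; Subbotin's code defines only the 32-bit pair, and the names with width suffixes are mine):

```cpp
typedef unsigned int       U32;
typedef unsigned long long U64;

// Normalization constants for a byte-at-a-time range coder.
const U32 TOP32    = 1u   << (32 - 8);   // 0x01000000
const U32 BOTTOM32 = 1u   << (32 - 16);  // 0x00010000
const U64 TOP64    = 1ull << (64 - 8);   // 2^56
const U64 BOTTOM64 = 1ull << (64 - 16);  // 2^48

// The total frequency count in a context must stay below BOTTOM, giving
// the limits 0xFFFF (32-bit) and 0xFFFFFFFFFFFF (64-bit) used below.
```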
Results
This time too, the order-zero model is used with each of the 4 encoders. In
each case, the frequency counts are halved whenever the total count reaches
the limit for that encoder:
- 32-bit Arithmetic Coder: 0x3FFF
- 64-bit Arithmetic Coder: 0x3FFFFFFF
- 32-bit Range Coder: 0xFFFF
- 64-bit Range Coder: 0xFFFFFFFFFFFF
The results on the Calgary Corpus:

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| BIB | 111261 | 72791 | 72602 | 72662 | 72622 |
| BOOK1 | 768771 | 436914 | 435399 | 435904 | 435501 |
| BOOK2 | 610856 | 364778 | 366284 | 365517 | 366366 |
| GEO | 102400 | 72405 | 72442 | 72474 | 72462 |
| NEWS | 377109 | 244492 | 244940 | 244836 | 244996 |
| OBJ1 | 21504 | 16039 | 16122 | 16128 | 16131 |
| OBJ2 | 246814 | 187307 | 193337 | 191771 | 193376 |
| PAPER1 | 53161 | 33130 | 33353 | 33370 | 33369 |
| PAPER2 | 82199 | 47541 | 47543 | 47545 | 47560 |
| PAPER3 | 46526 | 27392 | 27373 | 27388 | 27385 |
| PAPER4 | 13286 | 7999 | 7999 | 8004 | 8007 |
| PAPER5 | 11954 | 7561 | 7560 | 7565 | 7569 |
| PAPER6 | 38105 | 23839 | 24088 | 24100 | 24100 |
| PIC | 513216 | 75060 | 77962 | 76907 | 77981 |
| PROGC | 39611 | 25923 | 25968 | 25980 | 25979 |
| PROGL | 71646 | 42616 | 42977 | 43000 | 42990 |
| PROGP | 49379 | 30208 | 30292 | 30308 | 30305 |
| TRANS | 93695 | 64341 | 65055 | 64947 | 65076 |
The results on the Large Canterbury Corpus:

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| E.coli | 4638690 | 1176867 | 1160066 | 1165708 | 1160511 |
| world192.txt | 2473400 | 1545446 | 1545742 | 1543913 | 1546045 |
| bible.txt | 4047392 | 2202010 | 2197536 | 2196442 | 2197987 |
It has already been discussed that in adaptive models, if the nature of the
source data changes, the halving of frequency counts can improve
compression, because in such cases it effectively translates into "reducing
the weight of statistics gathered so far and giving more importance to new
statistics". However, when the data is statistically uniform, this halving
results in reduced precision and thus less than optimal compression.
This observation is confirmed by the results obtained. For statistically
uniform files, a marked improvement in compression is observed with the
64-bit entropy coders; E.coli is the clearest example, with the 64-bit
coders saving roughly 16 KB over the 32-bit arithmetic coder.
Conclusion: By using 64-bit encoders, our choice of when and how often to
rescale the frequency counts in the model can depend on the data source,
instead of being dictated by the restrictions of the entropy coder used.
Summary
Here is a summary of what we have concluded from the results:
- 64-bit implementations do improve the precision of calculations in the
entropy coder, but the gains from this are not significant for small data
sets. On larger files, a definite improvement can be seen.
- The effect of using a 64-bit implementation is more pronounced on the
range coder.
- For statistically uniform files, a marked improvement in compression
ratio is observed with 64-bit entropy coders.
- By using 64-bit encoders, our choice of when and how often to rescale
the frequency counts in the model can depend on the data source, instead
of being dictated by the restrictions of the entropy coder used.
- With 64-bit processors becoming increasingly popular, it makes sense to
move on and use the available processing power, as it will not add any
processing-time overhead.
Source Code
Here is the source code for the 64-bit Arithmetic coder and the 64-bit
range coder used for this article. The archive includes C++ classes for
the entropy coders and code for the order-0 model that was used for testing
the coders and gathering results. The archive also includes reference 32-bit
implementations of both the arithmetic coder and the range coder.
You can browse the source tree here.
Download Source Code (22 KB)
References
For arithmetic coding, I have used Mark Nelson's code, which was published
with his 1991 DDJ article. Other interesting papers on arithmetic coding
include the 1987 CACM paper and Alistair Moffat's 1998 paper,
Arithmetic Coding Revisited.
For the range coder, I have used Dmitry Subbotin's carry-less implementation
found in Dmitry Shkarin's fast PPMd code. Range coding was introduced by
Martin in his 1979 paper.