Summary: This article explains the limitations faced by the Arithmetic coder and
the Range coder on 32-bit platforms, and how using 64-bit variables can
significantly improve compression results. With 64-bit processors becoming
increasingly popular, it makes sense to move on and use the available
processing power.
Effects of Precision on Entropy Coding
Conceptually, both Arithmetic coding and Range coding are best described
using floating point mathematics. But when it comes to actually implementing
them for practical use, it turns out that this is best accomplished using
standard integer math. No floating point math is required, nor would it help
to use it. What is used instead is an incremental transmission scheme,
where fixed-size integer state variables receive new bits at the low end
and shift finished bits out the high end, together forming a single number
that can be as many bits long as the output medium allows. (Read Mark
Nelson's article on arithmetic coding for a detailed introduction to the
incremental transmission scheme; the same background applies to range coding too.)
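To make the idea concrete, here is a minimal sketch of the renormalization step of a CACM-style arithmetic encoder. The names are illustrative and the actual bit output is elided to comments; this is not the code accompanying this article.

```cpp
// Sketch of incremental transmission in a CACM-style arithmetic encoder.
// Settled bits leave `low`/`high` at the top; fresh bits are shifted in
// at the bottom, so only a fixed-width window of the number is kept.
void Renormalize(unsigned &low, unsigned &high, int &pendingBits)
{
    const unsigned HALF    = 0x80000000u;  // top bit of a 32-bit state
    const unsigned QUARTER = 0x40000000u;
    for (;;)
    {
        if (high < HALF)
        {
            // Top bit has settled to 0: emit a 0 (plus pendingBits 1s).
        }
        else if (low >= HALF)
        {
            // Top bit has settled to 1: emit a 1 (plus pendingBits 0s).
            low -= HALF; high -= HALF;
        }
        else if (low >= QUARTER && high < HALF + QUARTER)
        {
            // Underflow: defer the bit and discard the second-highest bit.
            pendingBits++; low -= QUARTER; high -= QUARTER;
        }
        else
            break;              // no more settled bits
        low  = 2 * low;         // shift a fresh 0 in at the low end of low
        high = 2 * high + 1;    // and a fresh 1 at the low end of high
    }
}
```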
The size of the integer variables has two independent impacts:
- It directly affects the accuracy of calculations during entropy encoding.
- It affects the accuracy with which the model can keep track of probabilities.
The limitations faced on 32-bit platforms can be overcome using 64-bit
arithmetic. Most current compilers and systems support 64-bit arithmetic
(implemented as unsigned long long or __int64), and 64-bit processors are
also becoming increasingly popular.
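For instance, a portable 64-bit type can be selected along these lines (the typedef name is mine, for illustration):

```cpp
// Pick the compiler's 64-bit unsigned type; MSVC spells it __int64,
// most other compilers provide unsigned long long.
#if defined(_MSC_VER)
typedef unsigned __int64 CodeValue;
#else
typedef unsigned long long CodeValue;
#endif
```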
Here I evaluate both of these impacts. The results are compared on the
Calgary Corpus and the
Large Canterbury Corpus,
using an order-0 model with 4 entropy coder implementations:
- 32-bit arithmetic coder
- 64-bit arithmetic coder
- 32-bit range coder
- 64-bit range coder
Details of the implementations used are described where needed. The source
code is also available at the bottom of this page.
Effect of Precision on Accuracy of Calculations in Entropy Coding
The size of the integer variables determines the number of bits
participating in the calculations at any given time. This size greatly
affects the precision of the calculations and hence the end result. Reduced
accuracy with 32-bit variables, as expected, leads to more bits than
necessary being generated by the encoder, thus adversely affecting the
compression ratio achieved. Accuracy can be improved by using 64-bit variables.
The results are compared on the Calgary Corpus and the Large Canterbury
Corpus, using an order-0 model with the 4 entropy coder implementations.
Besides comparing the precision of the 32-bit and 64-bit implementations,
this also provides a comparison of the Range coder against the Arithmetic coder.
The results on the Calgary Corpus (all sizes in bytes; compressed sizes are with the order-zero model):

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| BIB | 111261 | 72791 | 72791 | 72815 | 72813 |
| BOOK1 | 768771 | 436914 | 436911 | 437063 | 437000 |
| BOOK2 | 610856 | 364778 | 364776 | 364896 | 364852 |
| GEO | 102400 | 72405 | 72404 | 72426 | 72424 |
| NEWS | 377109 | 244492 | 244492 | 244571 | 244544 |
| OBJ1 | 21504 | 16039 | 16039 | 16045 | 16048 |
| OBJ2 | 246814 | 187307 | 187304 | 187357 | 187340 |
| PAPER1 | 53161 | 33130 | 33130 | 33144 | 33144 |
| PAPER2 | 82199 | 47541 | 47541 | 47557 | 47561 |
| PAPER3 | 46526 | 27392 | 27391 | 27401 | 27404 |
| PAPER4 | 13286 | 7999 | 7999 | 8004 | 8007 |
| PAPER5 | 11954 | 7561 | 7560 | 7565 | 7569 |
| PAPER6 | 38105 | 23839 | 23839 | 23848 | 23850 |
| PIC | 513216 | 75060 | 75056 | 75100 | 75079 |
| PROGC | 39611 | 25923 | 25923 | 25932 | 25932 |
| PROGL | 71646 | 42616 | 42616 | 42633 | 42628 |
| PROGP | 49379 | 30208 | 30208 | 30219 | 30221 |
| TRANS | 93695 | 64341 | 64341 | 64364 | 64361 |
The results on the Large Canterbury Corpus:

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| E.coli | 4638690 | 1176867 | 1176867 | 1177665 | 1177304 |
| world192.txt | 2473400 | 1545446 | 1545440 | 1545915 | 1545733 |
| bible.txt | 4047392 | 2202010 | 2202002 | 2202798 | 2202477 |
Note: In each of the 4 encoders, the frequency counts in the model were
halved whenever the sum of frequencies reached 0x3FFF. This is done
because, as will be discussed below, the maximum range of frequencies that
the model can keep is limited by the precision of the integer variables used.
Here the 32-bit Arithmetic Coder has the most restrictive "maximum range",
so that range is used for all encoders.
As the gains achieved by the 64-bit implementations are very small on the
Calgary corpus files, and considering that the test files are small in size,
it is worth mentioning here that the 64-bit encoders have to flush 8 bytes
at the end, whereas the 32-bit encoders only flush 4 bytes. This also
explains the slight inconsistency in the results where using a 64-bit
implementation increased the compressed size.
Conclusion: Using 64-bit implementations does improve the precision of
calculations in the entropy coder, but the gains from this are not
significant for small data sets. On larger files, a definite improvement
can be seen. Also, the effect of using a 64-bit implementation is more
pronounced on the range coder.
Effect of Entropy Coder Precision on Model
The size of the integer variables also determines the maximum precision
with which the model can represent the probabilities. For entropy coding,
once the character probabilities are known, the individual symbols need to
be assigned a range along a "probability line". The 'maximum range' that can
be used determines the accuracy with which we can represent the
probabilities. This 'maximum range' depends upon the size of the integer
variables used for entropy coding, in both the arithmetic and the range coder.
This is a severe limitation that forces us to use approximations: the
frequency counts are rescaled (usually halved) whenever they approach the
limit. As probability estimation is then no longer accurate, the
number of bits generated is more than optimal.
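A minimal sketch of such a rescaling step for an order-0 byte model is shown below; the names and the 256-symbol alphabet are assumptions for illustration, not the model code from the download.

```cpp
// Hypothetical rescaling step for an order-0 model over 256 symbols:
// when the running total reaches the coder's limit, halve every count,
// rounding up so no symbol ends with a zero (impossible) probability.
void RescaleIfNeeded(unsigned freq[256], unsigned &total, unsigned maxTotal)
{
    if (total < maxTotal)
        return;
    total = 0;
    for (int i = 0; i < 256; ++i)
    {
        freq[i] = (freq[i] + 1) / 2;  // halve, keeping each count >= 1
        total  += freq[i];
    }
}
```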
In adaptive models, if the nature of the source data changes, then this
halving of frequency counts can improve compression, because in
such cases it effectively translates into "reduce the weight of statistics
gathered so far and give more importance to new statistics". However, when
the data is statistically uniform, this halving results in reduced precision
and thus less than optimal compression.
Arithmetic Coder: The major problem with the CACM arithmetic coder is
arithmetic overflow in the steps that calculate the new high and low.
If the machine supports w-bit arithmetic (until recently, most
systems supported at most 32-bit arithmetic), then, to avoid overflow, the
number of bits used to store cumulative frequency counts (f) is restricted
by w: it is required that w >= 2f + 1. Thus, with w = 32, f
cannot exceed 15, and as a result the sum of frequency counts in any given
context cannot be more than 32767.
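The sketch below shows where the overflow arises in the interval-update step. The variable names are mine, not from the accompanying source; it is a hedged illustration of the constraint, not the coder itself.

```cpp
typedef unsigned long CodeValue;  // 32 bits on the platforms discussed here

// One encoding step of a CACM-style coder. `cumLow`/`cumHigh` are the
// symbol's cumulative frequency bounds and `total` the sum of all counts.
void EncodeSymbol(CodeValue &low, CodeValue &high,
                  CodeValue cumLow, CodeValue cumHigh, CodeValue total)
{
    CodeValue range = high - low + 1;
    // The products below need (bits of range) + (bits of counts) bits to
    // be computed exactly; fitting them in a w-bit word is what restricts
    // the count precision f (the article's w >= 2f + 1 condition).
    high = low + (range * cumHigh) / total - 1;
    low  = low + (range * cumLow)  / total;
}
```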
If we use 64-bit integers, then with w = 64 we can have f = 31. The sum of
frequency counts can then be as much as 2147483647, a value big enough for
most practical purposes. As we no longer have to rescale the frequencies as
often, we get more accurate probability estimation. This reduces the loss
and, as we will soon see, gives significantly better results.
Range Coder: The problem with the range coder is roughly the same as with
the arithmetic coder, but things are slightly better here.
The range coder normalizes itself 1 byte, i.e. 8 bits, at a time. On
a w-bit machine, the range coder defines two constants,
TOP = 1 << (w-8) and BOTTOM = 1 << (w-16). To avoid
overflow, the sum of frequencies of symbols in any given context must stay
below BOTTOM. Thus, the number of bits used to store the cumulative
frequency counts (f) should not be greater than w-16. With w = 32, f cannot
exceed 16, and as a result the sum of frequency counts cannot be more than
65535. This is still a severe enough restriction, for the same reason as
above (it forces us to use approximations).
This again can be fixed using 64-bit arithmetic. With w = 64, we can
have f as large as 48. The sum of frequency counts can then be as
much as 281474976710655, which is more than enough.
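Written out for both word sizes, the constants look roughly like this (a sketch; Subbotin's code defines only the 32-bit pair, and the names with width suffixes are mine):

```cpp
typedef unsigned int       U32;
typedef unsigned long long U64;

// Normalization constants for a byte-at-a-time range coder.
const U32 TOP32    = 1u   << (32 - 8);   // 0x01000000
const U32 BOTTOM32 = 1u   << (32 - 16);  // 0x00010000
const U64 TOP64    = 1ull << (64 - 8);   // 2^56
const U64 BOTTOM64 = 1ull << (64 - 16);  // 2^48

// The total frequency count in a context must stay below BOTTOM, giving
// the limits 0xFFFF (32-bit) and 0xFFFFFFFFFFFF (64-bit) used below.
```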
Results
This time too, the order-zero model is used with each of the 4 encoders. In
each case, the frequency counts are halved whenever the total count reaches
the limit for that encoder:
- 32-bit Arithmetic Coder: 0x3FFF
- 64-bit Arithmetic Coder: 0x3FFFFFFF
- 32-bit Range Coder: 0xFFFF
- 64-bit Range Coder: 0xFFFFFFFFFFFF
The results on the Calgary Corpus:

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| BIB | 111261 | 72791 | 72602 | 72662 | 72622 |
| BOOK1 | 768771 | 436914 | 435399 | 435904 | 435501 |
| BOOK2 | 610856 | 364778 | 366284 | 365517 | 366366 |
| GEO | 102400 | 72405 | 72442 | 72474 | 72462 |
| NEWS | 377109 | 244492 | 244940 | 244836 | 244996 |
| OBJ1 | 21504 | 16039 | 16122 | 16128 | 16131 |
| OBJ2 | 246814 | 187307 | 193337 | 191771 | 193376 |
| PAPER1 | 53161 | 33130 | 33353 | 33370 | 33369 |
| PAPER2 | 82199 | 47541 | 47543 | 47545 | 47560 |
| PAPER3 | 46526 | 27392 | 27373 | 27388 | 27385 |
| PAPER4 | 13286 | 7999 | 7999 | 8004 | 8007 |
| PAPER5 | 11954 | 7561 | 7560 | 7565 | 7569 |
| PAPER6 | 38105 | 23839 | 24088 | 24100 | 24100 |
| PIC | 513216 | 75060 | 77962 | 76907 | 77981 |
| PROGC | 39611 | 25923 | 25968 | 25980 | 25979 |
| PROGL | 71646 | 42616 | 42977 | 43000 | 42990 |
| PROGP | 49379 | 30208 | 30292 | 30308 | 30305 |
| TRANS | 93695 | 64341 | 65055 | 64947 | 65076 |
The results on the Large Canterbury Corpus:

| File Name | Original Size | 32-bit Arithmetic Coder | 64-bit Arithmetic Coder | 32-bit Range Coder | 64-bit Range Coder |
|-----------|---------------|-------------------------|-------------------------|--------------------|--------------------|
| E.coli | 4638690 | 1176867 | 1160066 | 1165708 | 1160511 |
| world192.txt | 2473400 | 1545446 | 1545742 | 1543913 | 1546045 |
| bible.txt | 4047392 | 2202010 | 2197536 | 2196442 | 2197987 |
It has already been discussed that in adaptive models, if the nature of the
source data changes, the halving of frequency counts can improve
compression, because in such cases it effectively translates into "reducing
the weight of statistics gathered so far and giving more importance to new
statistics". However, when the data is statistically uniform, this halving
results in reduced precision and thus less than optimal compression.
This observation is confirmed by the results obtained. For statistically
uniform files, a marked improvement in compression is observed with the
64-bit entropy coders; E.coli is the clearest example, with the 64-bit
coders saving roughly 16 KB over the 32-bit arithmetic coder.
Conclusion: By using 64-bit encoders, our choice of when and how often to
rescale the frequency counts in the model can depend on the data source,
instead of being dictated by the restrictions of the entropy coder used.
Summary
Here is a summary of what we have concluded from the results:
- 64-bit implementations do improve the precision of calculations in the
entropy coder, but the gains from this are not significant for small data
sets. On larger files, a definite improvement can be seen.
- The effect of using a 64-bit implementation is more pronounced on the
range coder.
- For statistically uniform files, a marked improvement in compression
ratio is observed with 64-bit entropy coders.
- By using 64-bit encoders, our choice of when and how often to rescale
the frequency counts in the model can depend on the data source, instead
of being dictated by the restrictions of the entropy coder used.
- With 64-bit processors becoming increasingly popular, it makes sense to
move on and use the available processing power, as it will not add any
processing-time overhead.
Source Code
Here is the source code for the 64-bit Arithmetic coder and the 64-bit
range coder used for this article. The archive includes C++ classes for
the entropy coders and code for the order-0 model that was used for testing
the coders and gathering results. The archive also includes reference 32-bit
implementations of both the arithmetic coder and the range coder.
You can browse the source tree here.
Download Source Code (22 KB)
References
For arithmetic coding, I have used Mark Nelson's code, which was published
with his 1991 DDJ article. Other interesting papers on arithmetic coding
include the 1987 CACM paper and Alistair Moffat's 1998 paper,
Arithmetic Coding Revisited.
For the range coder, I have used Dmitry Subbotin's carry-less implementation
found in Dmitry Shkarin's fast PPMd code. Range coding was introduced by
Martin in his 1979 paper.