DATA
COMPRESSION STATISTICS AND IMPLICATIONS
By Dr.
Sheila Horan
New
Mexico State University
KEY
WORDS: Lossless data compression, Bandwidth efficiency,
Data compression
Sheila
Horan is a College Associate Professor in the Klipsch School of
Electrical and Computer Engineering, Las Cruces, NM. She has conducted short
courses on Data Compression, and has reviewed texts in the area of Data
compression. She has been working on Bandwidth Efficient techniques for ARTM,
and the CCSDS. Her efforts are currently focused on academic work with ABET
accreditation, freshmen advising, and data compression.
ABSTRACT
Bandwidth
is a precious commodity. In order to make the best use of what is available,
better modulation schemes need to be developed, or less data needs to be sent.
This paper will investigate the option of sending less data via data
compression. The structure and the entropy of the data determine how much
lossless compression can be obtained for a given set of data. This paper shows
the data structure and entropy for several actual telemetry data sets and the
resulting lossless compression obtainable using data compression techniques.
INTRODUCTION
One
of the main topics for discussion in the modulation area of communications
lately seems to be frequency allocation and the protection of spectrum. The
importance of this topic can be seen in the formation of ARTM (the Advanced
Range Telemetry group), which is tasked to "evaluate technology which can
potentially improve the efficiency of aeronautical telemetry utilization. The
technologies which emerge from this project with the greatest potential will be
integrated into new range operational capabilities through the ARTM
program." [1]. One of the areas of concern for ARTM is bandwidth efficient
modulation. Again, in publications like Satellite News [2], the importance of
spectrum and preserving and conserving spectrum is evident. There are two ways
to make the optimum use of the spectra that is available. One way is to change
the modulation of the signals by using higher orders of modulation, or more
efficient modulation techniques. The other way is to reduce the amount of data
to be sent either by taking less data, or by compressing the data.
Entropy
is a measure of the information content of data. The entropy of the data will
specify the amount of lossless data compression that can be achieved. However,
finding the entropy of data sets is non-trivial [3]. Approximations to the
entropy can be obtained by calculating various orders of entropy. The actual
entropy can be approximated as the limit as n approaches infinity of the nth
order entropy [3].
Many
lossless data compression algorithms exist. Some of the main techniques in use
are the Huffman [4], Arithmetic [3], Lempel-Ziv [5], runlength, predictive
coding or variations and combinations of these. Each of these methods can be
found in most data compression texts. The Huffman code is a very efficient code,
that is built using variable length codes where the least probable symbol is
assigned the longest codeword and the most probable symbol is assigned the
smallest codeword. The Arithmetic code is the mapping of several symbols to a
specific region on the number line. To code the sequence, one simply sends a
numerical value from inside the specific region of the number line. Lempel-Ziv
is a dictionary based code which uses information from what has been sent to
code what is being sent. Runlength codes are used to code long runs of single
symbols or long runs of strings of symbols. Predictive coding can take many
forms from simply taking the difference between symbols to modeling the
underlying physical process generating the data. The trick to coding is finding
what works best with the data that needs to be compressed.
TEST
DATA
Thirteen
actual telemetry data files were obtained from the Advanced Range Telemetry (ARTM)
group. The data were then analyzed to find the 4 bit, 8 bit, and 12 bit
entropies. The entropy for a set of data is given by:
(1)
Where
·
H
(S) is the entropy of the source (the data)
·
Xi
are the elements of all possible sequences of the data. The Xi can be
bits, symbols, or words.
·
n
= the length of the sequence
·
P
is the probability
·
m
= the size of the alphabet used for coding; for binary data, m=2, for words of 4
bits in length, m=16, etc.
To
find the 4-bit entropy, all possible combinations of four elements from the
binary alphabet were found. The frequency of occurrence of each of these 4 bit
words was found. This frequency was used as an estimate of the probability of
occurrence for each of these words. From these probabilities, the estimate for
the 4 bit entropy was found using equation 1 where n=4 and m=2. For the 8 bit
entropy, 256 different words are possible, and for the 12 bit entropy, 4096
words are possible. It can be seen that letting n approach infinity will quickly
become impractical. The formats for each data set varied. Some of the data were
coded into words of 10 bits, some 12 bits, etc. A summary of the data statistics
is given in Table 1.
|
Table
1. Data Set Entropies and Word Size |
||||
|
Data
Set |
Data
word size In
bits |
4
bit Entropy |
8
bit Entropy |
12
bit Entropy |
|
SDS001 |
12 |
3.9 |
7.5 |
10.3 |
|
SDS002 |
12 |
3.3 |
6.0 |
7.5 |
|
SDS003 |
12 |
3.8 |
7.0 |
9.0 |
|
SDS004 |
10 |
3.4 |
5.9 |
7.3 |
|
SDS005 |
10 |
3.6 |
6.6 |
8.6 |
|
SDS006 |
10 |
3.7 |
5.9 |
7.6 |
|
SDS007 |
10 |
2.9 |
5.9 |
7.5 |
|
SDS008 |
12 |
2.6 |
4.2 |
4.7 |
|
SDS009 |
12 |
3.2 |
5.8 |
6.9 |
|
SDS010 |
12 |
2.7 |
4.3 |
4.8 |
|
SDS011 |
12 |
2.8 |
4.9 |
5.9 |
|
SDS012 |
16 |
2.5 |
3.7 |
5.2 |
|
SDS013 |
24 |
2.8 |
3.7 |
4.6 |
DATA
COMPRESSION RESULTS
To
determine the amount of compression that should be possible, the entropy per
number of bits must be used. Hence % compression for k bit entropies = (1-(k bit
entropy / k))*100. The amount of compression for each data set can then be
found. If the k bit entropy is equal to the entropy, then the predicted
compression will be a lower bound for all data compression techniques. A plot of
the predicted compression from the calculated entropies is given in Figure 1. It
can be seen that the compression increases as the number of bits per symbol is
increased. This indicates that the entropies do not yet equal the entropy for
the data sets. Consequently, compression techniques should be able to achieve
values better than these predicted values.
The
Huffman, an Adaptive Huffman, Arithmetic, and Lempel-Ziv compression algorithms
were applied using programs from Mark Nelson’s text [6]. The Huffman codes and
Arithmetic codes use an 8-bit word length in the code. Two variations of the
Lempel-Ziv algorithm were used. The LZSS uses a pair of values to indicate the
location of the match and the length of the match in the dictionary. The LZW
algorithm involves only sending one element instead of a pair of elements, and
using a start up alphabet in the dictionary consisting of all the letters of the
source alphabet. The percent compression for each technique along with the
predicted compression is given in Table 2.
|
Table
2. Data Compression Results |
||||||||
|
Data
Set |
Huffman |
Adapt.
Huffman |
LZSS |
LZW |
Arith- metic |
%
4 bit Predicted compress |
%
8 bit Predicted compress |
%
12 bit predicted compress |
|
SDS001 |
6.3 |
6.9 |
15.5 |
-9.9 |
6.6 |
2.5 |
6.3 |
14.2 |
|
SDS002 |
24.1 |
25.5 |
61.1 |
33.8 |
24.7 |
17.5 |
25.0 |
37.5 |
|
SDS003 |
11.5 |
15.5 |
39.7 |
-7.3 |
11.9 |
5.0 |
12.5 |
25.0 |
|
SDS004 |
25.1 |
27.7 |
50.2 |
11.5 |
25.5 |
15.0 |
26.3 |
39.2 |
|
SDS005 |
18.5 |
19.0 |
34.7 |
18.5 |
18.7 |
10.0 |
17.5 |
28.3 |
|
SDS006 |
25.1 |
25.9 |
40.2 |
32.5 |
25.4 |
7.5 |
26.3 |
36.7 |
|
SDS007 |
25.9 |
26.5 |
41.9 |
28.2 |
41.9 |
27.5 |
26.3 |
37.5 |
|
SDS008 |
46 |
48.3 |
71.0 |
53.9 |
46.7 |
35.0 |
47.5 |
60.8 |
|
SDS009 |
27.5 |
29.4 |
64.7 |
26.3 |
27.7 |
20.0 |
27.5 |
42.5 |
|
SDS010 |
45.2 |
47.5 |
68.9 |
52.7 |
45.9 |
32.5 |
46.3 |
60.0 |
|
SDS011 |
37.6 |
39.7 |
64.5 |
46.0 |
38.1 |
30.0 |
38.8 |
50.8 |
|
SDS012 |
51.9 |
51.9 |
65.9 |
63.7 |
51.9 |
37.5 |
53.8 |
56.7 |
|
SDS013 |
48.6 |
53.7 |
65.5 |
64.8 |
51.4 |
30.0 |
53.8 |
61.7 |
A
negative with the compression value indicates that the file was expanded instead
of compressed. It can be observed that each data file compressed differently.
Figure 2 contains the plot of the results of the compression techniques. It can
be seen from Figure 2 that certain files will compress very well. A 60%
compression would mean that the file would take up less than half its original
size. In all cases, there is at least one technique that provides compression of
10% or more. With the demands on spectra, even this little gain can be worth
something. Since the Huffman and Arithmetic codes that were tried work with 8
bit word sizes, if the compression obtained by these techniques is compared with
the 8 bit entropy, we see a very close match. All but 2 of the results are
within 10% or less of the 8 bit entropy. Since the LZ algorithms achieve larger
compression than this indicates, the 8 bit word size is not the best choice and
that this 8 bit entropy is not the entropy. Also, since the k bit entropies
continue to increase with k, this also indicates that the actual entropy has yet
to be found. Hence even larger compressions can be expected.
CONCLUSION
Data
compression is a viable possibility to aid in optimal use of frequency spectra.
Less data transmitted translates into less bandwidth necessary to transmit the
data.
Further
work is necessary. Making use of the data structure of the data sets can result
in the use of prediction techniques, which can provide more compression. Work in
this area is needed. Also, analysis of how data performs in the channel will
also determine its ultimate usefulness. This work will continue by simulating
the data compression in aeronautical channels.
ACKNOWLEDGMENTS
The
author gratefully acknowledges ARTM and the Air Force Office of Scientific
Research (AFOSR) for funding this project. The author also credits James Woods,
an undergraduate at New Mexico State University, for running and developing the
programs used in this paper.
REFERENCES
[1]
Irving, Chuck, "Advanced Range Telemetry (ARTM)
Concept
Exploration", Air Force Flight Test Center Edwards AFB, CA, October 6, 1997
[2]
Satellite News, Vol. 22, Issue 14, April 5, 1999
[3]
Sayood, K., Data Compression, Morgan Kaufman Publishers, 1997
[4]
Huffman, D. A., "A Method for the construction of minimum redundancy
codes". Proceedings IRE, 40, 1951, pages 1098-1101
[5]
Ziv, J. and Lempel, A., "A universal algorithm for data compression". IEEE
Transactions on Information Theory, IT-23(3), May 1977, pages 337-343
[6] Nelson, M., The Data Compression Book, M&T Books, California, 1987.