Due to both the digitisation of most communication channels and an
ever-increasing demand for mobile communication services, the amount
of traffic generated by digitised speech signals continues to grow
rapidly. To accommodate this increased traffic load using the finite
bandwidth available for speech communication channels, it is necessary
to develop speech compression algorithms that can dynamically scale to
traffic and user demands. These scalable compression algorithms must
be capable of dynamically altering the bit rate required for
transmission, whilst smoothly and gradually varying the subjective
quality of the synthesized speech with the changes in bit rate. To further
increase the throughput of the communication channel, the scalable
algorithm should operate in the lower range of bit rates currently
used for speech compression (i.e. 2-8kbps).
We propose a number of scalable speech coding techniques that lead to
the development of a single coding algorithm that is capable of
scalable operation. Firstly, via a thorough review of current
literature, the characteristics of existing speech compression
algorithms that limit scalable operation between bit rates of 2 and
8kbps are identified. The major limiting characteristics are found to
be the existence of a distinct barrier at 4kbps, below which parametric
coders dominate and above which waveform coders dominate, and the
large delay requirements of current low rate coding algorithms.
A method that exploits the simultaneous masking property of the human
ear in a linear predictive filter is proposed. The proposed method
modifies the linear predictive filter to remove more of the
perceptually important information from the input signal than a
standard linear predictive filter. This characteristic is shown to
improve the subjective speech quality of low-rate linear prediction
based speech coders.
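As background for the filter discussed above, the following is a minimal sketch of a standard linear predictive analysis (autocorrelation method with the Levinson-Durbin recursion), i.e. the baseline that the proposed method modifies. It does not implement the masking-based modification itself. The function names `lpc_autocorrelation` and `lp_residual`, and the use of NumPy, are assumptions made for illustration only.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Standard LP analysis (illustrative baseline, not the proposed method).

    Returns (a, err) where a[0] = 1 and the prediction error filter is
    A(z) = sum_j a[j] z^-j, and err is the final prediction error energy.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    if err <= 0.0:
        # Silent frame: nothing to predict
        return a, 0.0
    for i in range(1, order + 1):
        # Reflection coefficient for model order i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lp_residual(frame, a):
    """Filter the frame with A(z) to obtain the prediction residual."""
    x = np.asarray(frame, dtype=float)
    return np.convolve(x, a)[:len(x)]
```

A low-rate coder built on this baseline would quantise the coefficients `a` and model the residual; the proposed method changes what the filter removes from the input, which this sketch does not attempt to reproduce.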
To enable the pitch cycle redundancies of the speech signal to be
exploited in the coding algorithm, without introducing excessive
algorithmic delay, a novel low delay method for segmenting the speech
into non-overlapped pitch length subframes is proposed. This method
requires only a single frame of speech and locates the pitch pulses by
selecting the pulse locations in a closed loop function. The proposed
segmentation is shown to produce a much more accurate pitch track in
transient sections of the speech signal than the pitch track produced
by traditional autocorrelation based pitch detectors. Also, as the
pitch length subframes are not overlapped, the segmentation supports
closed loop analysis by synthesis modeling of the signal. A number of
low delay decomposition techniques are proposed which decompose the
speech into perceptually different components and allow scalable
reconstruction of the speech signal. The preferred technique performs
the decomposition in a closed loop function, which allows quantisation
errors to be accounted for in the decomposition process.
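For comparison with the pitch segmentation described above, the traditional autocorrelation based pitch detector that serves as the baseline can be sketched roughly as follows. This is only an illustration of the conventional open loop approach, not of the proposed closed loop segmentation. The function name `autocorr_pitch`, the 60-400 Hz search range and the 8 kHz sampling rate in the usage note are assumptions.

```python
import numpy as np

def autocorr_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Open loop pitch estimate (illustrative baseline).

    Returns (lag, score): the lag in samples with the largest normalised
    autocorrelation inside the search range, or (None, 0.0) if the frame
    is silent or too short for the range.
    """
    x = np.asarray(frame, dtype=float)
    x = x - np.mean(x)
    lag_min = max(1, int(fs / f_max))
    lag_max = min(int(fs / f_min), len(x) - 1)
    energy = np.dot(x, x)
    if energy == 0.0 or lag_max < lag_min:
        return None, 0.0
    best_lag, best_score = None, -np.inf
    for lag in range(lag_min, lag_max + 1):
        # Correlation between the frame and itself delayed by `lag`
        score = np.dot(x[:-lag], x[lag:]) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

For example, calling `autocorr_pitch(frame, 8000)` on a 30ms (240 sample) frame returns a single lag for the whole frame. Because the estimate is averaged over the frame in this way, such detectors tend to track poorly where the pitch changes quickly, which is the situation the proposed segmentation targets.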
The proposed scalable techniques are combined to produce a scalable
algorithm that operates at a range of bit rates from 2-8kbps. The
proposed algorithm produces synthesized speech whose subjective
quality varies in a perceptually meaningful manner as the operating
bit rate is varied. A key feature of the proposed algorithm is the
ability to transition from a time asynchronous parametric coder at low
rates to a time synchronous waveform coder at higher bit rates. The
coder also requires only a single frame of algorithmic delay (30ms)
for operation. The subjective test results presented indicate that the
scalable coder produces speech quality that is comparable with that
achieved by fixed rate standardized coders at each of the tested bit
rates.