Speech is analogue: In an analogue system, varying air pressures are captured by a microphone and delivered by an earpiece or loudspeaker, passing the signal between the two ends through an analogue connection. GSM mobile phone systems are digital: they pass data to and fro, so speech has to be encoded at the microphone end and decoded at the speaker end.
This page gives an overview of how the speech is encoded. There are three systems in use: Full rate (FR) (described here), Half Rate (HR), which increases capacity at the expense of audio quality, and Enhanced Full Rate (EFR) which improves sound quality with only a small processing overhead.
The coding system used is called Regular Pulse Excitation Long-Term Prediction (RPE-LTP). Basically, it uses previous samples to predict what the next sounds will be, and uses that as a basis for working out how best to turn it into data.
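The long-term prediction idea can be illustrated with a toy sketch (this is not the actual RPE-LTP algorithm, and the frame size and lag range here are arbitrary): for each new frame, search the recent past for the delay whose samples best match the current frame, so that only the delay, a gain and a small residual need to be sent.

```python
# Toy long-term prediction: find the lag whose delayed copy of the past
# best matches the current frame. Illustrative only, not the GSM codec.

def best_lag(history, frame, min_lag=40, max_lag=120):
    """Return the lag whose delayed history segment best matches `frame`."""
    best, best_err = min_lag, float("inf")
    for lag in range(min_lag, min(max_lag, len(history)) + 1):
        # Samples at time t are predicted from samples at time t - lag.
        segment = history[-lag:len(history) - lag + len(frame)]
        if len(segment) < len(frame):
            continue  # not enough history to cover the whole frame
        err = sum((f - s) ** 2 for f, s in zip(frame, segment))
        if err < best_err:
            best, best_err = lag, err
    return best

# A periodic signal: the best lag should match the signal's period.
period = 50
history = [(i % period) / period for i in range(200)]
frame = [(i % period) / period for i in range(200, 240)]
print(best_lag(history, frame))  # 50
```

With the lag found, only the prediction error needs many bits; the lag itself is just one small number per frame, which is the heart of why prediction saves bandwidth.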
The handset chops the sound into 20 ms frames, which are passed to the encoder, running at 13 kbps. This means each frame comes out as 260 bits of data.
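The numbers above fit together as a simple bit of arithmetic:

```python
# Sanity check on the figures in the text: a 13 kbit/s coder producing
# one frame every 20 ms yields 260 bits per frame.
bit_rate = 13_000          # bits per second
frame_duration = 0.020     # seconds
bits_per_frame = bit_rate * frame_duration
print(int(bits_per_frame))  # 260
```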
It chooses the 50 most important bits and protects them with 3 parity bits for error detection. The next 132 bits are added without parity bits, the whole lot is encoded for error correction, and the least important 78 bits are then added unprotected.
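The bookkeeping can be sketched in a few lines. The 50/3/132/78 split is from the text; the 4 tail bits and the rate-1/2 convolutional code that doubles the protected bits are standard GSM details the text glosses over as "encoded":

```python
# Bit budget for one GSM full-rate frame. The tail bits and rate-1/2
# doubling are assumptions drawn from the standard coding scheme.
class_1a = 50    # most important bits, protected by parity
parity   = 3     # parity bits covering the class 1a bits
class_1b = 132   # next most important bits, no parity of their own
tail     = 4     # tail bits that flush the convolutional coder
class_2  = 78    # least important bits, sent unprotected

coded = (class_1a + parity + class_1b + tail) * 2  # rate-1/2 code doubles them
total = coded + class_2
print(coded, total)  # 378 456
```

This is where the 456-bit block mentioned below comes from: 378 protected bits plus 78 unprotected ones.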
At the other end, if the parity check shows that the 50 most important bits are corrupted, they are discarded, and the 50 from the previous data burst are reused instead. This is what causes the metallic, twangy sound of a poor GSM connection. Better than no sound at all, though!
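The receiver-side substitution is easy to sketch (names here are illustrative, not from the spec): when the parity check fails, the decoder falls back on the last frame that passed.

```python
# Sketch of error concealment by frame substitution: a corrupted frame's
# important bits are replaced with the previous good frame's bits.

def decode_frame(frame_bits, parity_ok, last_good):
    """Return (bits to use, new last-good frame)."""
    if parity_ok:
        return frame_bits, frame_bits   # use and remember this frame
    return last_good, last_good         # corrupted: replay the last good one

last_good = [0] * 50
good = [1] * 50
used, last_good = decode_frame(good, True, last_good)
used, last_good = decode_frame([1, 0] * 25, False, last_good)
print(used == good)  # True: the previous good frame was reused
```

Replaying a stale frame distorts the sound, but far less jarringly than a burst of random bits would.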
This 456-bit block of data, representing 20 ms of sound, is then split up and shared as four pairs of 57-bit sub-blocks across the data bursts. By interleaving the data in this way, a lost burst makes a longer section slightly fuzzy instead of wiping out a shorter section entirely.
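The spreading idea behind those four pairs can be shown with a simple round-robin deal. The real GSM bit-reordering is more involved; this sketch only demonstrates why interleaving softens the damage from a lost burst:

```python
# Deal a 456-bit block round-robin into eight 57-bit sub-blocks, so a
# lost sub-block costs every eighth bit rather than a contiguous chunk.

def interleave(block, ways=8):
    """Split `block` into `ways` sub-blocks, taking every `ways`-th bit."""
    return [block[i::ways] for i in range(ways)]

block = list(range(456))          # stand-in for one 456-bit coded frame
subs = interleave(block)
print(len(subs), len(subs[0]))    # 8 57
# Losing subs[0] removes bits spread evenly across the whole frame:
print(subs[0][:4])                # [0, 8, 16, 24]
```

Evenly scattered losses are exactly what the error-correction coding copes with best, which is why interleaving and coding are designed as a pair.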
The data is encrypted before being sent. See the Encryption page for details of this.
The data is sent over the radio link using a modulation system called Gaussian Minimum Shift Keying (GMSK). This belongs in the air interface as much as it does here, but the mathematics involved are enough to make your head spin, and it isn’t explained further in either place. Sorry.
A similar process runs in reverse at the other end to decode the data and restore the audio.
All this processing and interleaving causes a time delay, and unless measures are taken to prevent it, there can be a problem with echo. Handsets are designed so that they do not pass sound from the earpiece to the microphone, and there are echo-suppressors built into the network, but they can only do so much. If echo is a problem, it is often because the earpiece volume is set too high, or because a phone case is reflecting earpiece sound back to the microphone.