Your question is answered in the original paper:
A convolution step without padding always produces feature maps that are smaller than its input (and this holds for the first layer, whose input is the image itself, as well):
Layer C1 is a convolutional layer with 6 feature maps.
Each unit in each feature map is connected to a 5x5 neighborhood in the input. The size of the feature maps is 28x28
which prevents connection from the input from falling off the boundary.
This means that with a 5x5 neighborhood on a 32x32 input, you get 6 feature maps of size 28x28: a 5x5 window only fits entirely inside the image at 32 - 5 + 1 = 28 positions along each dimension, so the output loses 2 pixels on every side of the image boundary.
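The size arithmetic above can be checked with a small sketch. The helper name `valid_conv_output_size` is made up for illustration, and it assumes no padding and a stride of 1 unless stated otherwise, which matches the C1 description quoted from the paper:

```python
def valid_conv_output_size(input_size: int, kernel_size: int, stride: int = 1) -> int:
    """Count the positions where the kernel fits entirely inside the input.

    Assumes no padding: the window is never allowed to hang over the border.
    """
    if kernel_size > input_size:
        raise ValueError("kernel must not be larger than the input")
    return (input_size - kernel_size) // stride + 1


# LeNet-5 layer C1: 5x5 neighborhood on a 32x32 input
c1_size = valid_conv_output_size(32, 5)
print(c1_size)  # 28
```

With the same helper, a 28x28 input would give 24x24 feature maps, which is why feeding the 28x28 field directly would shrink C1.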
Of course, they could have made an exception for the first layer. The reason they still use 32x32 images is:
The input is a 32x32 pixel image. This is significantly larger
than the largest character in the database (at most 20x20
pixels centered in a 28x28 field). The reason is that it is
desirable that potential distinctive features such as stroke
end-points or corner can appear in the center of the receptive field of the highest-level feature detectors.
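In practice, if your digits come in the 28x28 field mentioned in the quote, one way to obtain a 32x32 input is to pad 2 pixels on each side. The following is only a sketch: the function name `pad_to_lenet_input` is hypothetical, and using 0 as the background value is an assumption that depends on how your images are normalized.

```python
import numpy as np


def pad_to_lenet_input(image, target=32, background=0):
    """Center a 2D image inside a target x target frame by padding the border.

    `background` is the fill value for the new border pixels (assumed 0 here).
    """
    image = np.asarray(image)
    height, width = image.shape
    pad_h = target - height
    pad_w = target - width
    if pad_h < 0 or pad_w < 0:
        raise ValueError("image is larger than the target size")
    top, left = pad_h // 2, pad_w // 2
    return np.pad(
        image,
        ((top, pad_h - top), (left, pad_w - left)),
        mode="constant",
        constant_values=background,
    )


digit = np.ones((28, 28), dtype=np.float32)  # stand-in for a 28x28 digit image
padded = pad_to_lenet_input(digit)
print(padded.shape)  # (32, 32)
```

The original 28x28 content ends up in rows and columns 2 through 29 of the padded image, so strokes near the edge of the field are no longer at the image boundary.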