Referring to the milestone paper "Gradient-Based Learning Applied to Document Recognition" by LeCun, Bottou, Bengio, and Haffner: which parameters of the graph transformer network (GTN) are tuned by backpropagation during global training?
The part of the paper covering backpropagation through the GTN is mainly on pages 22 and 24.
I have the feeling that only the weights of the convnet inside the recognition transformer are actually tuned. It looks to me as though the interpretation path, the path selector, the Viterbi transformer, and the loss function are entirely determined by the outputs of the recognition transformer, and have no weights of their own that are iteratively trained.
Does backpropagation through these upper modules only produce a penalty value that is applied to the recognition transformer's output? In other words, are the recognition transformer's weights updated simply by propagating the penalty computed in the upper layers back through its output units?
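To make the question concrete, here is a minimal sketch of what I mean (in PyTorch, not the paper's actual architecture; the names `Recognizer`, `viterbi_penalty`, and the fixed `transition` penalties are my own toy stand-ins). The Viterbi-style module below has no trainable parameters, yet gradients still flow through its `min()` operations into the convnet's weights:

```python
import torch
import torch.nn as nn

# Toy "recognition transformer": a small convnet that assigns a penalty
# (lower = better) to each of C candidate characters at each of T positions.
class Recognizer(nn.Module):
    def __init__(self, C=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, 3, padding=1)
        self.fc = nn.Linear(8 * 28 * 28, C)

    def forward(self, x):             # x: (T, 1, 28, 28), one image per segment
        h = torch.relu(self.conv(x))
        return self.fc(h.flatten(1))  # (T, C) per-segment penalties

# Parameter-free "Viterbi transformer": dynamic programming over a chain with
# fixed (untrained) transition penalties. It has no weights of its own, but
# min() is differentiable almost everywhere, so gradients are routed along
# the selected path back into the recognizer.
def viterbi_penalty(penalties, transition):
    alpha = penalties[0]                                             # (C,)
    for t in range(1, penalties.shape[0]):
        alpha = penalties[t] + (alpha[:, None] + transition).min(dim=0).values
    return alpha.min()                               # scalar best-path penalty

recognizer = Recognizer()
x = torch.randn(5, 1, 28, 28)                   # 5 candidate segments
transition = torch.zeros(10, 10)                # fixed, never trained
loss = viterbi_penalty(recognizer(x), transition)
loss.backward()                                 # gradients land on conv/fc only
print(recognizer.conv.weight.grad is not None)  # True: the convnet is trained
```

If this toy picture is right, the only trainable parameters live inside `Recognizer`, and the downstream transformers merely route gradients back to it. Is that the correct reading of how global training works in the paper?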