In Daniel Jurafsky and James H. Martin's book Speech and Language Processing (2nd edition), page 317, the concept of the Word Insertion Penalty is introduced, but I find the explanation very confusing. The original description is:
(on average) the language model probability decreases (causing a larger penalty), the decoder will prefer fewer, longer words. If the language model probability increases (larger penalty), the decoder will prefer more shorter words.
Note that "larger penalty" appears twice, under two opposite conditions: when the language model probability decreases and when it increases. This is not logical, and I guess it's a typo in the book.
My understanding is:
If the language model probability doesn't carry enough weight (importance), the decoder will prefer more, shorter words, since the product of the language model probabilities of all the words would have little influence compared to the acoustic probabilities. The language model scaling factor reduces the weight of the language model, so the Word Insertion Penalty has to be introduced to avoid too many insertions.
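
To make the trade-off concrete, here is a minimal sketch, assuming the decoder picks the hypothesis that maximizes log P(O|W) + LMSF * log P(W) + N * log(WIP), where N is the number of words. The acoustic scores and word probabilities are made-up numbers, and `combined_log_score` and `preferred` are hypothetical helper names, not part of any toolkit.

```python
import math

def combined_log_score(acoustic_logprob, word_probs, lmsf=1.0, wip=1.0):
    """Assumed form: log P(O|W) + LMSF * sum(log P(w_i)) + N * log(WIP)."""
    n_words = len(word_probs)
    lm_logprob = sum(math.log(p) for p in word_probs)
    return acoustic_logprob + lmsf * lm_logprob + n_words * math.log(wip)

def preferred(word_prob, lmsf=1.0, wip=1.0):
    """Return which toy hypothesis scores higher: 'few' (2 words) or 'many' (4 words)."""
    # Hypothetical acoustic log-likelihoods: the 4-word segmentation fits slightly better.
    few = combined_log_score(-50.0, [word_prob] * 2, lmsf, wip)
    many = combined_log_score(-45.0, [word_prob] * 4, lmsf, wip)
    return "few" if few > many else "many"

print(preferred(0.01))          # low LM probability per word  -> few
print(preferred(0.2))           # high LM probability per word -> many
print(preferred(0.2, wip=0.1))  # same LM, WIP below 1         -> few
```

With these toy numbers, a low per-word probability favors the two-word hypothesis, a high one favors the four-word hypothesis, and a WIP below 1 pushes the choice back toward fewer words. That would be consistent with reading the second "larger penalty" in the quote as "smaller penalty". The `lmsf` parameter is left at 1.0 here and can be varied to see how scaling the language model term shifts the balance.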