This paper proposes a mixed-precision integer quantization method for the layers of a trained dialogue summarization model. Layer sensitivity is evaluated with a more accurate method based on an augmented Hessian matrix, which accounts for inter-layer correlation while effectively preserving intra-layer information. Because the activations of dialogue summarization models contain a large number of outliers, quantizing them directly causes a significant drop in model accuracy. We therefore propose to smooth first and then apply mixed-precision quantization: activation outliers are first migrated to the weights using smoothing factors, and the weights and activations are then quantized with mixed precision according to their sensitivity scores. This method quantizes the model from 16-bit or 32-bit floating point to 4-16 bit mixed precision, which reduces the model's storage requirement and speeds up inference. On the SAMSum benchmark dataset, the method achieves a significant performance improvement over the classical baseline system and performs almost as well as the non-quantized model.
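The smooth-then-quantize idea can be illustrated with a minimal NumPy sketch. The abstract does not give the exact smoothing factor, so the sketch assumes a SmoothQuant-style per-channel factor with a hypothetical balance parameter `alpha`; the function names `smooth` and `fake_quant` are likewise illustrative. The bit width is passed in by hand here, whereas the proposed method would choose it per layer from Hessian-based sensitivity scores, which this sketch does not compute.

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-8):
    """Migrate activation outliers into the weights (assumed SmoothQuant-style factor).

    X: activations of shape (tokens, in_features)
    W: weights of shape (in_features, out_features)
    """
    act_max = np.abs(X).max(axis=0)   # per-input-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-input-channel weight range
    s = (act_max + eps) ** alpha / (w_max + eps) ** (1 - alpha)
    # (X / s) @ (diag(s) W) equals X @ W, so the layer output is unchanged.
    return X / s, W * s[:, None]

def fake_quant(T, bits):
    """Symmetric per-tensor uniform quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(T).max() / qmax
    return np.clip(np.round(T / scale), -qmax, qmax) * scale

# Toy data: one activation channel carries large outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 50.0
W = rng.normal(size=(16, 8))

Xs, Ws = smooth(X, W)
ref = X @ W

# Mean absolute output error with and without smoothing, at the same bit width.
err_direct = np.abs(fake_quant(X, 8) @ fake_quant(W, 8) - ref).mean()
err_smooth = np.abs(fake_quant(Xs, 8) @ fake_quant(Ws, 8) - ref).mean()
print(f"direct: {err_direct:.4f}  smoothed: {err_smooth:.4f}")
```

In this toy setting the single outlier channel inflates the activation quantization step for every other channel, so smoothing before quantization gives a smaller output error at the same bit width. A per-layer `bits` value could be substituted in `fake_quant` to mimic a mixed-precision assignment.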