1,494 Hits in 7.6 sec

Rank and run-time aware compression of NLP Applications [article]

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina
2020 arXiv   pre-print
Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints.  ...  Sequence model based NLP applications can be large.  ...  schemes provides the best run-time at highest compression factor. measuring the run-time of an application.  ... 
arXiv:2010.03193v1 fatcat:hwftkhffl5b4tnigr5ugfla4ty

Rank and run-time aware compression of NLP Applications

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina
2020 Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing   unpublished
Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints.  ...  Sequence model based NLP applications can be large.  ...  schemes provides the best run-time at highest compression factor. measuring the run-time of an application.  ... 
doi:10.18653/v1/2020.sustainlp-1.2 fatcat:avtfr34sdbep3nlnvgylpnytaq

DRONE: Data-aware Low-rank Compression for Large NLP Models

Patrick H. Chen, Hsiang-Fu Yu, Inderjit S. Dhillon, Cho-Jui Hsieh
2021 Neural Information Processing Systems  
Based on this observation, we propose DRONE (data-aware low-rank compression), a provably optimal low-rank decomposition of weight matrices, which has a simple closed form solution that can be efficiently  ...  Specifically, most operations in BERT consist of matrix multiplications. These matrices are not low-rank and thus canonical matrix decompositions do not lead to efficient approximations.  ...  By leveraging the data distribution idea, we propose DRONE (data-aware low-rank compression).  ... 
dblp:conf/nips/ChenYDH21 fatcat:cpk3ksi2wza2xmo2fqd6pqyfr4
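The DRONE snippet contrasts canonical (weight-only) matrix decomposition with data-aware low-rank compression. DRONE's provably optimal closed form is not reproduced here; the sketch below (toy sizes, NumPy only, all variable names hypothetical) only illustrates the underlying idea: fitting a rank-k factorization to the layer *outputs* XW can never do worse on the calibration data than truncating W alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 256, 64, 64, 16  # calibration samples, in/out dims, target rank

# Calibration activations X (anisotropic, like real layer inputs) and weight W.
X = rng.normal(size=(n, d)) * np.linspace(1.0, 0.01, d)
W = rng.normal(size=(d, m))

# Weight-only low-rank: truncated SVD of W itself.
Uw, Sw, Vtw = np.linalg.svd(W, full_matrices=False)
W_k = (Uw[:, :k] * Sw[:k]) @ Vtw[:k]
err_weight = np.linalg.norm(X @ W - X @ W_k)

# Data-aware low-rank: approximate the *outputs* Y = XW instead.
Y = X @ W
Uy, Sy, Vty = np.linalg.svd(Y, full_matrices=False)
# Factor into A (d x k) and B (k x m) so that X @ A @ B equals the
# best rank-k approximation of Y (Eckart-Young), solved by least squares.
B = Vty[:k]
A, *_ = np.linalg.lstsq(X, Uy[:, :k] * Sy[:k], rcond=None)
err_data = np.linalg.norm(Y - X @ A @ B)

print(err_weight, err_data)  # data-aware error is never larger
```

At inference the single matrix multiply by W is replaced by two thin multiplies (by A, then B), which is where the speed and memory savings come from.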

Doping: A technique for Extreme Compression of LSTM Models using Sparse Structured Additive Matrices

Urmish Thakker, Paul N. Whatmough, Zhi Gang Liu, Matthew Mattina, Jesse G. Beu
2021 Conference on Machine Learning and Systems  
Additionally, we show that doped KP can be deployed on commodity hardware using the current software stack and achieve 2.5 − 5.5× inference run-time speed-up over baseline.  ...  like pruning and low-rank methods by a large margin (8% or more).  ...  This is especially significant for NLP applications that are running for long periods of time.  ... 
dblp:conf/mlsys/ThakkerWLMB21 fatcat:xcxa4buiq5hlfedzh337xpxxhy

Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications [article]

Matthew Khoury and Rumen Dangovski and Longwu Ou and Preslav Nakov and Yichen Shen and Li Jing
2020 arXiv   pre-print
However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly at inference time.  ...  Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems.  ...  We hope that this work would bring the novel concept of AI co-design (between software and hardware) to the domain of NLP applications.  ... 
arXiv:2010.08412v1 fatcat:czgov7g46rc47gnun62dokaxyq

EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [article]

Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul N. Whatmough, Alexander M. Rush, David Brooks, Gu-Yeon Wei
2021 arXiv   pre-print
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.  ...  Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system generates up to 7x, 2.5x, and 53x lower energy compared to the conventional inference without early stopping  ...  constraint of the application.  ... 
arXiv:2011.14203v5 fatcat:mqyhxlsll5dy3esl5sfqgp763q

Saec: Similarity-Aware Embedding Compression in Recommendation Systems [article]

Xiaorui Wu, Hong Xu, Honglin Zhang, Huaming Chen, Jian Wang
2019 arXiv   pre-print
Testbed experiments show that Saec reduces the number of embedding vectors by two orders of magnitude, compresses the embedding size by ~27x, and delivers the same AUC and log loss performance.  ...  We propose a similarity-aware embedding matrix compression method called Saec to address this challenge. Saec clusters similar features within a field to reduce the embedding matrix size.  ...  Second, compression methods for embedding matrix [6, 20] are designed for NLP problems. The differences between NLP and recommendation systems are notable.  ... 
arXiv:1903.00103v1 fatcat:e3sbebjgxbajhpsb5oz6q2nwwu
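The Saec snippet describes clustering similar features within a field so that many features share one embedding vector. A minimal sketch of that idea, with a toy embedding table and a hand-rolled k-means (not Saec's actual algorithm, field structure, or sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat, dim, n_clusters = 1000, 8, 10  # toy sizes

emb = rng.normal(size=(n_feat, dim))  # original embedding table

# Tiny k-means: assign each embedding row to a centroid, then update centroids.
centroids = emb[rng.choice(n_feat, n_clusters, replace=False)].copy()
for _ in range(20):
    d2 = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)  # nearest centroid per feature
    for c in range(n_clusters):
        members = emb[assign == c]
        if len(members):
            centroids[c] = members.mean(0)

# Compressed table: n_clusters centroid vectors plus one small index per
# feature, instead of a full vector per feature.
ratio = emb.size / (centroids.size + n_feat)  # float slots vs float+index slots (rough)
print(f"~{ratio:.1f}x fewer embedding slots")
```

A lookup becomes `centroids[assign[i]]`; the two-orders-of-magnitude reduction reported in the abstract comes from the number of distinct vectors dropping from one per feature to one per cluster.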

Pay-Per-Request Deployment of Neural Network Models Using Serverless Architectures

Zhucheng Tu, Mengping Li, Jimmy Lin
2018 Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations  
We demonstrate the serverless deployment of neural networks for model inferencing in NLP applications using Amazon's Lambda service for feedforward evaluation and DynamoDB for storing word embeddings.  ...  We describe a number of techniques that allow efficient use of serverless resources, and evaluations confirm that our design is both scalable and inexpensive.  ...  This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.  ... 
doi:10.18653/v1/n18-5002 dblp:conf/naacl/TuLL18 fatcat:ipm2zinp2nf6bh3n4lykzz7ezy

Deep Learning Meets Projective Clustering [article]

Alaa Maalouf and Harry Lang and Daniela Rus and Dan Feldman
2020 arXiv   pre-print
A common approach for compressing NLP networks is to encode the embedding layer as a matrix A∈ℝ^n× d, compute its rank-j approximation A_j via SVD, and then factor A_j into a pair of matrices that correspond  ...  Geometrically, the rows of A represent points in ℝ^d, and the rows of A_j represent their projections onto the j-dimensional subspace that minimizes the sum of squared distances ("errors") to the points  ...  (ii) classification may take too much time, especially for real-time applications such as NLP tasks: speech recognition, translation or speech-to-text.  ... 
arXiv:2010.04290v1 fatcat:uik6jgmaf5cu5i53srtwi7kbx4
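This snippet spells out the standard recipe the paper builds on: encode the embedding layer as a matrix A, take its rank-j SVD approximation A_j, and factor A_j into two thin matrices. A minimal illustration with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, j = 500, 64, 8  # vocabulary size, embedding dim, target rank

A = rng.normal(size=(n, d))          # embedding layer as a matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_j = (U[:, :j] * S[:j]) @ Vt[:j]    # best rank-j approximation (Eckart-Young)

# Factor A_j into two thin matrices: an (n x j) layer followed by a (j x d) layer.
P = U[:, :j] * S[:j]   # n x j
Q = Vt[:j]             # j x d
assert np.allclose(P @ Q, A_j)

# Parameter count drops from n*d to j*(n + d).
print(n * d, j * (n + d))  # 32000 vs 4512
```

Geometrically, as the snippet says, the rows of A_j are the projections of the rows of A onto the j-dimensional subspace (spanned by the top right singular vectors) that minimizes the sum of squared distances to the points.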

Efficient Methods for Natural Language Processing: A Survey [article]

Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins, André F. T. Martins, Peter Milder (+6 others)
2022 arXiv   pre-print
This survey relates and synthesises methods and findings in those efficiencies in NLP, aiming to guide new researchers in the field and inspire the development of new methods.  ...  Getting the most out of limited resources allows advances in natural language processing (NLP) research and practice while being conservative with resources.  ...  Acknowledgements This work was initiated at and benefitted substantially from the Dagstuhl Seminar 22232: Efficient and Equitable Natural Language Processing in the Age of Deep Learning.  ... 
arXiv:2209.00099v1 fatcat:uys5flozk5gz5ffcqxmwbtdmem

Towards Green AI with tensor networks – Sustainability and innovation enabled by efficient algorithms [article]

Eva Memmel, Clara Menzen, Jetze Schuurmans, Frederiek Wesel, Kim Batselier
2022 arXiv   pre-print
With this paper, we want to raise awareness about Green AI and showcase its positive impact on sustainability and AI research.  ...  Furthermore, we argue that better algorithms should be evaluated in terms of both accuracy and efficiency.  ...  Neural networks (NNs) have been made more efficient with TNs for a variety of application fields, including CV [40, 45, 53] and NLP [35, 57] .  ... 
arXiv:2205.12961v1 fatcat:nilium2wmjhhvpnivejwxvkbzq

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
2022 ACM Transactions on Design Automation of Electronic Systems  
This article provides an overview of efficient deep learning methods, systems, and applications.  ...  Therefore, methods and techniques that are able to lift the efficiency bottleneck while preserving the high accuracy of DNNs are in great demand to enable numerous edge AI applications.  ...  Such long latency will hurt the user experience and make real-time NLP applications impossible on mobile devices. Therefore, efficient NLP techniques are of pressing demand.  ... 
doi:10.1145/3486618 fatcat:h6xwv2slo5eklift2fl24usine

A Survey on Green Deep Learning [article]

Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li
2021 arXiv   pre-print
In recent years, larger and deeper models are springing up and continuously pushing state-of-the-art (SOTA) results across various fields like natural language processing (NLP) and computer vision (CV)  ...  The target is to yield novel results with lightweight and efficient technologies. Many technologies can be used to achieve this goal, like model compression and knowledge distillation.  ...  However, FLOPs are theoretical values, and there is a gap between FLOPs and running time. In addition to the total amount of works (FLOPs), the degree of parallelism also affects the running time.  ... 
arXiv:2111.05193v2 fatcat:t2blz24y2jakteeeawqqogbkpy

Post-Training Quantization for Vision Transformer [article]

Zhenhua Liu, Yunhe Wang, Kai Han, Siwei Ma, Wen Gao
2021 arXiv   pre-print
Recently, transformer has achieved remarkable performance on a variety of computer vision applications.  ...  of each attention map and output feature.  ...  Compression of Transformer in NLP Owing to the remarkable performance of BERT in many NLP tasks, many researchers have tried to compress the model to reduce the memory and computation complexity of BERT  ... 
arXiv:2106.14156v1 fatcat:a7ieqevmyveahfutbtetjhb2jm

Connecting wikis and natural language processing systems

René Witte, Thomas Gitzinger
2007 Proceedings of the 2007 international symposium on Wikis - WikiSym '07  
We provide a number of practical application examples, including index generation, question answering, and automatic summarization, which demonstrate the practicability and usefulness of this idea.  ...  The vision is that of a "self-aware" Wiki system reading, understanding, transforming, and writing its own content, as well as supporting its users in information analysis and content development.  ...  Acknowledgments Ralf Krestel contributed to the automatic summarization NLP pipelines. Thomas Kappler contributed to the NLP-Wiki upload framework.  ... 
doi:10.1145/1296951.1296969 dblp:conf/wikis/WitteG07 fatcat:yq6ts5bedjglje6zlvpmcavxnq
Showing results 1 — 15 out of 1,494 results