Abstract:
Numerous approaches have been introduced to automate text summarization, but only a few can be easily adapted to multiple languages. This paper introduces a multilingual text processing pipeline integrated in the open-source ReaderBench framework, which can be retrofitted to cover more than 50 languages. Given the need for extensibility and the lack of labeled training data for languages other than English, an unsupervised algorithm was preferred for extractive summarization (i.e., selecting the most representative sentences from the original document). Specifically, two approaches relying on text cohesion were implemented: a) a graph-based text representation derived from Cohesion Network Analysis that extends TextRank, and b) a class of submodular set functions. Evaluations were performed on the DUC dataset, with the Gensim implementation of TextRank as the baseline. The results obtained with the submodular set functions outperform the baseline. In addition, two use cases in English and Romanian are presented, with corresponding graphical representations for the two methods.
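To make the second approach more concrete, the following is a minimal sketch of greedy extractive selection under a submodular objective. It is an illustration only and not the paper's implementation: the facility-location objective, the function names `coverage` and `greedy_summary`, and the toy similarity matrix are all assumptions made for this example. In practice, the similarity matrix would come from a cohesion or embedding model.

```python
from typing import List

Matrix = List[List[float]]


def coverage(sim: Matrix, selected: List[int]) -> float:
    """Facility-location score (an assumed objective, not the paper's):
    for every sentence, take its best similarity to any selected sentence."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_summary(sim: Matrix, budget: int) -> List[int]:
    """Greedily pick up to `budget` sentence indices by marginal gain."""
    selected: List[int] = []
    candidates = set(range(len(sim)))
    while candidates and len(selected) < budget:
        base = coverage(sim, selected)
        # Iterate in index order so ties resolve to the earliest sentence.
        best = max(
            sorted(candidates),
            key=lambda j: coverage(sim, selected + [j]) - base,
        )
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore document order


# Hypothetical pairwise sentence similarities (symmetric, in [0, 1]).
sim = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.3, 0.1],
    [0.7, 0.3, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
print(greedy_summary(sim, 2))  # [0, 3]
```

In this toy input, sentence 0 is chosen first because it is similar to most other sentences, and sentence 3 second because it covers content that sentence 0 does not. For monotone submodular objectives with non-negative similarities, greedy selection of this kind carries the well-known (1 - 1/e) approximation guarantee.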
Published in: 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)
Date of Conference: 01-04 September 2020
Date Added to IEEE Xplore: 24 February 2021
IEEE Keywords (Index Terms):
- Submodular Function
- Extractive Summarization
- Graphical Representation
- Multiple Languages
- Unsupervised Algorithm
- Text Representation
- Video Summarization
- Parallel Corpus
- Text Cohesion
- Similarity Score
- Long Short-term Memory
- Similarity Matrix
- Word Embedding
- Original Text
- Annotated Dataset
- Term Frequency-inverse Document Frequency
- Number Of Sentences
- Semantic Space
- Lemmatization
- Sentence In The Text
- Semantic Distance
- Sentence Embedding
- Target Text
- Distance Formula
- Pre-trained Word Embeddings
- Text Preprocessing