This repository contains the automatically annotated citations for the ScisummNet dataset using the classifier trained on the annotated collection scisumm-corpus.
The dataset contains a collection of URLs that can be used to crawl a domain-specific dataset with manually annotated highlights.
Poli2Sum@ CL-SciSumm-19: Identify, Classify, and Summarize Cited Text Spans by means of Ensembles of Supervised Models
Published in 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) @ SIGIR 2019 (Vol. 2414, pp. 233–246), 2019
This paper presents the Poli2Sum approach to the 5th Computational Linguistics Scientific Document Summarization Shared Task (BIRNDL CL-SciSumm 2019).
Recommended citation: La Quatra, M., Cagliero, L., & Baralis, E. (2019). Poli2Sum@CL-SciSumm-19: Identify, Classify, and Summarize Cited Text Spans by means of Ensembles of Supervised Models. In 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) @ SIGIR 2019 (Vol. 2414, pp. 233–246). http://ceur-ws.org/Vol-2414/paper24.pdf
Combining Machine Learning and Natural Language Processing for Language-Specific, Multi-Lingual, and Cross-Lingual Text Summarization: A Wide-Ranging Overview
Published in Trends and Applications of Text Summarization Techniques, 2019
Recent advances in multimedia and web-based applications have eased access to large collections of textual documents. To automate the process of document analysis, the research community has devoted considerable effort to extracting short summaries of the document content.
Recommended citation: Cagliero, Luca, Paolo Garza, and Moreno La Quatra. "Combining Machine Learning and Natural Language Processing for Language-Specific, Multi-Lingual, and Cross-Lingual Text Summarization: A Wide-Ranging Overview." Trends and Applications of Text Summarization Techniques. IGI Global, 2020. 1-31. https://www.igi-global.com/chapter/combining-machine-learning-and-natural-language-processing-for-language-specific-multi-lingual-and-cross-lingual-text-summarization/235739
Published in Electronics, 9(1), 2020
In the context of hospitality management, a challenging research problem is to identify effective strategies to explain hotel reviews and ratings and their correlation with the urban context. Under this umbrella, the paper investigates the use of sentence-based embedding models to deeply explore the similarities and dissimilarities between cities in terms of the corresponding hotel reviews and the surrounding points of interest.
Recommended citation: Cagliero, L.; La Quatra, M.; Apiletti, D. From Hotel Reviews to City Similarities: A Unified Latent-Space Model. Electronics 2020, 9, 197. https://www.mdpi.com/2079-9292/9/1/197
Published in Scientometrics (2020), 2020
This paper proposes a new, more effective solution to the CL-SciSumm discourse facet classification task, which entails identifying for each cited text span what facet of the paper it belongs to from a predefined set of facets.
Recommended citation: La Quatra, M., Cagliero, L. & Baralis, E. Exploiting pivot words to classify and summarize discourse facets of scientific papers. Scientometrics (2020). https://doi.org/10.1007/s11192-020-03532-3
Published in Expert Systems With Applications (2020), 2020
This paper presents a supervised approach, based on regression techniques, with the twofold aim of automatically extracting highlights of past articles with missing annotations and simplifying the process of manually annotating new articles.
Recommended citation: Cagliero L. & La Quatra M., Extracting Highlights of Scientific Articles: a Supervised Summarization Approach, Expert Systems with Applications, 2020, 113659, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2020.113659.
Published in 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, 2020
The proposed methodology exploits the advancements in the Natural Language Understanding field to create a fine-tuned architecture able to summarize financial documents.
Recommended citation: La Quatra, M., & Cagliero, L. (2020, December). End-to-end Training For Financial Report Summarization. In Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (pp. 118-123). https://www.aclweb.org/anthology/2020.fnp-1.20/
Published in ACM SIGIR 2021, 2021
Timeline summarization aims at presenting long news stories in a compact manner. This paper proposes a new approach, namely Summarize Dates First, which focuses on first generating date-level summaries, then selecting the most relevant dates on top of the summarized knowledge. In the latter stage, it performs date aggregations to consider high-level temporal references as well.
Recommended citation: Moreno La Quatra, Luca Cagliero, Elena Baralis, Alberto Messina, and Maurizio Montagnuolo. 2021. Summarize Dates First: A Paradigm Shift in Timeline Summarization. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21). Association for Computing Machinery, New York, NY, USA, 418–427. DOI:https://doi.org/10.1145/3404835.3462954
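The two-stage ordering described in the abstract above (summarize each date first, then select dates using those summaries) can be illustrated with a toy sketch. This is not the paper's implementation: `summarize_date`, `date_relevance`, the `query_terms` argument, and the cutoff `k` are all hypothetical placeholders, and the date aggregation step is not covered.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def summarize_date(sentences: List[str]) -> str:
    """Placeholder date-level summarizer (assumption): keep the longest sentence."""
    return max(sentences, key=lambda s: len(s.split()))


def date_relevance(summary: str, query_terms: Set[str]) -> int:
    """Placeholder date scorer (assumption): count query-term hits in the summary."""
    return sum(w.strip(".,").lower() in query_terms for w in summary.split())


def build_timeline(
    dated_sentences: List[Tuple[str, str]],
    query_terms: Set[str],
    k: int = 2,
) -> List[Tuple[str, str]]:
    """Summarize every date first, then keep the k most relevant dates."""
    by_date: Dict[str, List[str]] = defaultdict(list)
    for date, sentence in dated_sentences:
        by_date[date].append(sentence)

    # Stage 1: one summary per date, computed before any date is discarded.
    summaries = {d: summarize_date(s) for d, s in by_date.items()}

    # Stage 2: rank dates using the summaries, not the raw sentences.
    ranked = sorted(
        summaries,
        key=lambda d: (-date_relevance(summaries[d], query_terms), d),
    )

    # ISO-formatted dates sort chronologically as plain strings.
    chosen = sorted(ranked[:k])
    return [(d, summaries[d]) for d in chosen]
```

Dates are assumed to be ISO-formatted strings such as `"2021-01-09"`, so the final sort yields chronological order.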
Published in IEEE COMPSAC 2021, 2021
The aim of this paper is to use unsupervised summarization methods to generate sentence-level summaries of the paper sections, which are then refined by applying an optimization step. It evaluates the quality of the output slides by taking into account the original paper structure as well. The results, achieved on a benchmark collection of papers and slides, show that unsupervised models performed better than supervised ones on specific paper facets.
Recommended citation: L. Cagliero and M. L. Quatra, "Automatic slides generation in the absence of training data," 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), 2021, pp. 103-108, doi: 10.1109/COMPSAC51774.2021.00025. https://doi.org/10.1109/COMPSAC51774.2021.00025
Published in Springer Scientometrics, 2021
This paper proposes a classification approach to automatically predict whether the cited snippets in the full text of the paper contain a significant amount of new content beyond the abstract and title. The proposed approach could support researchers in leveraging full-text article exploration for citation analysis. The experiments conducted on real scientific articles show promising results: the classifier has a 90% chance of correctly distinguishing between the full-text exploration case and the title-and-abstract-only case.
Recommended citation: La Quatra, M., Cagliero, L. & Baralis, E. Leveraging full-text article exploration for citation analysis. Scientometrics (2021). https://doi.org/10.1007/s11192-021-04117-4
Published in IEEE Access (2021), 2021
This paper proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source language (typically, English). The proposed inference method relies on a two-step process, which first automatically identifies domain-specific words and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language.
Recommended citation: L. Cagliero and M. La Quatra, "Inferring Multilingual Domain-Specific Word Embeddings From Large Document Corpora," in IEEE Access, vol. 9, pp. 137309-137321, 2021, doi: 10.1109/ACCESS.2021.3118093. https://doi.org/10.1109/ACCESS.2021.3118093
Supervised models are trained on a variety of data features related to the structure, semantics, and syntax of the text. The underlying idea is to effectively explore the latent connections between the citing context and the sentences in the reference paper.
Thanks to the worldwide diffusion of web-based applications, digital libraries are playing a fundamental role in giving access to research papers, thus allowing researchers to disseminate their main research findings. Our work focuses on extracting the sentences that best summarize the main topics and findings of the research manuscript in an automated manner.
Using a video presentation, I show some of the research trends investigated during my PhD studies. I give a very high-level overview of the multilingual and timeline summarization tasks that we address by using NLP and Deep Learning.
The summarization architecture proposed for the FNS 2020 shared task is based on a three-phase process.
- Preprocessing step: clean the input financial reports and annotate their content at sentence level.
- Training step: deep learning models are fine-tuned for the regression task exploiting the annotations obtained during the preprocessing step.
- Evaluation phase: applied at document level. The sentences of each annual report make a forward pass through the fine-tuned model to obtain an estimated relevance score. The final summary merges sentences according to the relevance score predicted by the fine-tuned architecture.
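The evaluation phase described above can be sketched roughly as follows. This is a minimal illustration and not the submitted system: `toy_score` is a hypothetical keyword counter standing in for the fine-tuned regression model, and the `max_words` budget is an assumption introduced only to bound the summary length.

```python
from typing import Callable, List


def build_summary(
    sentences: List[str],
    score_fn: Callable[[str], float],
    max_words: int = 50,
) -> str:
    """Rank sentences by predicted relevance and merge the best ones."""
    # Score every sentence of the document (one forward pass per sentence).
    scored = [(score_fn(s), i, s) for i, s in enumerate(sentences)]
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda t: (-t[0], t[1]))

    selected: List[str] = []
    used = 0
    for _, _, sentence in scored:
        n_words = len(sentence.split())
        if used + n_words > max_words:
            continue  # skip sentences that would exceed the word budget
        selected.append(sentence)
        used += n_words
    return " ".join(selected)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for the fine-tuned regressor: counts finance keywords."""
    keywords = {"revenue", "profit", "dividend"}
    return float(sum(w.strip(".,").lower() in keywords for w in sentence.split()))
```

In a real setting, `score_fn` would wrap the fine-tuned model's prediction for a single sentence; the selection logic stays the same.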
This is the video presentation for the paper “Summarize Dates First: A Paradigm Shift in Timeline Summarization”.
Teaching assistant for undergraduate course, Politecnico di Torino, DAUIN, 2019
This course is an introduction to databases for undergraduate students of management engineering. The training activities address the following topics:
Teaching assistant for undergraduate course, Politecnico di Torino, DAUIN, 2020
This course is an introduction to databases for undergraduate students of computer engineering. The training activities address the following topics:
Teaching assistant for master course, Politecnico di Torino, DAUIN, 2021
This is a master-level course for students in the “Data Science and Engineering” specialization. I’m involved as a teaching assistant for both in-class and lab practices. A non-exhaustive list of the course topics is reported below: