# Automatic text summarizer

[Build status](https://travis-ci.org/miso-belica/sumy)

A simple library and command-line utility for extracting summaries from HTML
pages or plain text. The package also contains a simple evaluation
framework for text summaries. Implemented summarization methods:

- **Luhn** - heuristic method,
    [reference](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5392672)
- **Edmundson** - heuristic method building on earlier statistical research,
    [reference](http://dl.acm.org/citation.cfm?doid=321510.321519)
- **Latent Semantic Analysis, LSA** - one of the algorithms from
    <http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en>. I
    think the author is using more advanced algorithms now.
    [Steinberger, J. and Ježek, K. Using Latent Semantic Analysis in Text
    Summarization and Summary Evaluation. In Proceedings of ISIM '04. 2004.
    pp. 93-100.](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf)
- **LexRank** - unsupervised approach inspired by the PageRank and HITS
    algorithms,
    [reference](http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf)
- **TextRank** - a combination of a few resources that I found on the
    internet. I don't remember the exact sources; probably
    [Wikipedia](https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank)
    and some papers from the first page of Google results :)
- **SumBasic** - method that is often used as a baseline in
    the literature. Source: [Read about
    SumBasic](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf)
- **KL-Sum** - method that greedily adds sentences to a summary as
    long as doing so decreases the KL divergence. Source: [Read about
    KL-Sum](http://www.aclweb.org/anthology/N09-1041)
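
To make the greedy idea behind KL-Sum more concrete, here is a minimal, standalone sketch in plain Python 3. It is not sumy's implementation: the helper names (`tokenize`, `distribution`, `kl_divergence`, `kl_sum`), the regex tokenizer, and the additive smoothing constant are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative tokenizer)."""
    return re.findall(r"\w+", text.lower())


def distribution(words):
    """Relative frequency of each word in a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(doc_dist, summary_words, smoothing=1e-3):
    """KL(document || summary), smoothing the summary so no probability is zero."""
    counts = Counter(summary_words)
    total = len(summary_words) + smoothing * len(doc_dist)
    return sum(
        p * math.log(p / ((counts[word] + smoothing) / total))
        for word, p in doc_dist.items()
    )


def kl_sum(sentences, max_sentences):
    """Greedily add the sentence that lowers the divergence the most."""
    doc_dist = distribution([w for s in sentences for w in tokenize(s)])
    chosen, summary_words = [], []
    current = kl_divergence(doc_dist, summary_words)
    while len(chosen) < max_sentences:
        best_index, best_score = None, current
        for index, sentence in enumerate(sentences):
            if index in chosen:
                continue
            score = kl_divergence(doc_dist, summary_words + tokenize(sentence))
            if score < best_score:
                best_index, best_score = index, score
        if best_index is None:  # no remaining sentence decreases the divergence
            break
        chosen.append(best_index)
        summary_words += tokenize(sentences[best_index])
        current = best_score
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `kl_sum(list_of_sentences, 3)` returns at most three sentences. The loop stops early when no remaining sentence would bring the summary's word distribution closer to the document's.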

Here are some other summarizers:

- <https://github.com/thavelick/summarize/> - Python, TF (very simple)
- [Reduction](https://github.com/adamfabish/Reduction) - Python,
    TextRank (simple)
- [Open Text Summarizer](http://libots.sourceforge.net/) - C, TF
    without normalization
- [Simple program that summarizes
    text](https://github.com/xhresko/text-summarizer) - Python, TF
    without normalization
- [Intro to Computational
    Linguistics](https://github.com/kylehardgrave/summarizer) - Java,
    LexRank
- [Sumtract: Second project for UW LING
    572](https://github.com/stefanbehr/sumtract) - Python
- [TextTeaser](https://github.com/MojoJolo/textteaser) - Scala
- [PyTeaser](https://github.com/xiaoxu193/PyTeaser) - TextTeaser port
    in Python
- [Automatic Document
    Summarizer](https://github.com/himanshujindal/Automatic-Text-Summarizer) -
    Java, Bipartite HITS (no sources)
- [Pythia](https://github.com/giorgosera/pythia/blob/dev/analysis/summarization/summarization.py) -
    Python, LexRank & Centroid
- [SWING](https://github.com/WING-NUS/SWING) - Ruby
- [Topic Networks](https://github.com/bobflagg/Topic-Networks) - R,
    topic models & bipartite graphs
- [Almus: Automatic Text
    Summarizer](http://textmining.zcu.cz/?lang=en&section=download) -
    Java, LSA (without source code)
- [Musutelsa](http://www.musutelsa.jamstudio.eu/) - Java, LSA
    (always freezes)
- <http://mff.bajecni.cz/index.php> - C++
- [MEAD](http://www.summarization.com/mead/) - Perl, various methods +
    evaluation framework

## Installation

Make sure you have [Python](http://www.python.org/) 2.7/3.3+ and
[pip](https://crate.io/packages/pip/)
([Windows](http://docs.python-guide.org/en/latest/starting/install/win/),
[Linux](http://docs.python-guide.org/en/latest/starting/install/linux/))
installed. Then simply run (the preferred way):

```sh
$ [sudo] pip install sumy
```

Or, for the latest development version:

```sh
$ [sudo] pip install git+https://github.com/miso-belica/sumy.git
```

## Usage

Sumy contains a command-line utility for quick summarization of documents.

```sh
$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
```

The evaluation methods for a given summarization method can be run with
the commands below:

```sh
$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
```
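
The measures `sumy_eval` reports are not listed here, so the following is only a generic sketch of one common way to compare an extracted summary with a reference summary: sentence-level precision, recall and F-score. The function name `precision_recall_f1` is hypothetical and is not part of sumy's API.

```python
def precision_recall_f1(evaluated, reference):
    """Compare two summaries given as collections of sentences.

    precision: share of evaluated sentences that appear in the reference
    recall:    share of reference sentences that appear in the evaluated summary
    """
    evaluated, reference = set(evaluated), set(reference)
    if not evaluated or not reference:
        raise ValueError("both summaries must contain at least one sentence")

    matched = len(evaluated & reference)
    precision = matched / len(evaluated)
    recall = matched / len(reference)
    # The F-score is the harmonic mean; define it as 0 when nothing matches.
    f1 = 0.0 if matched == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Exact sentence matching suits extractive summaries, where both summaries are built from sentences of the same document. For abstractive or paraphrased references, a word-overlap measure would be a better fit.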

## Python API

You can also use sumy as a library in your project.

```python
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "czech"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    # Download the page and parse it into a document
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    # Configure the LSA summarizer with a stemmer and stop words for the language
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    # Print the selected sentences, one per line
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
```

## Tests

Run the tests via:

```sh
$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5
```