Commit a9e8cbb

Merge pull request #45 from miso-belica/feature-markdown
README in Markdown format
2 parents bef8b85 + 63e93fc commit a9e8cbb

10 files changed

Lines changed: 187 additions & 164 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ __pycache__/
 /build
 /dist
 /.cache
+README.rst
 
 # tests
 .coverage

.travis.yml

Lines changed: 2 additions & 0 deletions
@@ -14,7 +14,9 @@ before_install:
 - sudo apt-get update -qq
 - sudo apt-get install -qq gfortran libatlas-base-dev
 - sudo apt-get install -qq python-numpy
+- sudo apt-get install -qq pandoc
 install:
+- pandoc --from=markdown --to=rst README.md -o README.rst
 - python setup.py install
 - pip install -U pip wheel
 - pip install -U --use-wheel pytest pytest-cov
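
The `pandoc` step added above also appears in the Makefile's `publish` target. As a rough illustration only, the same conversion could be driven from Python. The helper names `pandoc_command` and `convert_readme` are hypothetical and not part of this commit; only the pandoc flags are taken from the diff.

```python
import shutil
import subprocess


def pandoc_command(source="README.md", target="README.rst"):
    """Build the pandoc invocation from the diff as an argument list."""
    return [
        "pandoc",
        "--from=markdown",
        "--to=rst",
        source,
        "-o",
        target,
    ]


def convert_readme(source="README.md", target="README.rst"):
    """Run pandoc if it is installed; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        # pandoc is an external binary, so report its absence instead of failing.
        return False
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(source, target), check=True)
    return True
```

Keeping the argument list in one function would let the CI configuration and the release step share a single definition of the flags.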

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Changelog
+
+## 0.4.0 (2015-12-04)
+- Dropped support for Python 2.6 and 3.2. Only 2.7/3.3+ are officially supported now. Time to move :)
+- CLI: Better message for unknown format.
+- LexRank: fixed power method computation.
+- Added some extra abbreviations (english, german) into tokenizer for better output.
+- SumBasic: Added new summarization method - SumBasic. Thanks to [Julian Griggs](https://github.com/JulianGriggs).
+- KL: Added new summarization method - KL. Thanks to [Julian Griggs](https://github.com/JulianGriggs).
+- Added dependency [requests](http://docs.python-requests.org/en/latest/) to fix issues with downloading pages.
+- Better documentation of expected Plaintext document format.
+
+## 0.3.0 (2014-06-07)
+- Added possibility to specify format of input document for URL & stdin. Thanks to [@Lucas-C](https://github.com/Lucas-C).
+- Added possibility to specify custom file with stop-words in CLI. Thanks to [@Lucas-C](https://github.com/Lucas-C).
+- Added support for French language (added stopwords & stemmer). Thanks to [@Lucas-C](https://github.com/Lucas-C).
+- Function `sumy.utils.get_stop_words` raises `LookupError` instead of `ValueError` for unknown language.
+- Exception `LookupError` is raised for unknown language of stemmer instead of falling silently to `null_stemmer`.
+
+## 0.2.1 (2014-01-23)
+- Fixed installation of my own readability fork. Added `breadability` to the dependencies instead of it [#8](https://github.com/miso-belica/sumy/issues/8).
+Thanks to [@pratikpoddar](https://github.com/pratikpoddar).
+
+## 0.2.0 (2014-01-18)
+- Removed dependency on SciPy [#7](https://github.com/miso-belica/sumy/pull/7). Use `numpy.linalg.svd` implementation.
+Thanks to [Shantanu](https://github.com/baali).
+
+## 0.1.0 (2013-10-20)
+- First public release.
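
The 0.3.0 entries say that an unknown language now raises `LookupError` rather than `ValueError`. Since `LookupError` is not a subclass of `ValueError`, callers that caught the old exception would need updating. The sketch below shows one way a caller might handle this. It takes the loader as a parameter so it runs without sumy installed; `stop_words_or_default` and `fake_loader` are hypothetical names used only for illustration.

```python
def stop_words_or_default(loader, language, default=frozenset()):
    """Call loader(language); fall back to default on an unknown language.

    `loader` stands in for a function such as sumy.utils.get_stop_words,
    which per the changelog raises LookupError for an unknown language.
    """
    try:
        return frozenset(loader(language))
    except LookupError:
        return default


def fake_loader(language):
    """Hypothetical stand-in used only to exercise the helper."""
    words = {"english": ["the", "a"]}
    # A missing key raises KeyError, which is a subclass of LookupError.
    return words[language]
```

Catching `LookupError` also covers `KeyError` and `IndexError`, so a real caller may prefer to keep the `try` block as narrow as this one.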

CHANGELOG.rst

Lines changed: 0 additions & 35 deletions
This file was deleted.
File renamed without changes.

MANIFEST.in

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,5 @@
+include README.md
 include README.rst
-include LICENSE.rst
-include CHANGELOG.rst
+include LICENSE.txt
+include CHANGELOG.md
 recursive-include sumy/data *

Makefile

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ test:
 	py.test
 
 
 publish: test
+	pandoc --from=markdown --to=rst README.md -o README.rst
 	${PYTHON} setup.py register sdist bdist_wheel
 	twine upload dist/*
 

README.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+# Automatic text summarizer
+
+[![image](https://api.travis-ci.org/miso-belica/sumy.png?branch=master)](https://travis-ci.org/miso-belica/sumy)
+
+Simple library and command line utility for extracting summary from HTML
+pages or plain texts. The package also contains simple evaluation
+framework for text summaries. Implemented summarization methods:
+
+- **Luhn** - heurestic method,
+[reference](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5392672)
+- **Edmundson** heurestic method with previous statistic research,
+[reference](http://dl.acm.org/citation.cfm?doid=321510.321519)
+- **Latent Semantic Analysis, LSA** - one of the algorithm from
+<http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en> I
+think the author is using more advanced algorithms now.
+[Steinberger, J. a Ježek, K. Using latent semantic an and
+summary evaluation. In In Proceedings ISIM '04. 2004. S.
+93-100.](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf)
+- **LexRank** - Unsupervised approach inspired by algorithms PageRank
+and HITS,
+[reference](http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf)
+- **TextRank** - some sort of combination of a few resources that I
+found on the internet. I really don't remember the sources. Probably
+[Wikipedia](https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank)
+and some papers in 1st page of Google :)
+- **SumBasic** - Method that is often used as a baseline in
+the literature. Source: [Read about
+SumBasic](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf)
+- **KL-Sum** - Method that greedily adds sentences to a summary so
+long as it decreases the KL Divergence. Source: [Read about
+KL-Sum](http://www.aclweb.org/anthology/N09-1041)
+
+Here are some other summarizers:
+
+- <https://github.com/thavelick/summarize/> - Python, TF (very simple)
+- [Reduction](https://github.com/adamfabish/Reduction) - Python,
+TextRank (simple)
+- [Open Text Summarizer](http://libots.sourceforge.net/) - C, TF
+without normalization
+- [Simple program that summarize
+text](https://github.com/xhresko/text-summarizer) - Python, TF
+without normalization
+- [Intro to Computational
+Linguistics](https://github.com/kylehardgrave/summarizer) - Java,
+LexRank
+- [Sumtract: Second project for UW LING
+572](https://github.com/stefanbehr/sumtract) - Python
+- [TextTeaser](https://github.com/MojoJolo/textteaser) - Scala
+- [PyTeaser](https://github.com/xiaoxu193/PyTeaser) - TextTeaser port
+in Python
+- [Automatic Document
+Summarizer](https://github.com/himanshujindal/Automatic-Text-Summarizer) -
+Java, Bipartite HITS (no sources)
+- [Pythia](https://github.com/giorgosera/pythia/blob/dev/analysis/summarization/summarization.py) -
+Python, LexRank & Centroid
+- [SWING](https://github.com/WING-NUS/SWING) - Ruby
+- [Topic Networks](https://github.com/bobflagg/Topic-Networks) - R,
+topic models & bipartite graphs
+- [Almus: Automatic Text
+Summarizer](http://textmining.zcu.cz/?lang=en&section=download) -
+Java, LSA (without source code)
+- [Musutelsa](http://www.musutelsa.jamstudio.eu/) - Java, LSA
+(always freezes)
+- <http://mff.bajecni.cz/index.php> - C++
+- [MEAD](http://www.summarization.com/mead/) - Perl, various methods +
+evaluation framework
+
+## Installation
+
+Make sure you have [Python](http://www.python.org/) 2.7/3.3+ and
+[pip](https://crate.io/packages/pip/)
+([Windows](http://docs.python-guide.org/en/latest/starting/install/win/),
+[Linux](http://docs.python-guide.org/en/latest/starting/install/linux/))
+installed. Run simply (preferred way):
+
+```sh
+$ [sudo] pip install sumy
+```
+
+Or for the fresh version:
+
+```sh
+$ [sudo] pip install git+git://github.com/miso-belica/sumy.git
+```
+
+## Usage
+
+Sumy contains command line utility for quick summarization of documents.
+
+```sh
+$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
+$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
+$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
+$ sumy --help # for more info
+```
+
+Various evaluation methods for some summarization method can be executed
+by commands below:
+
+```sh
+$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
+$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
+$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
+$ sumy_eval --help # for more info
+```
+
+## Python API
+
+Or you can use sumy like a library in your project.
+
+```python
+# -*- coding: utf8 -*-
+
+from __future__ import absolute_import
+from __future__ import division, print_function, unicode_literals
+
+from sumy.parsers.html import HtmlParser
+from sumy.parsers.plaintext import PlaintextParser
+from sumy.nlp.tokenizers import Tokenizer
+from sumy.summarizers.lsa import LsaSummarizer as Summarizer
+from sumy.nlp.stemmers import Stemmer
+from sumy.utils import get_stop_words
+
+
+LANGUAGE = "czech"
+SENTENCES_COUNT = 10
+
+
+if __name__ == "__main__":
+    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
+    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
+    # or for plain text files
+    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
+    stemmer = Stemmer(LANGUAGE)
+
+    summarizer = Summarizer(stemmer)
+    summarizer.stop_words = get_stop_words(LANGUAGE)
+
+    for sentence in summarizer(parser.document, SENTENCES_COUNT):
+        print(sentence)
+```
+
+## Tests
+
+Run tests via
+
+```sh
+$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5
+```
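
The CLI examples in the README pass `--length` both as a plain count (`--length=10`) and as a percentage (`--length=3%`). This commit does not show how sumy resolves the two forms. The sketch below is one plausible reading and assumes the percentage is taken of the total sentence count, rounded down, with at least one sentence kept. `resolve_length` is a hypothetical name, not sumy's API.

```python
def resolve_length(spec, total_sentences):
    """Turn a length spec such as "10" or "3%" into a sentence count."""
    spec = str(spec).strip()
    if spec.endswith("%"):
        percentage = int(spec[:-1])
        # Assumed rule: floor the share, but keep at least one sentence.
        count = max(1, total_sentences * percentage // 100)
    else:
        count = int(spec)
    # Never ask for more sentences than the document has.
    return min(count, total_sentences)
```

Under these assumptions, `resolve_length("3%", 200)` gives 6 and `resolve_length("10", 4)` gives 4.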

README.rst

Lines changed: 0 additions & 124 deletions
This file was deleted.

setup.py

Lines changed: 2 additions & 3 deletions
@@ -12,8 +12,7 @@
 
 
 with open("README.rst") as readme:
-    with open("CHANGELOG.rst") as changelog:
-        long_description = readme.read() + "\n\n" + changelog.read()
+    long_description = readme.read()
 
 
 setup(
@@ -74,7 +73,7 @@
         ]
     },
     classifiers=[
-        "Development Status :: 3 - Alpha",
+        "Development Status :: 3 - Beta",
         "Intended Audience :: Developers",
         "Intended Audience :: Education",
         "License :: OSI Approved :: Apache Software License",
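
Because this commit adds `README.rst` to `.gitignore` and generates it with pandoc, `open("README.rst")` in `setup.py` would fail on a checkout where the conversion has not been run yet. The following is a minimal sketch of a more forgiving loader. It assumes that falling back to `README.md`, or to an empty description, is acceptable; `read_long_description` is a hypothetical helper and not part of the commit.

```python
import io
import os


def read_long_description(directory=".", candidates=("README.rst", "README.md")):
    """Return the text of the first README candidate that exists, else ""."""
    for name in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # io.open accepts an encoding on both Python 2.7 and 3.x.
            with io.open(path, encoding="utf-8") as readme:
                return readme.read()
    return ""
```

In `setup.py` this could be used as `long_description = read_long_description()`. The generated reStructuredText file is still preferred whenever it is present.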
