README in Markdown format

miso-belica · miso-belica · commit 63e93fcdaeeb · 2015-12-06T23:21:29.000+01:00
diff --git a/.gitignore b/.gitignore
@@ -5,6 +5,7 @@ __pycache__/
 /build
 /dist
 /.cache
+README.rst
 
 # tests
 .coverage
diff --git a/.travis.yml b/.travis.yml
@@ -14,7 +14,9 @@ before_install:
   - sudo apt-get update -qq
   - sudo apt-get install -qq gfortran libatlas-base-dev
   - sudo apt-get install -qq python-numpy
+  - sudo apt-get install -qq pandoc
 install:
+  - pandoc --from=markdown --to=rst README.md -o README.rst
   - python setup.py install
   - pip install -U pip wheel
   - pip install -U --use-wheel pytest pytest-cov
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,29 @@
+# Changelog
+
+## 0.4.0 (2015-12-04)
+-   Dropped support for Python 2.6 and 3.2. Only 2.7/3.3+ are officially supported now. Time to move :)
+-   CLI: Better message for unknown format.
+-   LexRank: fixed power method computation.
+-   Added some extra abbreviations (English, German) into tokenizer for better output.
+-   SumBasic: Added new summarization method - SumBasic. Thanks to [Julian Griggs](https://github.com/JulianGriggs).
+-   KL: Added new summarization method - KL. Thanks to [Julian Griggs](https://github.com/JulianGriggs).
+-   Added dependency [requests](http://docs.python-requests.org/en/latest/) to fix issues with downloading pages.
+-   Better documentation of expected Plaintext document format.
+
+## 0.3.0 (2014-06-07)
+-   Added possibility to specify format of input document for URL & stdin. Thanks to [@Lucas-C](https://github.com/Lucas-C).
+-   Added possibility to specify custom file with stop-words in CLI. Thanks to [@Lucas-C](https://github.com/Lucas-C).
+-   Added support for French language (added stopwords & stemmer). Thanks to [@Lucas-C](https://github.com/Lucas-C).
+-   Function `sumy.utils.get_stop_words` raises `LookupError` instead of `ValueError` for unknown language.
+-   Exception `LookupError` is raised for unknown language of stemmer instead of falling silently to `null_stemmer`.
+
+## 0.2.1 (2014-01-23)
+-   Fixed installation of my own readability fork. Added `breadability` to the dependencies instead of it [#8](https://github.com/miso-belica/sumy/issues/8). 
+    Thanks to [@pratikpoddar](https://github.com/pratikpoddar).
+
+## 0.2.0 (2014-01-18)
+-   Removed dependency on SciPy [#7](https://github.com/miso-belica/sumy/pull/7). Use `numpy.linalg.svd` implementation. 
+    Thanks to [Shantanu](https://github.com/baali).
+
+## 0.1.0 (2013-10-20)
+-   First public release.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
diff --git a/LICENSE.txt b/LICENSE.txt
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,4 +1,5 @@
+include README.md
 include README.rst
-include LICENSE.rst
-include CHANGELOG.rst
+include LICENSE.txt
+include CHANGELOG.md
 recursive-include sumy/data *
diff --git a/Makefile b/Makefile
@@ -8,6 +8,7 @@ test:
 	py.test
 
 publish: test
+	pandoc --from=markdown --to=rst README.md -o README.rst
 	${PYTHON} setup.py register sdist bdist_wheel
 	twine upload dist/*
 
diff --git a/README.md b/README.md
@@ -0,0 +1,149 @@
+# Automatic text summarizer
+
+[![image](https://api.travis-ci.org/miso-belica/sumy.png?branch=master)](https://travis-ci.org/miso-belica/sumy)
+
+A simple library and command line utility for extracting summaries from HTML
+pages or plain text. The package also contains a simple evaluation
+framework for text summaries. Implemented summarization methods:
+
+-   **Luhn** - heuristic method,
+    [reference](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5392672)
+-   **Edmundson** - heuristic method building on previous statistical research,
+    [reference](http://dl.acm.org/citation.cfm?doid=321510.321519)
+-   **Latent Semantic Analysis, LSA** - one of the algorithms from
+    <http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en>. I
+    think the author is using more advanced algorithms now.
+    [Steinberger, J. and Ježek, K. Using latent semantic analysis in text summarization and
+    summary evaluation. In Proceedings of ISIM '04. 2004. pp.
+    93-100.](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf)
+-   **LexRank** - Unsupervised approach inspired by algorithms PageRank
+    and HITS,
+    [reference](http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf)
+-   **TextRank** - a combination of a few resources that I
+    found on the internet. I really don't remember the sources. Probably
+    [Wikipedia](https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank)
+    and some papers on the 1st page of Google :)
+-   **SumBasic** - Method that is often used as a baseline in
+    the literature. Source: [Read about
+    SumBasic](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf)
+-   **KL-Sum** - Method that greedily adds sentences to a summary so
+    long as it decreases the KL Divergence. Source: [Read about
+    KL-Sum](http://www.aclweb.org/anthology/N09-1041)
+
+Here are some other summarizers:
+
+-   <https://github.com/thavelick/summarize/> - Python, TF (very simple)
+-   [Reduction](https://github.com/adamfabish/Reduction) - Python,
+    TextRank (simple)
+-   [Open Text Summarizer](http://libots.sourceforge.net/) - C, TF
+    without normalization
+-   [Simple program that summarize
+    text](https://github.com/xhresko/text-summarizer) - Python, TF
+    without normalization
+-   [Intro to Computational
+    Linguistics](https://github.com/kylehardgrave/summarizer) - Java,
+    LexRank
+-   [Sumtract: Second project for UW LING
+    572](https://github.com/stefanbehr/sumtract) - Python
+-   [TextTeaser](https://github.com/MojoJolo/textteaser) - Scala
+-   [PyTeaser](https://github.com/xiaoxu193/PyTeaser) - TextTeaser port
+    in Python
+-   [Automatic Document
+    Summarizer](https://github.com/himanshujindal/Automatic-Text-Summarizer) -
+    Java, Bipartite HITS (no sources)
+-   [Pythia](https://github.com/giorgosera/pythia/blob/dev/analysis/summarization/summarization.py) -
+    Python, LexRank & Centroid
+-   [SWING](https://github.com/WING-NUS/SWING) - Ruby
+-   [Topic Networks](https://github.com/bobflagg/Topic-Networks) - R,
+    topic models & bipartite graphs
+-   [Almus: Automatic Text
+    Summarizer](http://textmining.zcu.cz/?lang=en&section=download) -
+    Java, LSA (without source code)
+-   [Musutelsa](http://www.musutelsa.jamstudio.eu/) - Java, LSA
+    (always freezes)
+-   <http://mff.bajecni.cz/index.php> - C++
+-   [MEAD](http://www.summarization.com/mead/) - Perl, various methods +
+    evaluation framework
+
+## Installation
+
+Make sure you have [Python](http://www.python.org/) 2.7/3.3+ and
+[pip](https://crate.io/packages/pip/)
+([Windows](http://docs.python-guide.org/en/latest/starting/install/win/),
+[Linux](http://docs.python-guide.org/en/latest/starting/install/linux/))
+installed. Then simply run (preferred way):
+
+```sh
+$ [sudo] pip install sumy
+```
+
+Or for the fresh version:
+
+```sh
+$ [sudo] pip install git+git://github.com/miso-belica/sumy.git
+```
+
+## Usage
+
+Sumy contains a command line utility for quick summarization of documents.
+
+```sh
+$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
+$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
+$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
+$ sumy --help # for more info
+```
+
+Various evaluation methods for a given summarization method can be run
+with the commands below:
+
+```sh
+$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
+$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
+$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
+$ sumy_eval --help # for more info
+```
+
+## Python API
+
+Or you can use sumy as a library in your project.
+
+```python
+# -*- coding: utf8 -*-
+
+from __future__ import absolute_import
+from __future__ import division, print_function, unicode_literals
+
+from sumy.parsers.html import HtmlParser
+from sumy.parsers.plaintext import PlaintextParser
+from sumy.nlp.tokenizers import Tokenizer
+from sumy.summarizers.lsa import LsaSummarizer as Summarizer
+from sumy.nlp.stemmers import Stemmer
+from sumy.utils import get_stop_words
+
+
+LANGUAGE = "czech"
+SENTENCES_COUNT = 10
+
+
+if __name__ == "__main__":
+    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
+    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
+    # or for plain text files
+    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
+    stemmer = Stemmer(LANGUAGE)
+
+    summarizer = Summarizer(stemmer)
+    summarizer.stop_words = get_stop_words(LANGUAGE)
+
+    for sentence in summarizer(parser.document, SENTENCES_COUNT):
+        print(sentence)
+```
+
+## Tests
+
+Run tests via
+
+```sh
+$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5
+```
diff --git a/README.rst b/README.rst
diff --git a/setup.py b/setup.py
@@ -12,8 +12,7 @@
 
 
 with open("README.rst") as readme:
-    with open("CHANGELOG.rst") as changelog:
-        long_description = readme.read() + "\n\n" + changelog.read()
+    long_description = readme.read()
 
 
 setup(
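
Review note on the `long_description` hunk above: this commit adds `README.rst` to `.gitignore` and generates it with pandoc in `.travis.yml` and the `publish` target, so `open("README.rst")` would presumably raise on a checkout where pandoc has not been run yet. A minimal sketch of a fallback, assuming it is acceptable to use the raw Markdown text when the generated file is missing; the helper name `read_long_description` and its parameters are hypothetical and not part of this commit:

```python
import io
import os


def read_long_description(base_dir, candidates=("README.rst", "README.md")):
    """Return the text of the first README candidate found in base_dir.

    Hypothetical helper: prefers the pandoc-generated README.rst and falls
    back to README.md; returns an empty string if neither file exists.
    """
    for name in candidates:
        path = os.path.join(base_dir, name)
        if os.path.exists(path):
            # io.open accepts an encoding on both Python 2.7 and 3.x
            with io.open(path, encoding="utf-8") as readme:
                return readme.read()
    return ""
```

In `setup.py` this could be used as `long_description = read_long_description(os.path.dirname(os.path.abspath(__file__)))`. Whether a Markdown fallback renders acceptably on the package index is a separate question that this sketch does not address.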
@@ -74,7 +73,7 @@
         ]
     },
     classifiers=[
-        "Development Status :: 3 - Alpha",
+        "Development Status :: 4 - Beta",
         "Intended Audience :: Developers",
         "Intended Audience :: Education",
         "License :: OSI Approved :: Apache Software License",