@@ -27,10 +27,17 @@ so the shape of your library is version-controlled even if the contents are not.
 ## Requirements
 
 ```bash
-pip3 install pymupdf
+pip3 install -r requirements.txt
 ```
 
-Python 3.10+. No other dependencies.
+Or install individually:
+
+```bash
+pip3 install pymupdf          # required for convert.py
+pip3 install internetarchive  # required for archive.org downloads
+```
+
+Python 3.10+.
 
 ---
 
@@ -51,6 +58,10 @@ Python 3.10+. No other dependencies.
 
 ### 1. Download a collection
 
+The source is auto-detected from the URL. Both modes share `--output-dir`, `--delay`, and `--dry-run`.
+
+**World Radio History** — scrapes PDF links from an archive page:
+
 ```bash
 # Preview what would be downloaded
 python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run
@@ -64,6 +75,36 @@ python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
     --filter "1970" --output-dir collections/eti/pdfs
 ```
 
+**archive.org** — downloads files from a single archive.org item by identifier.
+Each issue typically has two PDF variants: a plain image PDF and a `_text.pdf` with an
+Abbyy OCR text layer. The `--pdf-format` flag controls which variant is downloaded
+(`text` is the default since `convert.py` extracts from the OCR layer):
+
+```bash
+# Download all OCR PDFs from an archive.org item
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+    --output-dir collections/elektor/pdfs
+
+# Download only issues from a specific decade
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+    --output-dir collections/elektor/pdfs \
+    --year-from 1974 --year-to 1989
+
+# Download image-only PDFs (no OCR layer)
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+    --pdf-format image --output-dir collections/elektor/pdfs
+
+# Preview without downloading
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+    --year-from 1980 --dry-run
+```
+
+| Flag | Description | Default |
+| --- | --- | --- |
+| `--pdf-format` | `text` (_text.pdf, OCR), `image` (plain PDF), `both` | `text` |
+| `--year-from` | Only download files with a year >= this value | — |
+| `--year-to` | Only download files with a year <= this value | — |
+
 ### 2. Probe the collection structure
 
 ```bash