Commit b8785d9

Merge pull request #25 from ali5ter/feature/archive-org-download
Add archive.org download support to download.py
2 parents f5acbe3 + c74570e commit b8785d9

3 files changed

Lines changed: 343 additions & 71 deletions

README.md

Lines changed: 43 additions & 2 deletions
@@ -27,10 +27,17 @@ so the shape of your library is version-controlled even if the contents are not.
 ## Requirements
 
 ```bash
-pip3 install pymupdf
+pip3 install -r requirements.txt
 ```
 
-Python 3.10+. No other dependencies.
+Or install individually:
+
+```bash
+pip3 install pymupdf # required for convert.py
+pip3 install internetarchive # required for archive.org downloads
+```
+
+Python 3.10+.
 
 ---
 
@@ -51,6 +58,10 @@ Python 3.10+. No other dependencies.
 
 ### 1. Download a collection
 
+The source is auto-detected from the URL. Both modes share `--output-dir`, `--delay`, and `--dry-run`.
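
The diff does not show how `download.py` performs the auto-detection, so the following is only a minimal sketch of one way to do it: dispatch on the URL's host name, and read the archive.org item identifier from a `/details/<identifier>` path. The function names `detect_source` and `archive_org_identifier` and the returned labels are hypothetical, not taken from the script.

```python
from urllib.parse import urlparse


def detect_source(url: str) -> str:
    """Return a label for the download mode implied by the URL's host.

    The labels are illustrative; the real script may use different names.
    """
    host = (urlparse(url).hostname or "").lower()
    if host == "archive.org" or host.endswith(".archive.org"):
        return "archive_org"
    if host == "worldradiohistory.com" or host.endswith(".worldradiohistory.com"):
        return "world_radio_history"
    raise ValueError(f"Unsupported source: {url}")


def archive_org_identifier(url: str) -> str:
    """Extract the item identifier from an archive.org /details/<identifier> URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "details":
        return parts[1]
    raise ValueError(f"No item identifier found in: {url}")
```

Matching on the parsed host rather than on a substring of the whole URL avoids misclassifying a URL that merely mentions `archive.org` in its path or query string.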
+
+**World Radio History** — scrapes PDF links from an archive page:
+
 ```bash
 # Preview what would be downloaded
 python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run
@@ -64,6 +75,36 @@ python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
 --filter "1970" --output-dir collections/eti/pdfs
 ```
 
+**archive.org** — downloads files from a single archive.org item by identifier.
+Each issue typically has two PDF variants: a plain image PDF and a `_text.pdf` with an
+Abbyy OCR text layer. The `--pdf-format` flag controls which variant is downloaded
+(`text` is the default since `convert.py` extracts from the OCR layer):
+
+```bash
+# Download all OCR PDFs from an archive.org item
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+--output-dir collections/elektor/pdfs
+
+# Download only issues from a specific decade
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+--output-dir collections/elektor/pdfs \
+--year-from 1974 --year-to 1989
+
+# Download image-only PDFs (no OCR layer)
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+--pdf-format image --output-dir collections/elektor/pdfs
+
+# Preview without downloading
+python3 download.py "https://archive.org/details/ElektorMagazine" \
+--year-from 1980 --dry-run
+```
+
+| Flag | Description | Default |
+| --- | --- | --- |
+| `--pdf-format` | `text` (_text.pdf, OCR), `image` (plain PDF), `both` | `text` |
+| `--year-from` | Only download files with a year >= this value ||
+| `--year-to` | Only download files with a year <= this value ||
+
 ### 2. Probe the collection structure
 
 ```bash
