fix(prepare): verify cache size integrity against server Content-Length #608

Open
aniruddhaadak80 wants to merge 1 commit into karpathy:master from aniruddhaadak80:fix/issue-215-shard-cache-integrity

Conversation

@aniruddhaadak80

Validates cached dataset shards by sending a HEAD request and comparing each local file's size with the server's Content-Length, redownloading any shard that is corrupted or incomplete. Also verifies the size of each file after it is downloaded. Fixes #215.

Copilot AI review requested due to automatic review settings June 13, 2026 04:30

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves shard download integrity by validating cached/downloaded parquet shard sizes against the remote Content-Length and redownloading when a mismatch is detected.

Changes:

  • Add HEAD-based validation for existing cached shards using Content-Length.
  • Verify streamed downloads match Content-Length before replacing the final file.
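
The cache check described in the list above can be sketched as two small helpers. This is a minimal illustration, not the code in the PR: the names `parse_content_length` and `cached_shard_is_valid` are hypothetical, and the policy of accepting any non-empty file when the server reports no usable Content-Length is an assumption.

```python
import os
import tempfile
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as a positive int, or None if missing or invalid."""
    raw = headers.get("Content-Length")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Header absent (None) or not a number.
        return None
    return value if value > 0 else None


def cached_shard_is_valid(filepath: str, expected_size: Optional[int]) -> bool:
    """Decide whether a cached shard can be reused.

    Assumption: when the expected size is unknown, any non-empty file is accepted.
    """
    if not os.path.exists(filepath):
        return False
    local_size = os.path.getsize(filepath)
    if expected_size is None:
        return local_size > 0
    return local_size == expected_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard_00000.parquet")
        with open(path, "wb") as f:
            f.write(b"x" * 10)
        expected = parse_content_length({"Content-Length": "10"})
        print(cached_shard_is_valid(path, expected))
```

In practice the headers would come from a HEAD response. Note that `requests.head` does not follow redirects by default, so if the shard URL redirects, the header would need to come from a call made with `allow_redirects=True`.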

Comment thread prepare.py
Comment on lines 58 to +60
    filename = f"shard_{index:05d}.parquet"
    filepath = os.path.join(DATA_DIR, filename)
    url = f"{BASE_URL}/{filename}"
Comment thread prepare.py
Comment on lines +63 to +77
    try:
        response = requests.head(url, timeout=10)
        response.raise_for_status()
        content_length = int(response.headers.get("Content-Length", 0))
        if content_length > 0:
            local_size = os.path.getsize(filepath)
            if local_size == content_length:
                return True
            else:
                print(f" Cached {filename} is corrupted or incomplete ({local_size} vs expected {content_length} bytes). Redownloading...")
        else:
            return True
    except Exception:
        if os.path.getsize(filepath) > 0:
            return True
Comment thread prepare.py
Comment on lines +90 to 94
    if content_length > 0:
        downloaded_size = os.path.getsize(temp_path)
        if downloaded_size != content_length:
            raise IOError(f"Truncated download: expected {content_length} bytes, got {downloaded_size} bytes")
    os.rename(temp_path, filepath)
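
The verify-then-rename step in the snippet above can be isolated into a small helper, which makes it easy to exercise without a network. This is a sketch under assumptions, not the PR's implementation: `finalize_download` is a hypothetical name, deleting the temporary file on a mismatch is an assumed cleanup policy, and `os.replace` is swapped in for `os.rename` because it overwrites an existing destination on every platform, whereas `os.rename` raises on Windows when the target already exists.

```python
import os
import tempfile
from typing import Optional


def finalize_download(temp_path: str, final_path: str,
                      expected_size: Optional[int] = None) -> None:
    """Move temp_path to final_path only if its size matches expected_size.

    When expected_size is None the size check is skipped.
    """
    actual_size = os.path.getsize(temp_path)
    if expected_size is not None and actual_size != expected_size:
        # Assumed policy: discard the partial file so it is never mistaken for a shard.
        os.remove(temp_path)
        raise IOError(
            f"Truncated download: expected {expected_size} bytes, got {actual_size} bytes"
        )
    # os.replace overwrites an existing (possibly corrupted) cached file.
    os.replace(temp_path, final_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        temp_file = os.path.join(tmp, "shard_00000.parquet.tmp")
        final_file = os.path.join(tmp, "shard_00000.parquet")
        with open(temp_file, "wb") as f:
            f.write(b"x" * 16)
        finalize_download(temp_file, final_file, expected_size=16)
        print(os.path.exists(final_file))
```

Because the final path is only written by the replace call, a reader of the cache directory sees either the old file or the complete new one, never a partial download.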
Comment thread prepare.py
Comment on lines 82 to 83
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
Development

Successfully merging this pull request may close these issues.

Harden downloaded dataset shard cache in prepare.py

2 participants