Commit 46dd434

viki-sh and claude committed
Add comprehensive README documenting full project
Covers data pipeline (PDFs → Claude API → PostgreSQL → JSON), all 8 pages, project structure, tech stack, setup instructions, deployment, data types, and severity score calculation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent db479f3 commit 46dd434

1 file changed

Lines changed: 241 additions & 8 deletions

README.md

@@ -1,19 +1,175 @@
1-
# Privacy Jury
1+
# The Privacy Jury
22

3-
A public gallery of privacy violation cases and enforcement actions.
3+
A global registry of **771 data privacy enforcement cases** across 7 jurisdictions, totaling **$507M+** in fines. Browse cases, compare enforcement actions side-by-side, explore jurisdictions on an interactive map, and learn what privacy enforcement terms actually mean.
44

55
**Live site**: [jury.privacydev.org](https://jury.privacydev.org)
66

7+
---
8+
9+
## How It Works
10+
11+
The project has two halves: a **data pipeline** that extracts structured case data from legal PDFs using Claude AI, and a **React frontend** that presents it as an interactive gallery.
12+
13+
### Data Pipeline
14+
15+
```
16+
Google Drive (PDFs)
17+
|
18+
v
19+
Python Agent (agent.py) ---> Claude API (extracts structured fields)
20+
|
21+
v
22+
PostgreSQL Database (cases table)
23+
|
24+
v
25+
Export Script (export_to_frontend.py) ---> generatedCases.json
26+
|
27+
v
28+
React Frontend (static JSON import at build time)
29+
```
30+
31+
1. **Source documents** — Complaint filings, consent orders, compliance decisions, and penalty notices are collected from regulator websites and stored in [Google Drive](https://drive.google.com/drive/folders/1j3XpwO0N2ttEjjVin-x-pHpq3KT3gwYj), organized by jurisdiction
32+
2. **PDF processing** — `files/agent.py` watches an inbox folder, extracts text from PDFs, and sends them to the Claude API with a structured extraction prompt
33+
3. **Claude extraction** — Claude parses each legal document and returns structured fields: company name, jurisdiction, violation types, legal bases, fines, impacted individuals, claims vs reality, regulatory findings, and more
34+
4. **Database storage** — Extracted data is stored in PostgreSQL with the full JSON payload
35+
5. **Frontend export** — `files/export_to_frontend.py` reads the database, calculates derived fields (severity scores, fine displays), and exports everything to `src/data/generatedCases.json`
36+
6. **Static frontend** — The React app imports the JSON at build time. No runtime API calls or database connections
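
As a rough illustration of step 5, the sketch below adds a display string to each case and writes the result as JSON. It is not the code in `export_to_frontend.py`: the function names `format_fine` and `export_cases`, the display format, and the use of plain dictionaries in place of database rows are all assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def format_fine(amount: float) -> str:
    """Hypothetical display format: $1.2B, $5M, $250K, $900."""
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if amount >= threshold:
            # One decimal place, with a trailing ".0" trimmed away.
            return f"${amount / threshold:.1f}".rstrip("0").rstrip(".") + suffix
    return f"${amount:,.0f}"


def export_cases(rows: Iterable[Mapping], out_path: str) -> int:
    """Add derived fields to each case and write them all to one JSON file.

    `rows` stands in for records read from PostgreSQL; the real script
    presumably queries the database instead of taking dictionaries.
    """
    cases = []
    for row in rows:
        case = dict(row)
        case["fineDisplay"] = format_fine(case.get("fineAmount") or 0)
        cases.append(case)

    target = Path(out_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(cases)
```

Writing a single static file is what lets the frontend import the data at build time with no server behind it.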
37+
38+
### Severity Score
39+
40+
Each case gets a deterministic severity rating (1-5) based on:
41+
- **Data sensitivity (1-3):** Health/biometric/children/financial = 3, Location/identity/credit = 2, Everything else = 1
42+
- **People impacted (0-2):** 1M+ = 2, 10K to under 1M = 1, <10K or unknown = 0
43+
- **Final score** = data + people, clamped to [1, 5]
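
The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the category strings, the function names, and the assumption that the impacted count is available as an integer (or `None` when unknown) are all hypothetical.

```python
from typing import Iterable, Optional

# Hypothetical category labels; the README names the tiers but not the
# exact strings stored in the dataset.
HIGH_SENSITIVITY = {"health", "biometric", "children", "financial"}
MEDIUM_SENSITIVITY = {"location", "identity", "credit"}


def data_sensitivity(data_types: Iterable[str]) -> int:
    """Return 3, 2, or 1 according to the most sensitive data type present."""
    types = {t.lower() for t in data_types}
    if types & HIGH_SENSITIVITY:
        return 3
    if types & MEDIUM_SENSITIVITY:
        return 2
    return 1


def people_impact(count: Optional[int]) -> int:
    """Return 2 for 1M+, 1 for 10K up to 1M, 0 for fewer or unknown."""
    if count is None:
        return 0
    if count >= 1_000_000:
        return 2
    if count >= 10_000:
        return 1
    return 0


def severity_score(data_types: Iterable[str], impacted: Optional[int]) -> int:
    """Sum both components and clamp the result to the range [1, 5]."""
    raw = data_sensitivity(data_types) + people_impact(impacted)
    return max(1, min(5, raw))
```

For example, a case involving health data and two million affected people scores 3 + 2 = 5, while a case with only email addresses and an unknown number of affected people scores 1 + 0 = 1.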
44+
45+
---
46+
47+
## Pages
48+
49+
| Page | Route | Description |
50+
|------|-------|-------------|
51+
| **Cases** | `/` | Searchable, filterable grid of all 771 cases with jurisdiction, sector, violation type, and sort controls |
52+
| **Case Detail** | `/case/:id` | Full case breakdown — what they did, why they were wrong, claims vs reality, legal findings, outcome, attached PDFs |
53+
| **Compare** | `/compare` | Matrix view (patterns across jurisdictions/violations/sectors) and side-by-side comparison of up to 3 individual cases |
54+
| **Explore** | `/explore` | Interactive world map highlighting 7 jurisdictions; click a region to see its enforcement framework, key laws, and dataset statistics |
55+
| **Leaderboard** | `/leaderboard` | Rankings — top companies by fines, most active jurisdictions, most common violations and sectors |
56+
| **Learn** | `/learn` | Educational glossary explaining enforcement outcomes, violation types, and key legal concepts with cross-references |
57+
| **About** | `/about` | Project information and attribution |
58+
59+
---
60+
61+
## Jurisdictions Covered
62+
63+
| Jurisdiction | Abbreviation | Region |
64+
|---|---|---|
65+
| Federal Trade Commission | US FTC | United States |
66+
| California DOJ | CA DOJ | United States (California) |
67+
| Information Commissioner's Office | UK ICO | United Kingdom |
68+
| Personal Data Protection Commission | SG PDPC | Singapore |
69+
| General Data Protection Regulation | EU GDPR | European Union |
70+
| European Data Protection Board | EU EDPB | European Union |
71+
| Office of the Australian Information Commissioner | AU OAIC | Australia |
72+
73+
---
74+
775
## Tech Stack
876

9-
- React + TypeScript
10-
- Vite
11-
- shadcn-ui + Tailwind CSS
12-
- Recharts (data visualization)
77+
### Frontend
78+
- **React 18** + **TypeScript 5** — UI framework
79+
- **Vite 5** — Build tool and dev server
80+
- **Tailwind CSS 3** — Utility-first styling
81+
- **shadcn/ui** + **Radix UI** — Component library (50+ components)
82+
- **react-simple-maps** — Interactive SVG world map on Explore page
83+
- **Recharts** — Data visualization charts
84+
- **TanStack React Query** — Data fetching/caching
85+
- **React Router DOM** — Client-side routing
86+
- **Lucide React** — Icons
87+
88+
### Data Pipeline
89+
- **Python 3** — Scripting language
90+
- **Anthropic SDK** — Claude API for PDF extraction
91+
- **PostgreSQL** + **psycopg2** — Database
92+
- **Watchdog** — File system watcher for inbox processing
93+
- **python-dotenv** — Environment variable management
94+
95+
### Deployment
96+
- **GitHub Pages** — Static hosting
97+
- **GitHub Actions** — CI/CD (auto-deploys on push to main)
98+
- **Custom domain** — jury.privacydev.org
99+
100+
---
101+
102+
## Project Structure
103+
104+
```
105+
PrivacyGallery/
106+
├── src/
107+
│ ├── pages/ # Page components
108+
│ │ ├── Index.tsx # Home — case gallery with search/filter/sort
109+
│ │ ├── CaseDetail.tsx # Individual case breakdown
110+
│ │ ├── Compare.tsx # Matrix & side-by-side case comparison
111+
│ │ ├── Explore.tsx # Interactive jurisdiction map
112+
│ │ ├── Leaderboard.tsx # Rankings and statistics
113+
│ │ ├── Learn.tsx # Educational glossary
114+
│ │ ├── About.tsx # Project information
115+
│ │ └── NotFound.tsx # 404 page
116+
│ ├── components/
117+
│ │ ├── TopNav.tsx # Yellow navigation bar
118+
│ │ ├── CaseCard.tsx # Case summary card with red fine stamp
119+
│ │ ├── ControlBar.tsx # Filter/sort controls
120+
│ │ ├── SearchBar.tsx # Search input
121+
│ │ ├── JurisdictionMap.tsx # SVG world map (react-simple-maps)
122+
│ │ ├── JurisdictionDetail.tsx # Jurisdiction stats and info panel
123+
│ │ ├── JurisdictionLogos.tsx # Jurisdiction logo/icon display
124+
│ │ ├── ScrollToTop.tsx # Scroll reset on route change
125+
│ │ └── ui/ # 50+ shadcn/ui components
126+
│ ├── data/
127+
│ │ ├── cases.ts # Type definitions, utilities, case loading
128+
│ │ ├── generatedCases.json # 771 cases exported from PostgreSQL
129+
│ │ ├── jurisdictionInfo.ts # Jurisdiction metadata (laws, authorities)
130+
│ │ └── glossary.ts # Learn page glossary content
131+
│ ├── hooks/ # Custom React hooks
132+
│ ├── lib/utils.ts # Tailwind class merge utility
133+
│ ├── main.tsx # App entry point
134+
│ └── index.css # Global styles and CSS variables
135+
├── files/ # Data pipeline (Python)
136+
│ ├── agent.py # PDF → Claude API → PostgreSQL
137+
│ ├── export_to_frontend.py # PostgreSQL → generatedCases.json
138+
│ ├── fill_case_source_url.py # Enrich cases with source URLs
139+
│ ├── fill_company_worth.py # Enrich cases with company valuations
140+
│ ├── revise_what_why.py # Refine case descriptions via Claude
141+
│ ├── reset_and_run.py # Reset DB and reprocess all PDFs
142+
│ ├── run_subset.py # Process a subset for testing
143+
│ ├── queries.sql # Example SQL queries
144+
│ ├── requirements.txt # Python dependencies
145+
│ ├── .env.example # Environment variable template
146+
│ └── inbox/ # PDF drop folders by jurisdiction
147+
│ ├── Australia - OAIC/
148+
│ ├── EU/GDPR/
149+
│ ├── Singapore - PDPC/
150+
│ ├── UK - ICO/
151+
│ └── US FTC/
152+
├── scripts/
153+
│ └── ingest-drive.mjs # Google Drive → generatedCases.json
154+
├── public/
155+
│ ├── logos/ # Jurisdiction logos
156+
│ ├── CNAME # Custom domain config
157+
│ └── favicon.svg
158+
├── .github/workflows/
159+
│ └── deploy.yml # GitHub Actions → GitHub Pages
160+
├── package.json
161+
├── vite.config.ts
162+
├── tailwind.config.ts
163+
└── tsconfig.json
164+
```
165+
166+
---
13167

14168
## Getting Started
15169

16-
Requires Node.js & npm — [install with nvm](https://github.com/nvm-sh/nvm#installing-and-updating)
170+
### Frontend
171+
172+
Requires Node.js 20+ — [install with nvm](https://github.com/nvm-sh/nvm#installing-and-updating)
17173

18174
```sh
19175
git clone https://github.com/AISmithLab/PrivacyGallery.git
@@ -22,6 +178,83 @@ npm install
22178
npm run dev
23179
```
24180

181+
The site runs at `http://localhost:8080`. Case data is already included in `generatedCases.json`.
182+
183+
### Data Pipeline (optional)
184+
185+
Only needed if you want to process new PDFs or rebuild the dataset.
186+
187+
```sh
188+
cd files
189+
pip install -r requirements.txt
190+
cp .env.example .env
191+
# Edit .env with your Anthropic API key and PostgreSQL connection string
192+
```
193+
194+
**Process PDFs:**
195+
```sh
196+
python agent.py # Watches inbox/ for new PDFs
197+
```
198+
199+
**Export to frontend:**
200+
```sh
201+
python export_to_frontend.py # Writes src/data/generatedCases.json
202+
```
203+
204+
### Environment Variables
205+
206+
| Variable | Description |
207+
|----------|-------------|
208+
| `ANTHROPIC_API_KEY` | Claude API key for PDF extraction |
209+
| `DATABASE_URL` | PostgreSQL connection string |
210+
| `MAX_PDFS` | Limit PDFs processed per run (default: 10) |
211+
| `WATCH_DIR` | Custom inbox directory path |
212+
| `DONE_DIR` | Custom processed directory path |
213+
| `ERROR_DIR` | Custom error directory path |
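
One way a script could read these variables is sketched below. It is an illustration only: `PipelineConfig` and `load_config` are hypothetical names, treating the API key and database URL as required is an assumption, and the `done` and `error` directory defaults are placeholders. Only the `MAX_PDFS` default of 10 and the `inbox` folder come from this README.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    anthropic_api_key: str
    database_url: str
    max_pdfs: int
    watch_dir: Path
    done_dir: Path
    error_dir: Path


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    """Build a config object from environment variables."""
    # Assumption: these two variables must be set for the pipeline to run.
    required = ("ANTHROPIC_API_KEY", "DATABASE_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    return PipelineConfig(
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        database_url=env["DATABASE_URL"],
        max_pdfs=int(env.get("MAX_PDFS", "10")),         # default stated in the table
        watch_dir=Path(env.get("WATCH_DIR", "inbox")),   # inbox/ per this README
        done_dir=Path(env.get("DONE_DIR", "done")),      # hypothetical default
        error_dir=Path(env.get("ERROR_DIR", "error")),   # hypothetical default
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.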
214+
215+
---
216+
25217
## Deployment
26218

27-
The site auto-deploys to GitHub Pages on every push to `main` via GitHub Actions.
219+
The site auto-deploys to GitHub Pages on every push to `main` via GitHub Actions. The workflow:
220+
221+
1. Checks out code
222+
2. Installs Node.js 20 and dependencies
223+
3. Builds with `npm run build`
224+
4. Copies `index.html` to `404.html` for SPA routing
225+
5. Deploys to GitHub Pages
226+
227+
Custom domain configured via `public/CNAME` → `jury.privacydev.org`
228+
229+
---
230+
231+
## Key Data Types
232+
233+
```typescript
234+
interface EnforcementCase {
235+
id: string;
236+
company: string;
237+
sector: Sector; // Technology, Healthcare, Finance, etc.
238+
jurisdiction: Jurisdiction; // US FTC, UK ICO, EU GDPR, etc.
239+
year: number;
240+
fineAmount: number;
241+
fineDisplay: string;
242+
violations: ViolationType[]; // Misrepresentation, Failure to disclose, etc.
243+
severityForIndividuals: number; // 1-5 calculated score
244+
impactedIndividuals: string;
245+
whatTheyDid: string; // Plain-language summary
246+
whyTheyWereWrong: string; // Why it matters
247+
claimsVsReality: ClaimVsReality[]; // What they said vs what they did
248+
regulatoryFindings: RegulatoryFinding[];
249+
outcome: string;
250+
outcomeSummary: string; // Complaint Filed, Consent Order, etc.
251+
attachedPDFs: AttachedPDF[]; // Links to source documents
252+
// ... 40+ additional fields
253+
}
254+
```
255+
256+
---
257+
258+
## License
259+
260+
This project is maintained by [AISmithLab](https://github.com/AISmithLab).
