Accuracy through Consistency, Extractor Tests, and a Raw, Extracted Data Repo? #102

CadenKruckeberg · 2026-06-23T16:56:47Z

Having a metric for the accuracy of the data used in Madgrades is of course valuable, but given the sheer number of records (currently over a million), manually verifying all of it is unrealistic. We could at least check consistency by manually exploring the discrepancies between the extractor's results over time. That would let us verify the accuracy of the extractor and even of the conflicting records themselves.

My initial idea was to format and wrangle the extracted data into a strict, well-defined, consistent format and then hash it, giving us a simple notion of whether an output is "correct" (it matches the expected hash) or not.
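To make the hashing idea concrete, here is a minimal sketch, not the actual extractor code. The field names in `FIELDS` are hypothetical placeholders for whatever schema the extractor really emits, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import json

# Hypothetical field names; the real extractor schema may differ.
FIELDS = ("term", "subject", "course_number", "section", "grade", "count")


def canonical_record(record: dict) -> str:
    """Serialize one record with a fixed set of fields and normalized values."""
    normalized = {f: str(record.get(f, "")).strip() for f in FIELDS}
    # sort_keys plus fixed separators keep the JSON text byte-for-byte stable.
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"))


def dataset_hash(records) -> str:
    """Hash all records so that input order and stray whitespace do not matter."""
    lines = sorted(canonical_record(r) for r in records)
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Two extractor runs that produce the same records in a different order would hash identically here, while any change to a value would change the hash.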

Keenan's idea:

Diff/Analytics Engine: A job or method to perform a "diff" across different module versions or new PDF file additions. This could be done simply by checking CSV files into GitHub and viewing the commit diffs. Generally, this is critical for understanding the impact of changes or new semester releases (e.g., if you modify a few lines of the extractor, you can see exactly which lines in the output changed, complete with analytics like "X classes removed/added").
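As a rough illustration of the "X classes removed/added" analytics, here is a small sketch that compares two CSV outputs as sets of rows. The function names and the assumption that each file starts with a header row are mine, not part of the extractor. Because it uses sets, exact duplicate rows within one file are counted once.

```python
import csv
import io


def load_rows(csv_text: str) -> set:
    """Parse CSV text into a set of row tuples, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumed header row
    return {tuple(row) for row in reader}


def diff_summary(old_csv: str, new_csv: str) -> dict:
    """Report which rows were added or removed between two extractor outputs."""
    old_rows, new_rows = load_rows(old_csv), load_rows(new_csv)
    return {
        "added": sorted(new_rows - old_rows),
        "removed": sorted(old_rows - new_rows),
        "unchanged": len(old_rows & new_rows),
    }
```

A changed row shows up as one removal plus one addition, which is also how a line-based Git diff would present it.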

I imagine this could kill two birds with one stone and also serve as a data repository that is updated over time. Depending on the format, it could be extremely useful for anyone wanting to use the data for their own purposes. Maybe I am misinterpreting your description though, Keenan?

Maybe the hashes would be more useful as a test of extractor accuracy, while a full data repo would be more effective for exploring data accuracy.

thekeenant (Maintainer) · 2026-07-01T01:11:45Z

To your point about using hashes as a test for extractor accuracy: my concern with raw file hashing is that it's entirely binary. If a single character or comma changes anywhere in a million records, the hash breaks. It tells us something is different, but gives us zero clues about what or where the discrepancy is.

Instead of file hashing, my thought for the analytics side was to generate high-level data statistics with each commit or semester release. We could track macro metrics like:

Total number of grades/records per year
Total number of subjects (overall and per year)
Number of course offerings per subject per semester

We can then use these stats as a smoke test to prevent things from changing too drastically. If an extractor change suddenly drops the number of Math courses by 20%, or the global grade count spikes unexpectedly, the tool flags it immediately.
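A minimal sketch of such a smoke test, assuming records are dicts with a hypothetical `subject` key and using the 20% figure above as the default threshold. The helper names are illustrative only.

```python
from collections import Counter


def count_by(records, key):
    """Count records per value of `key`, e.g. rows per subject or per year."""
    return Counter(r[key] for r in records)


def flag_drastic_changes(old_counts, new_counts, threshold=0.20):
    """Return (key, old, new) for every count that moved by at least `threshold`."""
    flags = []
    for key in sorted(set(old_counts) | set(new_counts)):
        old, new = old_counts.get(key, 0), new_counts.get(key, 0)
        if old == 0:
            # A key that did not exist before is always worth a look.
            flags.append((key, old, new))
            continue
        if abs(new - old) / old >= threshold:
            flags.append((key, old, new))
    return flags
```

The same two functions would cover the other metrics by changing the key, for example counting per year or per (subject, semester) pair. A CI job could fail or post a warning whenever the returned list is non-empty.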

Combined with the Git-backed CSV repository, this gives us the best of both worlds. The stats flag when a systemic anomaly occurs, and the Git diff allows us to easily dive in and see the exact rows causing it. Plus, as you noted, the repo doubles as a clean public dataset.

To make the Git diffs usable, we will definitely need to implement your idea of strict, consistent formatting so the rows are deterministically sorted and the history stays clean.
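One way the deterministic output could look, as a sketch under assumptions: the column list is hypothetical, every value is written as a trimmed string, and rows are sorted lexicographically on all columns so the same data always yields the same file.

```python
import csv
import io

# Hypothetical column order; the real schema would be fixed by the project.
COLUMNS = ["term", "subject", "course_number", "section", "grade", "count"]


def write_deterministic_csv(records) -> str:
    """Render records as CSV text with fixed columns, sorted rows, and LF endings."""
    rows = [[str(r.get(c, "")).strip() for c in COLUMNS] for r in records]
    rows.sort()  # full-row lexicographic sort, independent of input order
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()
```

With a fixed line terminator and sort order, a commit diff should only show rows whose content actually changed, not rows that merely moved.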
