Accuracy through Consistency, Extractor Tests, and a Raw, Extracted Data Repo? #102
Replies: 1 comment
To your point about using hashes as a test for extractor accuracy: my concern with raw file hashing is that it's entirely binary. If a single character or comma changes anywhere in a million records, the hash breaks. It tells us something is different, but gives us zero clues about what or where the discrepancy is. Instead of file hashing, my thought for the analytics side was to generate high-level data statistics with each commit or semester release. We could track macro metrics like:
We can then use these stats as a smoke test to prevent things from changing too drastically. If an extractor change suddenly drops the number of Math courses by 20%, or the global grade count spikes unexpectedly, the tool flags it immediately. Combined with the Git-backed CSV repository, this gives us the best of both worlds. The stats flag when a systemic anomaly occurs, and the Git diff allows us to easily dive in and see the exact rows causing it. Plus, as you noted, the repo doubles as a clean public dataset. To make the Git diffs usable, we will definitely need to implement your idea of strict, consistent formatting so the rows are deterministically sorted and the history stays clean.
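To make the smoke-test idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the row keys (`subject`, `course_number`, `grade_count`), the function names, and the 20% default threshold are hypothetical, not the extractor's actual schema or tooling.

```python
from collections import Counter


def summarize(rows):
    """Compute coarse statistics from extracted rows.

    `rows` is an iterable of dicts; the keys used here are hypothetical
    placeholders for whatever the extractor really emits.
    """
    courses_per_subject = Counter()
    seen_courses = set()
    total_grades = 0
    for row in rows:
        course = (row["subject"], row["course_number"])
        if course not in seen_courses:
            # Count each distinct course once per subject.
            seen_courses.add(course)
            courses_per_subject[row["subject"]] += 1
        total_grades += int(row["grade_count"])
    stats = {f"courses:{s}": n for s, n in courses_per_subject.items()}
    stats["total_grades"] = total_grades
    return stats


def flag_anomalies(previous, current, threshold=0.20):
    """Return (metric, old, new) for metrics whose relative change
    between two releases meets or exceeds `threshold`."""
    flagged = []
    for name in sorted(set(previous) | set(current)):
        old = previous.get(name, 0)
        new = current.get(name, 0)
        if old == 0:
            # A metric that appears out of nowhere is always worth a look.
            if new != 0:
                flagged.append((name, old, new))
            continue
        if abs(new - old) / old >= threshold:
            flagged.append((name, old, new))
    return flagged
```

The stats dict from each release could be committed alongside the CSVs, so a flagged metric points straight at the Git diff to inspect.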
A measure of the accuracy of the data used in Madgrades is of course valuable, but given the sheer number of records (currently over a million), manually verifying all of it is unrealistic. We could, at least, check consistency by manually exploring the discrepancies between the extractor's results over time. This would let us verify the accuracy of the extractor and even of the conflicting records themselves.
My initial idea was to format and wrangle the extracted data into a strict, defined, and consistent format and hash it, giving us a notion of "correct" or not.
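As a minimal sketch of what "strict format, then hash" might look like in Python: rows are normalized, sorted deterministically, written as CSV with fixed line endings, and the result is hashed with SHA-256. The function names and the column names in the usage note are hypothetical, and the normalization (whitespace stripping only) is an assumption; the real rules would need to be agreed on.

```python
import csv
import hashlib
import io


def canonical_csv(rows, columns):
    """Serialize rows (dicts) to a deterministic CSV string.

    `columns` fixes the column order; the names are up to the caller.
    """
    # Normalize every cell to a stripped string so that incidental
    # whitespace differences do not change the output.
    normalized = [[str(row[c]).strip() for c in columns] for row in rows]
    # Sort rows so the output does not depend on extraction order.
    normalized.sort()
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(normalized)
    return buf.getvalue()


def dataset_hash(rows, columns):
    """SHA-256 hex digest of the canonical CSV form."""
    text = canonical_csv(rows, columns)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this, two extractor runs that produce the same records in a different order hash identically, while any changed value produces a different digest. The same `canonical_csv` output could also be what gets committed to a data repository, so diffs stay row-level.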
Keenan's idea:
I imagine this could kill two birds with one stone and serve as a data repository, updating with time. Depending on the format, it could also be extremely useful for anyone wanting to use the data for their own purposes. Maybe I am misinterpreting your description though, Keenan?
Maybe the hashes would be more useful as a test for extractor accuracy and a full data repo would be more effective for exploring data accuracy.