This repository contains the open-source methodology and code I created and used to construct a longitudinal crosswalk (also included) between two federal education datasets:
- Integrated Postsecondary Education Data System (IPEDS)
- Title II (Sections 205–208) of the Higher Education Act (HEA)
IPEDS and Title II data systems capture different yet complementary characteristics of teacher preparation programs in the U.S. and U.S. Territories. Each data source uses different identifiers and do not directly speak to one another. I developed a transparent and reproducible crosswalk between these sources, allowing for more detailed analysis by institutional and geographic context of where teachers are trained, in what subjects. Below is a description of how I compiled, cleaned, harmonized and merged the data. Also included is guidance to facilitate reuse or adaptation of our methodology.
End-to-End Pipeline:
Title II (2012–2024) → Schema Harmonization → Crosswalk Matching (Exact + Fuzzy + Manual Validation) → Merge with IPEDS → Institution Lookup + Value Labels → Final Analytical Dataset
To reproduce the full dataset or adapt this methodology:
-
Run
notebooks/titleII_schema_harmonization.ipynb
→ Creates the longitudinal Title II dataset (2012–2024) -
Run
notebooks/titleII_ipeds_crosswalk.ipynb
→ Generates the initial Title II–IPEDS crosswalk (requires manual validation)
→ A finalized, validated crosswalk is provided at:
data/outputs/_FINAL_TitleII_with_IPEDS_Matched_UnitID_and_Fuzzy_Details.xlsx -
Run
notebooks/merge_all_years_with_crosswalk.ipynb
→ Produces the final merged dataset with institutional identifiers
→ This step uses the provided validated crosswalk:
_FINAL_TitleII_with_IPEDS_Matched_UnitID_and_Fuzzy_Details.xlsx -
(Optional) Enrich with institutional attributes
→ Use the institution lookup files in/data/outputs/to add geographic and institutional characteristics (Read:ipeds_institution_lookup.md)
All notebooks are designed to run in Google Colab or a local Python environment.
Here are more details about each data source used:
Data Used: IPEDS, [2023-24]*, General Information; Extract Date: July 2025
IPEDS is the U.S. Department of Education’s primary source of data on postsecondary institutions that participate in federal student aid programs. The IPEDS data used in this project include:
- Enrollment and completion counts by program of study
- Institutional characteristics (public/private control, sector, Land Grant, HBCU, or Tribal College status)
- Geographic identifiers including ZIP codes and county-level FIPS codes
These data provide essential institutional and geographic context but do not distinguish teacher preparation subfields or licensure outcomes.
*Note: At the time of this analysis, 33 institutions reported in Title II had closed; 26 of those institutions closed prior to the 2023-24 IPEDS report year. In those cases, I used the institution's final report data spanning 2011 - 2022 IPEDS General Information reports.
Data Used: Report Years 2012 - 2024. Extract Date: July 2025*
Title II reports provide information teacher prep programs that lead to certification or licensure, and consist of
- Data from 50 states, 1 federal district (Washington D.C), and 7 territories (American Somoa, Federal States of Micronesia, Guam, Marshall Islands, Northern Mariana Islands, Puerto Rico, and U.S. Virgin Islands).
- Program-level information by subject area (for example, early childhood, secondary math, special education)
- Distinctions between traditional and alternative route programs
- Characteristics on the demographic composition of candidates and completers
- Enrollment and completer counts
Title II contains limited information at the institutional level or any geographic identifier below the state level. Also, it does not contain a stable institutional identifier, such as an IPEDS Unit ID, that would facilitate direct merging.
*For this project, I accessed and downloaded unsuppressed Title II data prior to its removal from public access.
- The one exception is Massachusetts' data from Report Year 2021 (Academic Year 2019 - 2020), which was missing all but one university's data. I accessed and added to our data set the missing MA universities from the suppressed Title II report year 2021.
The crosswalk methodology is designed to pull directly from Title II files as they are currently released, even though some program-level details are now suppressed in newer versions. This ensures the code remains reproducible and adaptable while preserving the richer historical record captured in earlier downloads.
Our first challenge lay in the drastic change of the Title II reporting schema beginning with the 2019–20 reporting year. For purposes of sustaining consistency within a time series, all historical Title II data have been harmonized to conform to the 2024 Title II schema. This included:
- Renaming prior report year variables to match 2024 conventions
- Adding placeholder fields where variables did not exist historically
- Reconciling changes in how enrollment and completers were reported across years
The Appendix_TitleII_Column_Harmonization.xlsx summarizes the Title II variables from earlier reporting years and how they were mapped to their 2024 equivalents.
Because IPEDS and Title II do not share a common institutional identifier, the crosswalk is built using text matches of institution names, supported by systematic cleaning and fuzzy matching, as well as manual matching.
The following techniques were applied to standardize institution names in both datasets:
- Convert all texts to lowercase
- Remove punctuation, special characters, and extra spaces or any common format differences
- Expand common abbreviations (e.g., “st” → “saint”)
- Normalize spacing and hyphenation
Exact matches are attempted using:
- State
- Program type
- Standardized program name
When exact matches fail, fuzzy matching is applied using token-based similarity scoring:
- Similarity threshold: 80%
- Title II program names may match against either the IPEDS institution name or alias fields
- All fuzzy matches and unresolved cases are manually reviewed
- Helper fields document match source and confidence
- Program types expected not to have IPEDS UnitIDs (e.g., Alternative, non-IHE-based) are explicitly classified
- Programs expected not to map to IPEDS are retained and labeled accordingly
- Remaining unmatched records are isolated for transparency and further review
Across the final merged dataset (28,168 Title II program-year records merged to the IPEDS crosswalk of 3,340 institution–program combinations):
- 25,048 records successfully match to an IPEDS UnitID, with unmatched cases largely attributable to non-IHE programs or unresolved name variations
- A total of 2,705 records correspond to alternative, non-IHE-based programs, (or mislabeled program types) for which an IPEDS Unit ID is not expected and therefore intentionally left blank.
- The remaining 305 records (approximately 1.1% of all Title II rows) represent IHE-expected programs that could not be automatically matched due to persistent naming variation across years and reporting formats.
├── notebooks/
│ ├── titleII_schema_harmonization.ipynb
│ ├── optional_titleii_column_schema_audit.ipynb
│ ├── titleII_ipeds_crosswalk.ipynb
│ ├── merge_all_years_with_crosswalk.ipynb
│ └── ipeds_value_labeling.ipynb # Adds paired label columns to institution lookup
│
├── data/
│ ├── outputs/
│ ├── _FINAL_TitleII_with_IPEDS_Matched_UnitID_and_Fuzzy_Details.xlsx # Final validated Title II + IPEDS Crosswalk
│ ├── TitleII_IPEDS_Institution_Lookup.xlsx
│ ├── TitleII_IPEDS_Institution_Lookup_With_Labels.xlsx
│ └── IPEDS_Value_Labels_Lookup.xlsx
├── docs/
│ ├── ipeds_institution_lookup.md # How to use lookup + labeling files
│ ├── crosswalk_methodology.docx
│ ├── column_mapping_tables.md
│ ├── TitleII_2012_2024_Schema_Harmonization.xlsx
│ ├── TitleII_IPEDS_Data_Dictionary.xlsx
│ ├── 2013_TitleII_Data_Documentation.pdf
│ ├── 2018_TitleII_Data_Documentation.pdf
│ └── 2020_TitleII_Data_Documentation.pdf
│
├── CITATION.md
├── LICENSE
└── README.md
Note: Raw federal data files are not redistributed in this repository. All notebooks are designed to pull directly from official public URLs where possible.
Key outputs produced by this repository include:
- Longitudinal Title II dataset (2012–2024), harmonized to 2024 schema (Run:
titleII_schema_harmonization.ipynb) - Validated Title II–IPEDS crosswalk (initial output requires final manual validation; Run:
titleII_ipeds_crosswalk.ipynb) - Merged Title II + IPEDS dataset with institutional identifiers (Run:
merge_all_years_with_crosswalk.ipynb) - Audit tables identifying unmatched or non-IHE programs (Run:
merge_all_years_with_crosswalk.ipynb) - (Optional) Add institutional and geographic characteristics to the merged dataset (Read:
ipeds_institution_lookup.md; Run:ipeds_value_labeling.ipynb)
_FINAL_TitleII_with_IPEDS_Matched_UnitID_and_Fuzzy_Details.xlsx
This file contains the validated mapping between Title II programs and IPEDS institutions.
It includes:
- Matched IPEDS UnitIDs
- Match methodology indicators (exact, fuzzy, manual)
- Flags for expected non-matches
These files allow researchers to enrich the merged dataset with institutional and geographic characteristics:
-
TitleII_IPEDS_Institution_Lookup.xlsx
→ Base institutional reference table (coded values) -
TitleII_IPEDS_Institution_Lookup_With_Labels.xlsx
→ Includes human-readable label columns for IPEDS variables -
IPEDS_Value_Labels_Lookup.xlsx
→ Mapping between coded IPEDS values and labels
docs/TitleII_IPEDS_Data_Dictionary.xlsx
- Provides definitions and descriptions for variables included in the crosswalk and institution lookup datasets.
2013_TitleII_Data_Documentation.pdf, 2018_TitleII_Data_Documentation.pdf, 2020_TitleII_Data_Documentation.pdf
- Provides variable definitions for each tab of the Title II annual reports.
- We used these dictionaries to identify changes in variables across report years (e.g. naming conventions, additions, deletions, etc.)
docs/TitleII_2012_2024_Schema_Harmonization.xlsx
Documents how pre-2020 Title II variables were mapped to the 2024 schema.
To conduct analysis using this repository:
-
Use the merged dataset (from
merge_all_years_with_crosswalk.ipynb) as your base. -
Join institutional characteristics using the lookup file:
df = df.merge(df_lookup, on="UnitID", how="left")- Use _Label columns for interpretation and reporting (e.g., sector, control).
- Use ZIP and FIPS fields to merge with external datasets such as Census or labor market data.
This workflow enables analysis of teacher preparation programs by:
- Institution type
- Geography
- Program characteristics
- Demographic composition
- Title II does not include a stable institutional identifier, requiring text-based matching.
- Some institutions appear under multiple name variations across years.
- Approximately 305 records (~1.1%) remain unmatched due to persistent naming inconsistencies and had to be manually matched.
- Alternative, non-IHE-based programs (2,705 records) are not expected to map to IPEDS institutions.
- IPEDS does not distinguish teacher preparation subfields or licensure outcomes.
- Title II data after recent releases include suppressed fields, limiting comparability with earlier years.
Users should interpret unmatched records and crosswalk results accordingly.
All notebooks are designed to run in Google Colab or a local Python environment. Where possible, intermediate outputs are cached to reduce repeated downloads and ensure consistent results.
This project is released under the MIT License. You are free to use, adapt, modify, and redistribute the code and derived outputs with attribution. See the LICENSE file for details.
Quinn, S. (2026). Title II–IPEDS Crosswalk & Longitudinal Teacher Preparation Dataset (Version v2). [Online] Western Governors University - Craft Education. Zenodo. Available at https://github.com/stephanieandersonquinn-lab/title-ii-ipeds-crosswalk. DOI: https://doi.org/10.5281/zenodo.18676589