🚀 A complete, professionally curated dataset for Unicode 17.0 and Emoji 17.0
📖 Overview
This repository provides a complete, professionally processed dataset for Unicode 17.0 and Emoji 17.0, derived directly from the official UCD and Unihan sources. It includes every allocated character, all control codes, the 66 noncharacters, the surrogate and private-use range boundaries, and the full Emoji 17.0 data in all four statuses (fully-qualified, minimally-qualified, unqualified, component). It is intended for font developers, linguists, text-processing researchers, and anyone who needs a comprehensive Unicode reference.
✨ Core Features
| Feature | Description |
|---|---|
| ✅ Complete Coverage | All allocated characters, control characters, the 66 noncharacters, and the surrogate and private-use range boundaries |
| ✅ Latest Version | Based on Unicode 17.0 (September 2025) |
| ✅ Multiple Formats | Machine-readable TSV, human-readable text, a single-line continuous string, and a block-separated version |
| ✅ Exact Classification | Every character labeled with its official block name (Chinese and English) |
| ✅ Full Emoji Coverage | All statuses: fully-qualified, minimally-qualified, unqualified, component |
| ✅ Unmodified Original | Control characters preserved as-is, with no filtering or modification |
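The figure of 66 noncharacters follows from the Unicode Standard itself: 32 code points in U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes. The sketch below only illustrates that arithmetic; the helper name `is_noncharacter` is hypothetical and is not part of this repository.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 32 + 2 * 17 = 66
```

A check like this could be used to confirm that a data file lists each noncharacter exactly once.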
📁 Repository Contents
unicode-17.0-txt-tsv-complete-dataset-with-emoji/
├── 📂 Emoji-17.0-Complete-Dataset/
│ ├── Introduction.txt
│ ├── emoji_17.0_full_human.txt
│ ├── emoji_17.0_full_machine.tsv
│ ├── emoji_17.0_single_human.txt
│ ├── emoji_17.0_single_machine.tsv
│ ├── emoji_17.0_string.txt
│ └── emoji_17.0_string_annotated.txt
├── 📂 Screenshots/
│ ├── emoji_17.0_full_human.jpg
│ ├── emoji_17.0_string.jpg
│ ├── unicode_17.0_human.jpg
│ └── unicode_17.0_string_blocked.jpg
├── 📂 Unicode-17.0-Complete-Dataset/
│ ├── Introduction.txt
│ ├── unicode_17.0_blocks.txt
│ ├── unicode_17.0_human.txt
│ ├── unicode_17.0_machine.tsv
│ ├── unicode_17.0_string.txt
│ └── unicode_17.0_string_blocked.txt
├── LICENSE
├── LICENSE_UNICODE
├── NOTICE.txt
├── README.md
└── checksum_sha256.txt
🚀 Quick Start
Clone the repository or download the latest release:
```shell
# Clone with HTTPS
git clone https://github.com/SeekDeeply/unicode-17.0-txt-tsv-complete-dataset-with-emoji.git
```
Or download the latest release from:
https://github.com/SeekDeeply/unicode-17.0-txt-tsv-complete-dataset-with-emoji/releases

After obtaining the files, you can explore the data with any text editor or programming language.
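As one way to start exploring, the sketch below previews the first rows of a machine-readable TSV file in Python. It assumes only that the file is UTF-8 and tab-separated; the column layout is not described here, so rows are returned as plain lists, and the helper name `preview_tsv` is hypothetical.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_tsv(path, limit=5):
    """Return the first `limit` rows of a tab-separated file as lists of fields."""
    # newline="" lets the csv module handle line endings itself;
    # QUOTE_NONE treats quote characters as ordinary data.
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, limit))


if __name__ == "__main__":
    # Path taken from the repository tree; adjust to where the files were cloned.
    tsv_path = Path("Unicode-17.0-Complete-Dataset") / "unicode_17.0_machine.tsv"
    if tsv_path.exists():
        for row in preview_tsv(tsv_path):
            print(row)
```

Because control characters are preserved as-is, a raw tab or line break stored as character data could split a row unexpectedly with this naive reader; inspect such rows before relying on the column count.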
📊 Unicode 17.0 Dataset
Located in /Unicode-17.0-Complete-Dataset/
Versions for Different Needs
| File | Format | Size | Lines | Best For |
|---|---|---|---|---|
| unicode_17.0_machine.tsv | TSV | 5.2 MB | 142,610 | Program import / Database |
| unicode_17.0_human.txt | Text | 10 MB | ~570,000 | Reading / Reference / Sharing |
| unicode_17.0_string.txt | Raw | 512 KB | 1 | String processing / Testing |
| unicode_17.0_string_blocked.txt | Text | ~512 KB | ~330 | Block-separated exploration |
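For the string-processing use case, the sketch below tallies the code points of a text by Unicode general category using Python's standard `unicodedata` module. The helper names `load_text` and `category_counts` are hypothetical, and the `surrogatepass` handling is an assumption about how surrogate code points might be stored, not a documented property of the files.

```python
import unicodedata
from collections import Counter
from pathlib import Path


def load_text(path):
    """Read a UTF-8 text file, tolerating encoded surrogate code points."""
    # Assumption: if the file carries surrogates, they are stored as raw
    # three-byte sequences; "surrogatepass" decodes those instead of failing.
    return Path(path).read_bytes().decode("utf-8", errors="surrogatepass")


def category_counts(text):
    """Count the code points in `text` by Unicode general category."""
    return Counter(unicodedata.category(ch) for ch in text)


if __name__ == "__main__":
    # The interpreter's Unicode tables may be older than 17.0.
    print("unicodedata version:", unicodedata.unidata_version)
    path = Path("Unicode-17.0-Complete-Dataset") / "unicode_17.0_string.txt"
    if path.exists():
        for category, count in category_counts(load_text(path)).most_common(10):
            print(category, count)
```

If the interpreter's `unicodedata.unidata_version` is lower than 17.0, characters added in later versions are reported as `Cn` (unassigned), so the tallies reflect the interpreter's tables rather than the dataset's.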
📸 Screenshot Preview
Screenshots (in Screenshots/): human-readable version (unicode_17.0_human.jpg) and block-separated version (unicode_17.0_string_blocked.jpg).

😊 Emoji 17.0 Dataset
Located in /Emoji-17.0-Complete-Dataset/
File Family
| File | Size | Lines | Purpose |
|---|---|---|---|
| emoji_17.0_full_machine.tsv | 428 KB | 5,228 | Full dataset (machine) |
| emoji_17.0_full_human.txt | 621 KB | 26,358 | Full dataset (human) |
| emoji_17.0_single_machine.tsv | 61 KB | 1,400 | Single-codepoint (machine) |
| emoji_17.0_single_human.txt | 111 KB | ~5,600 | Single-codepoint (human) |
| emoji_17.0_string.txt | 5.4 KB | 1 | Single-line string |
| emoji_17.0_string_annotated.txt | 5.6 KB | ~20 | Annotated string |
Status Breakdown
| Status | Count | Description |
|---|---|---|
| ✅ fully-qualified | 3,953 | Official RGI Emoji |
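A status count such as the one above could be cross-checked against the Unicode Consortium's emoji-test.txt, whose data lines have the form `code points ; status # emoji Eversion name`. The sketch below parses that official line format, not this repository's TSV (whose column layout is not described here); the function name `count_statuses` is hypothetical.

```python
from collections import Counter


def count_statuses(lines):
    """Tally the status field of emoji-test.txt style data lines."""
    counts = Counter()
    for line in lines:
        # Drop the trailing "# ..." comment; blank and comment-only lines vanish
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        # The remaining text looks like "1F600 ; fully-qualified"
        _codepoints, _, status = data.partition(";")
        status = status.strip()
        if status:
            counts[status] += 1
    return counts


sample = [
    "# group: Smileys & Emotion",
    "1F600      ; fully-qualified     # 😀 E1.0 grinning face",
    "263A FE0F  ; fully-qualified     # ☺️ E0.6 smiling face",
    "263A       ; unqualified         # ☺ E0.6 smiling face",
    "1F3FB      ; component           # 🏻 E1.0 light skin tone",
    "",
]
print(count_statuses(sample))
```

Passing an open file handle for a downloaded emoji-test.txt instead of `sample` would give per-status totals to compare with the table.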
📸 Screenshot Preview
Screenshots (in Screenshots/): full human-readable version (emoji_17.0_full_human.jpg) and single-line string version (emoji_17.0_string.jpg).

🔧 Data Integrity
SHA-256 checksums for all files are provided in checksum_sha256.txt. Run the commands below from the directory that contains that file:
```shell
# Linux / macOS
sha256sum -c checksum_sha256.txt
# macOS without sha256sum installed
shasum -a 256 -c checksum_sha256.txt
```
```powershell
# Windows (PowerShell)
Get-Content checksum_sha256.txt | Where-Object { $_.Trim() } | ForEach-Object {
    # Each line is "<hash>  <path>"; split on the first run of whitespace
    $hash, $file = $_ -split '\s+', 2
    if ((Get-FileHash $file -Algorithm SHA256).Hash -eq $hash.ToUpper()) {
        Write-Host "$file OK" -ForegroundColor Green
    } else {
        Write-Host "$file FAILED" -ForegroundColor Red
    }
}
```
📜 License
This project is licensed under the Apache License 2.0; see the LICENSE file for details. The raw data derived from the Unicode Character Database is licensed under the Unicode License v3; see the LICENSE_UNICODE file for details. For the complete terms of use of Unicode data, refer to the Unicode Terms of Use.
🙏 Acknowledgments
- The Unicode Consortium, for the official UCD and Unihan data.
- All contributors who helped refine and verify this dataset.
- You, for your interest in high-quality Unicode data!
📬 Contact & Contribution
- GitHub: SeekDeeply
- Email: i9888888885@163.com
- Issues: Open an issue
- Discussions: GitHub Discussions
⭐ If you find this dataset valuable, please give it a star; it helps others discover it!