Awesome Parquet

Useful resources for using the Parquet format

Libraries

C GLib

Arrow GLib - A wrapper library for Arrow C++.
DuckDB - An in-process database library that supports reading and writing Parquet files.

C++

Apache Arrow C++ - A library with support for reading and writing Parquet files.
DuckDB C++ API - Internal DuckDB C++ API.
libcudf - A GPU-accelerated DataFrame library for tabular data processing.

Dart

DuckDB.Dart - DuckDB Dart bindings.

Go

duckdb-go - DuckDB Go client.
parquet - Official Go implementation of Apache Parquet, part of the Apache Arrow project.
parsyl/parquet - A Go library for reading and writing Parquet files.

Java

cudf - Java bindings for cudf, to be able to process large amounts of data on a GPU.
duckdb-java - DuckDB Java/JDBC API.
hardwood - A minimal dependency implementation of Apache Parquet.
parquet-carpet - A Java library for serializing and deserializing Parquet files efficiently using Java records.
parquet-java - A Java implementation of the Parquet format, owned by the Apache Software Foundation.

JavaScript

duckdb-node-neo - DuckDB Node.js client.
duckdb-wasm - WebAssembly version of DuckDB.
hyparquet - A lightweight, dependency-free, pure JavaScript library for parsing Apache Parquet files.
parquet-wasm - WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow using the Rust parquet and arrow crates.

Julia

DuckDB - Official DuckDB Julia package.
Parquet.jl - Julia implementation of Parquet columnar file format reader.

.NET

Parquet.Net - A fully managed Parquet library for .NET.
ParquetSharp - A .NET wrapper over the C++ Parquet library that integrates with .NET Arrow.

PHP

duckdb-php - DuckDB API for PHP.

Python

duckdb-python - DuckDB Python client.
fastparquet - A Python implementation of the Parquet columnar file format.
pyarrow - A Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with Pandas, NumPy, and other software in the Python ecosystem.
pylibcudf - A lightweight Cython interface to libcudf that provides near-zero overhead for GPU-accelerated data processing in Python.

R

arrow - The arrow package provides an Arrow C++ backend to dplyr, and access to the Arrow C++ library through familiar base R and tidyverse functions, or R6 classes.
duckdb-r - DuckDB R package.
nanoparquet - A reader and writer for a common subset of Parquet files.

Ruby

Red Parquet - The Ruby bindings of Apache Parquet, based on GObject Introspection.

Rust

datafusion - An extensible query engine written in Rust that can read/write Parquet files using SQL or a DataFrame API.
duckdb-rs - DuckDB Rust client.
parquet - The official native Rust implementation of Apache Parquet, part of the Apache Arrow project.
Polars - A DataFrame interface on top of an OLAP Query Engine that supports reading and writing Parquet files, with bindings for Python.

Swift

duckdb-swift - DuckDB Swift client.

VBA

duckdb-vba - Excel/VBA bridge for DuckDB, enabling users to read, query, transform, and export Parquet files directly from Excel through a native DLL bridge.

Tools

Command-line

DataFusion CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
DuckDB CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
nail - Command-line tool for analyzing, transforming, and exploring data files.
ODBC to Parquet - A command-line tool to query an ODBC data source and write the result into a Parquet file.
parquet-cli - Java-based CLI tool for exploring Parquet files.
parquet-cli-standalone - A JAR file for the parquet-cli tool which can be run without any dependencies.
parquet-grep - A CLI tool to search for strings in Parquet files.
parquet-tools - Python-based CLI tool for exploring Parquet files (part of Apache Arrow).
Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Desktop applications

DBConvert Streams - A desktop SQL tool for querying Parquet, CSV, JSON, databases, and S3-compatible storage in one workspace.
Munquet - A desktop tool to convert CSV files to Parquet.
Pink Parquet - A free and open-source, user-friendly viewer for Parquet files for Windows.
Tad - An application for viewing and analyzing tabular data sets.

Plugins

nf-parquet - A Nextflow plugin that can read and write Parquet files.

Terminal UI

Datanomy - A terminal-based tool for visualizing a Parquet file's metadata and structure.
DataTUI - A keyboard-first terminal UI for exploring Parquet with tabs, sorting, filtering, SQL (Polars), and more.
parqeye - Peek inside Parquet files right from your terminal.
parquetlens - Parquet previewer with a csvlens-style TUI.
Tabiew - A lightweight TUI application to view and query tabular data files, such as CSV, TSV, and Parquet.

Web

ChatDB - Online tools for viewing and converting from and to Parquet files.
DataConverter.io - Online tools for viewing, converting, and transforming Parquet files.
Datasette - A tool to explore datasets, with support for reading Parquet files.
DataStudio - Explore and visualize data, entirely in your browser.
GeoParquet Viewer - A table and map viewer for GeoParquet files in the browser.
Onyxia Data Explorer - A web-based tool to explore Parquet files in the browser.
Parquet File Visualizer - A Parquet metadata visualizer, generated with Claude Code, that runs in your browser.
Parquet Viewer - View Parquet files online.
Quak - A scalable data profiler for quickly scanning large tables.

Resources

Blogs

icem7 - A blog (in French) about data science tools, with in-depth articles on Parquet.
Hyparquet: The Quest for Instant Data - 6 optimization tricks to read Parquet files faster in the browser.
Querying Parquet with Precision Using DuckDB - Describes how DuckDB optimizes queries to a Parquet file using projection & filter pushdown.
Why Parquet Is the Go-To Format for Data Engineers - A graphical description of the Parquet format with optimization and best practices.
Column Storage for the AI Era - A proposal by the creator of Parquet to better support AI workloads by adding encodings and metadata.
I spent 8 hours learning Parquet. Here’s what I discovered - A graphical description of the Parquet format.
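
The filter pushdown mentioned in the DuckDB entry above relies on the per-row-group min/max statistics that Parquet stores in its metadata. Here is a toy sketch of that idea, using only the Python standard library. `RowGroupStats` and `groups_to_scan` are hypothetical names invented for this illustration. They are not DuckDB's implementation or any library's API.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics of one column in one row group."""
    index: int
    min_value: int
    max_value: int


def groups_to_scan(stats: list[RowGroupStats], lower: int) -> list[int]:
    """Return indices of row groups that may contain a value greater than `lower`.

    A row group whose maximum is <= lower cannot match the predicate
    `value > lower`, so a reader can skip it without decoding any data.
    """
    return [s.index for s in stats if s.max_value > lower]


# Illustrative statistics for three row groups of a sorted column.
stats = [
    RowGroupStats(index=0, min_value=1, max_value=10),
    RowGroupStats(index=1, min_value=11, max_value=20),
    RowGroupStats(index=2, min_value=21, max_value=30),
]
print(groups_to_scan(stats, lower=15))
```

With these made-up statistics, only the first row group can be ruled out for `value > 15`. The second must still be scanned because its range straddles the threshold.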

Documentation

Parquet - The specification for Apache Parquet and Apache Thrift definitions to read and write Parquet metadata.
Apache Parquet Documentation - The official documentation for Apache Parquet.
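
As a small illustration of the file layout described in the specification, the sketch below locates the Thrift-encoded footer metadata using only the Python standard library. It assumes a file with a plain (non-encrypted) footer, which starts and ends with the 4-byte magic `PAR1` and stores the footer length as a 4-byte little-endian integer just before the trailing magic. `footer_span` is a hypothetical helper name, and decoding the Thrift metadata itself is left to a real Parquet library.

```python
import io
import struct
from typing import BinaryIO

MAGIC = b"PAR1"  # marker at both ends of a Parquet file with a plain footer


def footer_span(f: BinaryIO) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata in a seekable binary stream."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < 12:  # 4 (magic) + 4 (footer length) + 4 (magic)
        raise ValueError("too small to be a Parquet file")
    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing leading PAR1 magic")
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])  # little-endian uint32
    offset = size - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length


# Synthetic bytes that mimic the layout (not a valid Parquet file):
# magic, 10 bytes of "data", 7 bytes of "footer", footer length, magic.
fake = MAGIC + b"d" * 10 + b"F" * 7 + struct.pack("<I", 7) + MAGIC
print(footer_span(io.BytesIO(fake)))
```

For a file on disk, the same function can be called as `footer_span(open(path, "rb"))`, where `path` is whatever file you want to inspect.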

Educational resources

ssphub - A workshop by Insee (in French) illustrating the use of French census data 🇫🇷 distributed in Parquet format.

Parquet engineering

Best Practices for Distributing GeoParquet - Best practices for making 'good' GeoParquet files, especially for distribution of data.
Handling Parquet Files - Recommendations about the row group size and the Parquet file sizes.
Les filtres de Bloom dans Parquet - An in-depth article (in French) on Bloom filters in Parquet, which are useful for indexing unsorted, high-cardinality columns.
Tips for Writing Parquet Files - Tips for choosing the right parameters when writing Parquet files, such as the row group size and the number of row groups per file.
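
To make the relationship between row group size and the number of row groups per file concrete, here is a minimal arithmetic sketch. `plan_row_groups` is a hypothetical helper, and the numbers are illustrative only, not a recommendation. It assumes the writer fills each row group up to a fixed row cap, which is how a setting such as the `row_group_size` parameter of `pyarrow.parquet.write_table` is commonly described.

```python
def plan_row_groups(total_rows: int, rows_per_group: int) -> list[int]:
    """Split `total_rows` into consecutive row groups of at most `rows_per_group` rows."""
    if total_rows < 0 or rows_per_group <= 0:
        raise ValueError("total_rows must be >= 0 and rows_per_group must be > 0")
    full, rest = divmod(total_rows, rows_per_group)
    # Every group is full except possibly the last one, which holds the remainder.
    return [rows_per_group] * full + ([rest] if rest else [])


plan = plan_row_groups(total_rows=250_000, rows_per_group=100_000)
print(len(plan), plan)
```

A smaller cap yields more row groups, which gives readers finer-grained skipping at the cost of more metadata. A larger cap does the opposite. The articles above discuss how to choose between them.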

Tests

parquet-testing - Testing Data and Utilities for Apache Parquet.

Related formats

F3 - A data file format that is designed with efficiency, interoperability, and extensibility in mind.
GeoParquet - Specification for storing geospatial vector data (point, line, polygon) in Parquet.
Iceberg - A high-performance format for huge analytic tables that supports Parquet as one of its storage formats.
Lance - Modern columnar data format for ML and LLMs.
Nimble - File format for storage of large columnar datasets.
ORC - Self-describing type-aware columnar file format designed for Hadoop workloads.
Vortex - A columnar file format designed for high-performance data processing.

Contributing

Contributions welcome! Read the contribution guidelines first.
