severo/awesome-parquet

Awesome Parquet

Useful resources for using the Parquet format

Libraries

C GLib

  • Arrow GLib - A wrapper library for Arrow C++.
  • DuckDB - An in-process database library that supports reading and writing Parquet files.

C++

  • Apache Arrow C++ - A library with support for reading and writing Parquet files.
  • DuckDB C++ API - Internal DuckDB C++ API.
  • libcudf - A GPU-accelerated DataFrame library for tabular data processing.

Dart

Go

  • duckdb-go - DuckDB Go client.
  • parquet - The Parquet package of the official Go implementation of Apache Arrow.
  • parsyl/parquet - A Go library for reading and writing Parquet files.

Java

  • cudf - Java bindings for cudf, for processing large amounts of data on a GPU.
  • duckdb-java - DuckDB Java/JDBC API.
  • hardwood - A minimal-dependency implementation of Apache Parquet.
  • parquet-carpet - A Java library for serializing and deserializing Parquet files efficiently using Java records.
  • parquet-java - A Java implementation of the Parquet format, owned by the Apache Software Foundation.

JavaScript

  • duckdb-node-neo - DuckDB Node.js client.
  • duckdb-wasm - WebAssembly version of DuckDB.
  • hyparquet - A lightweight, dependency-free, pure JavaScript library for parsing Apache Parquet files.
  • parquet-wasm - WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow using the Rust parquet and arrow crates.

Julia

  • DuckDB - Official DuckDB Julia package.
  • Parquet.jl - A Julia implementation of a reader for the Parquet columnar file format.

.NET

PHP

Python

  • duckdb-python - DuckDB Python client.
  • fastparquet - A Python implementation of the Parquet columnar file format.
  • pyarrow - A Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with Pandas, NumPy, and other software in the Python ecosystem.
  • pylibcudf - A lightweight Cython interface to libcudf that provides near-zero overhead for GPU-accelerated data processing in Python.

R

  • arrow - The arrow package provides an Arrow C++ backend to dplyr, and access to the Arrow C++ library through familiar base R and tidyverse functions, or R6 classes.
  • duckdb-r - DuckDB R package.
  • nanoparquet - A reader and writer for a common subset of Parquet files.

Ruby

  • Red Parquet - The Ruby bindings of Apache Parquet, based on GObject Introspection.

Rust

  • datafusion - An extensible query engine written in Rust that can read/write Parquet files using SQL or a DataFrame API.
  • duckdb-rs - DuckDB Rust client.
  • parquet - The official native Rust implementation of Apache Parquet, part of the Apache Arrow project.
  • Polars - A DataFrame interface on top of an OLAP Query Engine that supports reading and writing Parquet files, with bindings for Python.

Swift

VBA

  • duckdb-vba - Excel/VBA bridge for DuckDB, enabling users to read, query, transform, and export Parquet files directly from Excel through a native DLL bridge.

Tools

Command-line

  • DataFusion CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
  • DuckDB CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
  • nail - Command-line tool for analyzing, transforming, and exploring data files.
  • ODBC to Parquet - A command-line tool to query an ODBC data source and write the result into a Parquet file.
  • parquet-cli - Java-based CLI tool for exploring Parquet files.
  • parquet-cli-standalone - A JAR file for the parquet-cli tool which can be run without any dependencies.
  • parquet-grep - A CLI tool to search for strings in Parquet files.
  • parquet-tools - Python-based CLI tool for exploring Parquet files (part of Apache Arrow).
  • Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Desktop applications

  • DBConvert Streams - A desktop SQL tool for querying Parquet, CSV, JSON, databases, and S3-compatible storage in one workspace.
  • Munquet - A desktop tool to convert CSV files to Parquet.
  • Pink Parquet - A free and open-source, user-friendly viewer for Parquet files for Windows.
  • Tad - An application for viewing and analyzing tabular data sets.

Plugins

  • nf-parquet - A Nextflow plugin that can read and write Parquet files.

Terminal UI

  • Datanomy - A terminal-based tool for visualizing a Parquet file's metadata and structure.
  • DataTUI - A keyboard-first terminal UI for exploring Parquet with tabs, sorting, filtering, SQL (Polars), and more.
  • parqeye - Peek inside Parquet files right from your terminal.
  • parquetlens - Parquet previewer with a csvlens-style TUI.
  • Tabiew - A lightweight TUI application to view and query tabular data files, such as CSV, TSV, and Parquet.

Web

  • ChatDB - Online tools for viewing and converting from and to Parquet files.
  • DataConverter.io - Online tools for viewing, converting, and transforming Parquet files.
  • Datasette - A tool to explore datasets, with support for reading Parquet files.
  • DataStudio - Explore and visualize data, entirely in your browser.
  • GeoParquet Viewer - A table and map viewer for GeoParquet files in the browser.
  • Onyxia Data Explorer - A web-based tool to explore Parquet files in the browser.
  • Parquet File Visualizer - A Parquet metadata visualizer, generated with Claude Code, that runs in your browser.
  • Parquet Viewer - View Parquet files online.
  • Quak - A scalable data profiler for quickly scanning large tables.

Resources

Blogs

Documentation

  • Parquet - The specification for Apache Parquet and Apache Thrift definitions to read and write Parquet metadata.
  • Apache Parquet Documentation - The official documentation for Apache Parquet.

Educational resources

  • ssphub - A workshop by Insee illustrating the use of French census data 🇫🇷 published in Parquet format.

Parquet engineering

Tests

Related formats

  • F3 - A data file format that is designed with efficiency, interoperability, and extensibility in mind.
  • GeoParquet - Specification for storing geospatial vector data (point, line, polygon) in Parquet.
  • Iceberg - A high-performance format for huge analytic tables that supports Parquet as one of its storage formats.
  • Lance - Modern columnar data format for ML and LLMs.
  • Nimble - File format for storage of large columnar datasets.
  • ORC - Self-describing type-aware columnar file format designed for Hadoop workloads.
  • Vortex - A columnar file format designed for high-performance data processing.

Contributing

Contributions welcome! Read the contribution guidelines first.
