Commit 9a2e2cd

chore: add prettier (#24)
1 parent ff50d8f commit 9a2e2cd

10 files changed

Lines changed: 198 additions & 1323 deletions

.github/workflows/ci.yaml

Lines changed: 5 additions & 1 deletion
@@ -12,4 +12,8 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v5
-      - uses: DavidAnson/markdownlint-cli2-action@v20
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+      - run: npm ci
+      - run: npm run lint

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 .cache
+.venv
 node_modules

.markdownlint-cli2.jsonc

Lines changed: 0 additions & 7 deletions
This file was deleted.

.prettierrc

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+{
+  "proseWrap": "always"
+}

README.md

Lines changed: 29 additions & 17 deletions
@@ -1,26 +1,37 @@
 # stac-geoparquet

-<!-- markdownlint-disable-next-line MD033 -->
 <img src="./img/stac-geoparquet.png" alt="The stac-geoparquet logo" width=200 />

-A specification for storing [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org) items in [GeoParquet](https://geoparquet.org/).
-The specification lives at <https://github.com/radiantearth/stac-geoparquet-spec/blob/main/stac-geoparquet-spec.md>.
+A specification for storing
+[SpatioTemporal Asset Catalog (STAC)](https://stacspec.org) items in
+[GeoParquet](https://geoparquet.org/). The specification lives at
+<https://github.com/radiantearth/stac-geoparquet-spec/blob/main/stac-geoparquet-spec.md>.

-> [!WARNING]
-> The **stac-geoparquet** specification is under development, and has not yet been released as a stable v1.
-> See [this milestone](https://github.com/radiantearth/stac-geoparquet-spec/milestone/1) to track progress towards a stable release.
+> [!WARNING] The **stac-geoparquet** specification is under development, and has
+> not yet been released as a stable v1. See
+> [this milestone](https://github.com/radiantearth/stac-geoparquet-spec/milestone/1)
+> to track progress towards a stable release.

 ## Motivation

-The STAC spec defines a JSON-based schema.
-But it can be hard to manage and search through many millions of STAC items in JSON format.
-For one, JSON is very large on disk.
-And you need to parse the entire JSON data into memory to extract just a small piece of information, say the `datetime` and one `asset` of an Item.
-
-GeoParquet can be a good complement to JSON for many bulk-access and analytic use cases.
-While STAC Items are commonly distributed as individual JSON files on object storage or through a [STAC API](https://github.com/radiantearth/stac-api-spec), STAC GeoParquet allows users to access a large number of STAC items in bulk without making repeated HTTP requests.
-
-For analytic questions like "find the items in the Sentinel-2 collection in June 2024 over New York City with cloud cover of less than 20%" it can be much, much faster to find the relevant data from a GeoParquet source than from JSON, because GeoParquet needs to load only the relevant columns for that query, not the full data.
+The STAC spec defines a JSON-based schema. But it can be hard to manage and
+search through many millions of STAC items in JSON format. For one, JSON is very
+large on disk. And you need to parse the entire JSON data into memory to extract
+just a small piece of information, say the `datetime` and one `asset` of an
+Item.
+
+GeoParquet can be a good complement to JSON for many bulk-access and analytic
+use cases. While STAC Items are commonly distributed as individual JSON files on
+object storage or through a
+[STAC API](https://github.com/radiantearth/stac-api-spec), STAC GeoParquet
+allows users to access a large number of STAC items in bulk without making
+repeated HTTP requests.
+
+For analytic questions like "find the items in the Sentinel-2 collection in June
+2024 over New York City with cloud cover of less than 20%" it can be much, much
+faster to find the relevant data from a GeoParquet source than from JSON,
+because GeoParquet needs to load only the relevant columns for that query, not
+the full data.

 ## Development

@@ -53,8 +64,9 @@ uv run check-jsonschema --schemafile json-schema/metadata.json example-metadata.

 ## History

-The **stac-geoparquet** specification was split from the [stac-utils repository](https://github.com/stac-utils/stac-geoparquet) in October 2025.
-The **git** history was preserved via the following command:
+The **stac-geoparquet** specification was split from the
+[stac-utils repository](https://github.com/stac-utils/stac-geoparquet) in
+October 2025. The **git** history was preserved via the following command:

 ```sh
 git filter-repo --subdirectory-filter=spec --path LICENSE --path README.md --path docs/drawbacks.md

docs/drawbacks.md

Lines changed: 8 additions & 3 deletions
@@ -4,9 +4,11 @@ Trying to represent STAC data in GeoParquet has some drawbacks.

 ## Unable to represent undefined values

-Parquet is unable to represent the difference between _undefined_ and _null_, and so is unable to perfectly round-trip STAC data with _undefined_ values.
+Parquet is unable to represent the difference between _undefined_ and _null_,
+and so is unable to perfectly round-trip STAC data with _undefined_ values.

-In JSON a value can have one of three states: defined, undefined, or null. The `"b"` key in the next three examples illustrates this:
+In JSON a value can have one of three states: defined, undefined, or null. The
+`"b"` key in the next three examples illustrates this:

 Defined:

@@ -34,7 +36,10 @@ Null:
 }
 ```

-Because Parquet is a columnar format, it is only able to represent undefined at the _column_ level. So if those three JSON items above were converted to Parquet, the column `"b"` would exist because it exists in the first and third item, and the second item would have `"b"` inferred as `null`:
+Because Parquet is a columnar format, it is only able to represent undefined at
+the _column_ level. So if those three JSON items above were converted to
+Parquet, the column `"b"` would exist because it exists in the first and third
+item, and the second item would have `"b"` inferred as `null`:

 | a | b |
 | --- | ----- |
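
The round-trip loss described in `docs/drawbacks.md` can be illustrated without Parquet at all. The sketch below is a minimal stand-in that uses plain Python dicts and lists as the "columns"; the three items, their values, and the helper names `to_columns` and `to_rows` are hypothetical and only mirror the defined, undefined, and null cases from the text.

```python
import json

# Three hypothetical items: "b" defined, "b" undefined (key absent), "b" null.
items = [
    json.loads('{"a": 1, "b": "foo"}'),
    json.loads('{"a": 2}'),
    json.loads('{"a": 3, "b": null}'),
]


def to_columns(records):
    """Pivot row-oriented records into columns, filling absent keys with None."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    return {name: [record.get(name) for record in records] for name in names}


def to_rows(columns):
    """Rebuild row-oriented records from the columnar layout."""
    length = len(next(iter(columns.values()), []))
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(length)
    ]


columns = to_columns(items)
round_tripped = to_rows(columns)

# The second item originally had no "b" key; after the round trip it has
# "b" set to None, so it can no longer be told apart from an explicit null.
print(columns)
print(round_tripped[1])
```

Once the data is columnar, the second and third items carry the same value for `"b"`, which is the information the text says cannot be recovered.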

docs/schema.md

Lines changed: 46 additions & 14 deletions
@@ -1,30 +1,59 @@
 # Schema considerations

-A STAC Item is a JSON object to describe an external geospatial dataset. The STAC specification defines a common core, plus a variety of extensions. Additionally, STAC Items may include custom extensions outside the common ones. Crucially, the majority of the specified fields in the core spec and extensions define optional keys. Those keys often differ across STAC collections and may even differ within a single collection across items.
-
-STAC's flexibility is a blessing and a curse. The flexibility of schemaless JSON allows for very easy writing as each object can be dumped separately to JSON. Every item is allowed to have a different schema. And newer items are free to have a different schema than older items in the same collection. But this write-time flexibility makes it harder to read as there are no guarantees (outside STAC's few required fields) about what fields exist.
-
-Parquet is the complete opposite of JSON. Parquet has a strict schema that must be known before writing can start. This puts the burden of work onto the writer instead of the reader. Reading Parquet is very efficient because the file's metadata defines the exact schema of every record. This also enables use cases like reading specific columns that would not be possible without a strict schema.
-
-This conversion from schemaless to strict-schema is the difficult part of converting STAC from JSON to GeoParquet, especially for large input datasets like STAC that are often larger than memory.
+A STAC Item is a JSON object to describe an external geospatial dataset. The
+STAC specification defines a common core, plus a variety of extensions.
+Additionally, STAC Items may include custom extensions outside the common ones.
+Crucially, the majority of the specified fields in the core spec and extensions
+define optional keys. Those keys often differ across STAC collections and may
+even differ within a single collection across items.
+
+STAC's flexibility is a blessing and a curse. The flexibility of schemaless JSON
+allows for very easy writing as each object can be dumped separately to JSON.
+Every item is allowed to have a different schema. And newer items are free to
+have a different schema than older items in the same collection. But this
+write-time flexibility makes it harder to read as there are no guarantees
+(outside STAC's few required fields) about what fields exist.
+
+Parquet is the complete opposite of JSON. Parquet has a strict schema that must
+be known before writing can start. This puts the burden of work onto the writer
+instead of the reader. Reading Parquet is very efficient because the file's
+metadata defines the exact schema of every record. This also enables use cases
+like reading specific columns that would not be possible without a strict
+schema.
+
+This conversion from schemaless to strict-schema is the difficult part of
+converting STAC from JSON to GeoParquet, especially for large input datasets
+like STAC that are often larger than memory.

 ## Full scan over input data

-The most foolproof way to convert STAC JSON to GeoParquet is to perform a full scan over input data. This is done automatically by [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow] when a schema is not provided.
+The most foolproof way to convert STAC JSON to GeoParquet is to perform a full
+scan over input data. This is done automatically by
+[`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow]
+when a schema is not provided.

-This is time consuming as it requires two full passes over the input data: once to infer a common schema and again to actually write to Parquet (though items are never fully held in memory, allowing this process to scale).
+This is time consuming as it requires two full passes over the input data: once
+to infer a common schema and again to actually write to Parquet (though items
+are never fully held in memory, allowing this process to scale).

 ## User-provided schema

-Alternatively, the user can pass in an Arrow schema themselves using the `schema` parameter of [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow]. This `schema` must match the on-disk schema of the the STAC JSON data.
+Alternatively, the user can pass in an Arrow schema themselves using the
+`schema` parameter of
+[`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow].
+This `schema` must match the on-disk schema of the the STAC JSON data.

 ## Multiple schemas per collection

-It is also possible to write multiple Parquet files with STAC data where each Parquet file may have a different schema. This simplifies the conversion and writing process but makes reading and using the Parquet data harder.
+It is also possible to write multiple Parquet files with STAC data where each
+Parquet file may have a different schema. This simplifies the conversion and
+writing process but makes reading and using the Parquet data harder.

 ### Merging data with schema mismatch

-If you've created STAC GeoParquet data where the schema has updated, you can use [`pyarrow.concat_tables`][pyarrow.concat_tables] with `promote_options="permissive"` to combine multiple STAC GeoParquet files.
+If you've created STAC GeoParquet data where the schema has updated, you can use
+[`pyarrow.concat_tables`][pyarrow.concat_tables] with
+`promote_options="permissive"` to combine multiple STAC GeoParquet files.

 ```py
 import pyarrow as pa
@@ -37,6 +66,9 @@ combined_table = pa.concat_tables([table1, table2], promote_options="permissive"

 ## Future work

-Schema operations is an area where future work can improve reliability and ease of use of STAC GeoParquet.
+Schema operations is an area where future work can improve reliability and ease
+of use of STAC GeoParquet.

-It's possible that in the future we could automatically infer an Arrow schema from the STAC specification's published JSON Schema files. If you're interested in this, open an issue and discuss.
+It's possible that in the future we could automatically infer an Arrow schema
+from the STAC specification's published JSON Schema files. If you're interested
+in this, open an issue and discuss.
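
The "Full scan over input data" section of `docs/schema.md` describes two passes over the input: one to infer a common schema, one to write. A rough sketch of that idea follows, using only the Python standard library. It is not the `stac_geoparquet` implementation: the helper names `infer_fields` and `normalize` and the two sample items are hypothetical, only top-level keys are considered, and Python type names stand in for Arrow types.

```python
import io
import json


def infer_fields(lines):
    """First pass: collect the union of top-level keys and the value types seen."""
    fields = {}
    for line in lines:
        if not line.strip():
            continue
        for key, value in json.loads(line).items():
            fields.setdefault(key, set()).add(type(value).__name__)
    return fields


def normalize(lines, fields):
    """Second pass: yield each record with every inferred field present."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Keys missing from this record become None, i.e. a null in the column.
        yield {key: record.get(key) for key in fields}


# Hypothetical newline-delimited JSON with two items whose keys differ.
ndjson = (
    '{"id": "a", "datetime": "2024-06-01T00:00:00Z"}\n'
    '{"id": "b", "eo:cloud_cover": 12.5}\n'
)

# Each pass reads the input from the start and handles one line at a time,
# so the full dataset never has to be held in memory.
fields = infer_fields(io.StringIO(ndjson))
rows = list(normalize(io.StringIO(ndjson), fields))

print(fields)
print(rows)
```

Reading the input twice is the cost the section mentions; supplying a schema up front would make the first pass unnecessary.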
