ZEP 10 — Zarr Generic Extensions

Authors:

Status: Draft

Type: Specification

Created: 2025-05-12

Abstract

This proposal defines a new generic extension point, extensions, to be included in the metadata of Zarr v3 arrays and groups. The extensions field provides a consistent mechanism for attaching additional metadata that does not fit into existing extension points defined by the core specification. Extension entries within this field follow the naming and structure rules established in ZEP0009. This mechanism enables third parties to define and share metadata extensions without requiring changes to the core specification or introducing new top-level keys.

Introduction

Zarr specification version 3 currently defines four extension points, each associated with a specific (array) metadata field. Additional extension points may be added by future ZEPs. Until that time, however, third parties may want to add arbitrary extension objects to either arrays or groups. This proposal introduces a generic extensions field that serves as a container for such a list of extensions.

These general-purpose extensions are not limited by the scopes of existing extension points and require no heavy-weight process to add functionality or alter the behavior of arrays and groups. The intent is to facilitate decentralized and low-friction innovation within the Zarr ecosystem by enabling third parties to experiment with new features without requiring immediate changes to the core specification. By tolerating a broader range of experimental extensions, the community can explore diverse use cases and patterns. Over time, widely adopted extensions may serve as the foundation for future standardization through new ZEPs which introduce new extension points or even core features.

Proposal

To provide for more flexible, immediate, and decentralized use cases, we propose to add a generic extension point extensions on both arrays and groups into which extensions MAY be added.

This field is similar in flexibility to the attributes field. Conceptually, extensions is intended primarily for use by software and automated processes, with the potential to influence behavior or processing logic, whereas attributes are generally intended for human interpretation and serve as passive metadata or provenance information, though the boundaries are not always distinct.

By adding a new field, the specification can assert restrictions that, if added to attributes, would amount to a breaking change. If present, the extensions field MUST contain an array of extension definitions. The array MUST contain one or more extensions; otherwise the field MUST be omitted entirely. Specifying metadata within extensions as opposed to attributes allows the clear registration of the extension name, provides a namespace for the metadata to prevent collisions, and activates the must_understand handling logic.
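
The structural rule above (the field is either absent or a non-empty array) can be checked in a few lines. The following Python sketch is illustrative only; the function name validate_extensions_field is hypothetical and is not part of any Zarr implementation.

```python
from __future__ import annotations

from typing import Any


def validate_extensions_field(metadata: dict[str, Any]) -> list[Any]:
    """Return the extension entries of a node, or [] if the field is absent.

    Hypothetical helper: raises ValueError when the field is present but is
    not a non-empty array, as required by the rule described above.
    """
    if "extensions" not in metadata:
        # Omitting the field is the only valid way to express "no extensions".
        return []
    extensions = metadata["extensions"]
    if not isinstance(extensions, list):
        raise ValueError("'extensions' must be an array of extension definitions")
    if not extensions:
        raise ValueError("'extensions' must be omitted rather than left empty")
    return extensions
```

Validation of the individual entries (names, configuration) is deliberately left out here, since those rules are defined elsewhere.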

Further details on the specification changes can be found in https://github.com/zarr-developers/zarr-specs/pull/344.

Definition and naming

Each extension object will follow the rules laid out in the “Zarr extensions” section of the v3 specification.

Processing

Zarr implementers are expected to inspect the extensions for each node and determine whether each listed extension is supported. If an extension includes "must_understand": true (the default) and the implementation does not support it, the node must not be loaded and an appropriate error should be raised. For extensions with "must_understand": false, implementers may safely ignore unrecognized entries.
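
A minimal Python sketch of this check might look as follows. The names check_extensions and UnsupportedExtensionError are hypothetical, and the sketch assumes that a plain string entry is shorthand for an object containing only a name.

```python
from __future__ import annotations

from typing import Any


class UnsupportedExtensionError(Exception):
    """Raised when a must_understand extension is not supported (hypothetical)."""


def check_extensions(extensions: list[Any], supported: set[str]) -> list[dict[str, Any]]:
    """Return the entries the implementation should act on.

    Unsupported entries are skipped when must_understand is false and cause
    an error otherwise (must_understand defaults to true).
    """
    active = []
    for entry in extensions:
        if isinstance(entry, str):
            # Assumption: a bare string is shorthand for {"name": <string>}.
            entry = {"name": entry}
        name = entry["name"]
        if name in supported:
            active.append(entry)
        elif entry.get("must_understand", True):
            raise UnsupportedExtensionError(
                f"extension {name!r} is required but not supported"
            )
        # Otherwise the extension is optional and unknown: ignore it.
    return active
```

An implementation would run such a check before handing the node to the user, so that a required but unknown extension prevents the node from being loaded.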

To support a given extension, an implementation may hard-code a check for known extension names and invoke appropriate logic according to the extension’s specification at the correct point in its processing pipeline (e.g., during metadata interpretation, data access, or layout resolution). Where possible, however, implementations are encouraged to delegate that logic via a callback or plugin mechanism that allows third-party code to handle the extension dynamically.
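
One possible shape for such a plugin mechanism is a registry that maps extension names to handler functions. The sketch below is an assumption about how an implementation could structure this, not a description of any existing library; register_extension, apply_extensions, and the context dictionary are all hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable

# A handler receives the extension's configuration and a mutable context
# object that the (hypothetical) implementation threads through its pipeline.
Handler = Callable[[dict[str, Any], dict[str, Any]], None]

_HANDLERS: dict[str, Handler] = {}


def register_extension(name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for the given extension name."""

    def decorator(func: Handler) -> Handler:
        _HANDLERS[name] = func
        return func

    return decorator


def apply_extensions(metadata: dict[str, Any], context: dict[str, Any]) -> None:
    """Dispatch every extension listed in the node metadata to its handler."""
    for entry in metadata.get("extensions", []):
        if isinstance(entry, str):
            # Assumption: a bare string is shorthand for {"name": <string>}.
            entry = {"name": entry}
        handler = _HANDLERS.get(entry["name"])
        if handler is not None:
            handler(entry.get("configuration", {}), context)
        elif entry.get("must_understand", True):
            raise NotImplementedError(f"unsupported extension {entry['name']!r}")


# Third-party code can plug in support without changes to the core library.
@register_extension("example.array-statistics")
def _record_statistics(configuration: dict[str, Any], context: dict[str, Any]) -> None:
    context["value_range"] = (configuration["min"], configuration["max"])
```

The hard-coded alternative would replace the registry lookup with an if/elif chain over known names; the registry keeps the dispatch logic independent of the set of supported extensions.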

As the set of extensions evolves, certain interfaces may arise which allow this modular approach for a subset of extensions. Where possible, these interfaces will be added to the specification. Feedback from implementers on such matters is highly encouraged.

Examples

The following examples represent a few realistic use cases of the top-level extensions container. This ZEP is putting in place the mechanism so the community can experiment with such extensions before their standardization.

Offset (array)

{
    "zarr_format": 3,
    "node_type": "array",
    ...,
    "extensions": [
        {
            "name": "example.offset",
            "configuration": { "offset": [ 12, 24 ] }
        }
    ]
}

The example.offset extension contains an array with one entry per dimension, in the same order as the shape of the containing array, specifying which element of the array should be considered as the origin, e.g., [0, 0]. This allows subregions of an array to be reused without rewriting the data.

Note that in this example the extension has must_understand=true (the default), meaning an implementation which does not support the example.offset extension should raise an error.

Statistics (array)

{
    "zarr_format": 3,
    "node_type": "array",
    ...,
    "extensions": [
        {
            "name": "example.array-statistics",
            "must_understand": false,
            "configuration": {
                "min": 5,
                "max": 1023
            }
        }
    ]
}

The example.array-statistics extension contains two fields, min and max, specifying the range of values present in the array, which reduces the need to read every byte. must_understand is false, so implementations can safely ignore the extension.
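
As an illustration of how a reader could benefit, the following Python sketch returns the stored range when the extension is present and only falls back to scanning the data otherwise. The function value_range and its read_values callback are hypothetical stand-ins for an implementation's own data access.

```python
from __future__ import annotations

from typing import Any, Callable, Iterable


def value_range(
    metadata: dict[str, Any],
    read_values: Callable[[], Iterable[float]],
) -> tuple[float, float]:
    """Return (min, max) from example.array-statistics, else by scanning the data."""
    for entry in metadata.get("extensions", []):
        if isinstance(entry, dict) and entry.get("name") == "example.array-statistics":
            config = entry.get("configuration", {})
            if "min" in config and "max" in config:
                # The stored statistics make a full read unnecessary.
                return config["min"], config["max"]
    # Fallback: the extension is absent or incomplete, so read every value.
    values = list(read_values())
    return min(values), max(values)
```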

Skip empty chunks (array)

{
    "zarr_format": 3,
    ...,
    "extensions": [
        "example.skip_empty_chunks"
    ]
}

Currently the “write_empty_chunks” flag in zarr-python is not propagated to the zarr.json file. An extension like example.skip_empty_chunks could serve as a no-configuration flag in the metadata to inform implementations that empty chunks should not be written.
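
The example above lists the extension as a bare string rather than an object. Assuming this string form is shorthand for an object that carries only a name (with must_understand at its default of true), an implementation could normalize both forms before processing. The helper normalize_extension below is a hypothetical sketch of that step.

```python
from __future__ import annotations

from typing import Any


def normalize_extension(entry: Any) -> dict[str, Any]:
    """Return an extension entry in object form with must_understand made explicit."""
    if isinstance(entry, str):
        # Assumption: a bare string is shorthand for a name-only object.
        return {"name": entry, "must_understand": True}
    if isinstance(entry, dict) and "name" in entry:
        normalized = dict(entry)  # shallow copy, leave the input untouched
        normalized.setdefault("must_understand", True)
        return normalized
    raise ValueError(f"not a valid extension entry: {entry!r}")
```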

Multiscale arrays (group)

{
    "zarr_format": 3,
    "node_type": "group",
    ...,
    "extensions": [
        {
            "name": "example.multiscale-arrays",
            "must_understand": false,
            "configuration": {
                "multiscale": {
                    "datasets": [
                        "path/to/array/1",
                        "path/to/array/2",
                        "path/to/array/3"
                    ]
                }
            }
        }
    ]
}

Metadata is introduced in the example.multiscale-arrays extension which allows encoding a relationship between multiple arrays at the group level. This defines a “multiscale pyramid” of arrays which is a common idiom in both the geospatial and bioimaging uses of Zarr. Implementations may choose to return a different subclass or backend when detecting such metadata. In this case, a “datatree” which allows similar operations on all levels of the pyramid might be preferred.

Tiered storage (group)

{
    "zarr_format": 3,
    "node_type": "group",
    ...,
    "extensions": [
        {
            "name": "example.tiered-storage",
            "must_understand": false,
            "configuration": {
                "slow-arrays": [
                    "path/to/array/1"
                ]
            }
        }
    ]
}

Related to the multiscales example above, an example.tiered-storage extension could identify arrays within a group which have been placed on slower or even archival storage, and which will therefore incur more overhead and potentially higher costs when accessed. An implementation might warn users before opening the array.
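
A warning of this kind could be sketched in Python as follows. The function warn_if_slow is hypothetical, and the sketch assumes that the paths in slow-arrays are compared verbatim against the path requested by the user.

```python
from __future__ import annotations

import warnings
from typing import Any


def warn_if_slow(group_metadata: dict[str, Any], array_path: str) -> bool:
    """Emit a UserWarning and return True if array_path is listed as slow."""
    for entry in group_metadata.get("extensions", []):
        if isinstance(entry, dict) and entry.get("name") == "example.tiered-storage":
            slow_arrays = entry.get("configuration", {}).get("slow-arrays", [])
            if array_path in slow_arrays:
                warnings.warn(
                    f"{array_path!r} is on slow storage; access may be slow or costly",
                    UserWarning,
                    stacklevel=2,
                )
                return True
    return False
```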

Application to sub-nodes

This ZEP does not try to define the behavior for application to sub-nodes itself, but defers this to actual extensions.

Conceptually, we propose that extensions defined on groups may be valid for their child nodes. However, the details of how an implementation should identify which extensions are active within a hierarchy are unclear. Relying on traversing the hierarchy towards the root node is undesirable from a performance point of view.

As a workaround, extension authors can choose to write some metadata within the contained subgroups and arrays to make this easier. Options for what this metadata could be include:

  1. A copy of the metadata
{
  "extensions":  [
    {
      "name": "example.my-extension",
      "configuration": { ... full copy of the metadata ...}
    }
  ]
}
  2. A reference to the metadata as part of the extension itself
{
  "extensions":  [
    {
      "name": "example.my-extension",
      "configuration": {
        "reference": "../.."
      }
    }
  ]
}
  3. A complementary reference extension
{
  "extensions":  [
    {
      "name": "example.my-extension-ref",
      "configuration": {
        "reference": "../.."
      }
    }
  ]
}
  4. A shared or even core reference extension
{
  "extensions":  [
    {
      "name": "example.parent-ref",
      "configuration": {
        "reference": "../.."
      }
    }
  ]
}
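
For the reference-based options, an implementation needs to turn a value such as "../.." into the path of the node that holds the metadata. This proposal does not define how references are resolved; the Python sketch below assumes they are POSIX-style paths relative to the referencing node, and resolve_reference is a hypothetical helper.

```python
import posixpath


def resolve_reference(node_path: str, reference: str) -> str:
    """Resolve a relative reference against a node path within the hierarchy.

    Paths are written without a leading slash; the root is returned as "".
    Raises ValueError if the reference is absolute or escapes the root.
    """
    if reference.startswith("/"):
        raise ValueError("only relative references are handled in this sketch")
    joined = posixpath.join(node_path.strip("/"), reference)
    resolved = posixpath.normpath(joined)
    if resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"reference {reference!r} escapes the root of the hierarchy")
    return "" if resolved == "." else resolved
```

With this, a node at a/b/c carrying "reference": "../.." would look up the extension metadata on the group at a, without traversing every intermediate level.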

As further experience is gained by the community of extension authors, one or more of these methods may be adopted into the core spec.

Alternatives for the extensions extension point

The current design allows having the same extension definition syntax across all extension points and reduces pollution of the top-level namespace in a zarr.json. Thus, the addition of top-level metadata keys remains reserved to changes in the core spec. This MAY happen as part of the core spec adopting functionality of an extension.

Alternative designs that were considered are listed below along with their pros and cons.

Top-level metadata keys

Instead of a generic extension point, new top-level extension keys could be added to the metadata:

{
    "zarr_format": 3,
    ...
    "example.offset": { "offset": [ 12 ] },
    "example.array-statistics": {
        "min": 5,
        "max": 12
    },
    "example.consolidated-metadata": {
        "must_understand": false,
        ...
    }, // optional extension
    ...
}

In this case, there would be no explicit configuration key within an extension definition; instead, all the keys of such a configuration would be in the object itself. Using an object rather than, for example, a bare array of values would allow the extension to evolve.

This would mean, however, that there are two separate types of extension definitions, i.e. {"name":"<name>", "configuration": {...}} in specialized extension points (e.g. codecs) and "<name>": {...} for other extensions.

A benefit would be that if an extension is adopted into the core spec, implementations would not need to be updated, because its metadata would not have to move out of an extensions container.

Simple extensions object

Instead of an array that holds the extension definitions, an object could alternatively be used:

{
    "zarr_format": 3,
    ...
    "extensions": {
        "example.offset": { "offset": [ 12 ] },
        "example.array-statistics": {
            "min": 5,
            "max": 12
        },
        "example.consolidated-metadata": {
            "must_understand": false,
            ...
        } // optional extension
    },
    ...
}

This alternative is similar to the top-level keys, with mostly the same implications.

This alternative would continue to reserve the top-level namespace for changes to the core spec and, therefore, reduce pollution of the top-level namespace. Downsides include that only a single use of each extension would be possible since the key is the extension name and there would be no ordering of the extensions.
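
The single-use limitation can be seen directly with a JSON parser. In the Python sketch below, two uses of the same extension survive in the array form, whereas in the object form the standard json module silently keeps only the last value for a repeated key. The documents and the reject_duplicates hook are illustrative assumptions, not part of this proposal.

```python
import json

# Array form: the same extension name may appear more than once, in order.
array_form = """
{"extensions": [
    {"name": "example.offset", "configuration": {"offset": [12]}},
    {"name": "example.offset", "configuration": {"offset": [24]}}
]}
"""

# Object form: the extension name is the key, so a repeat is a duplicate key.
object_form = """
{"extensions": {
    "example.offset": {"offset": [12]},
    "example.offset": {"offset": [24]}
}}
"""

array_entries = json.loads(array_form)["extensions"]
object_entries = json.loads(object_form)["extensions"]


def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently dropping duplicates."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in JSON object")
    return dict(pairs)
```

Passing object_pairs_hook=reject_duplicates to json.loads turns the silent loss into an error, but a writer still has no way to express two uses of one extension in the object form.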

Complex extensions object

Finally, a more complex extensions object could be defined:

{
    "zarr_format": 3,
    ...
    "extensions": {
        "version": 1,
        "contents": [
            {
                "name": "example.offset",
                "configuration": { "offset": [ 12 ] }
            },
            {
                "name": "example.array-statistics",
                "configuration": {
                    "min": 5,
                    "max": 12
                }
            },
            {
                "name": "example.consolidated-metadata",
                "must_understand": false,
                "configuration": {
                    ...
                }
            }
        ]
    },
    ...
}

This strategy combines the object strategy for extensibility with the uniformity of using a list of extension definitions, at the cost of a more complex object to parse.

Changelog

  • 2025-05-12: Migrate phase 2 of the original ZEP9

This proposal is licensed under the Apache License, Version 2.0.