Skip to main content
Skip to main content

Protobuf

Not supported in ClickHouse Cloud
InputOutputAlias

Description

The Protobuf format is the Protocol Buffers format.

This format requires an external format schema, which is cached between queries.

ClickHouse supports:

  • both proto2 and proto3 syntaxes.
  • Repeated/optional/required fields.

Example usage

Basic examples

Usage examples:

SELECT * FROM test.table FORMAT Protobuf SETTINGS format_schema = 'schemafile:MessageType'
cat protobuf_messages.bin | clickhouse-client --query "INSERT INTO test.table SETTINGS format_schema='schemafile:MessageType' FORMAT Protobuf"

Where the file schemafile.proto looks like this:

syntax = "proto3";

message MessageType {
  string name = 1;
  string surname = 2;
  uint32 birthDate = 3;
  repeated string phoneNumbers = 4;
};

To find the correspondence between table columns and fields of the Protocol Buffers' message type, ClickHouse compares their names. This comparison is case-insensitive and the characters _ (underscore) and . (dot) are considered as equal. If the types of a column and a field of the Protocol Buffers' message are different, then the necessary conversion is applied.

Nested messages are supported. For example, for the field z in the following message type:

message MessageType {
  message XType {
    message YType {
      int32 z;
    };
    repeated YType y;
  };
  XType x;
};

ClickHouse tries to find a column named x.y.z (or x_y_z or X.y_Z and so on).

Nested messages are suitable for input or output of a nested data structures.

Default values defined in a protobuf schema like the one that follows are not applied, rather the table defaults are used instead of them:

syntax = "proto2";

message MessageType {
  optional int32 result_per_page = 3 [default = 10];
}

If a message contains oneof and input_format_protobuf_oneof_presence is set, ClickHouse fills column that indicates which field of oneof was found.

syntax = "proto3";

message StringOrString {
  oneof string_oneof {
    string string1 = 1;
    string string2 = 42;
  }
}
CREATE TABLE string_or_string ( string1 String, string2 String, string_oneof Enum('no'=0, 'hello' = 1, 'world' = 42))  Engine=MergeTree ORDER BY tuple();
INSERT INTO string_or_string from INFILE '$CURDIR/data_protobuf/String1' SETTINGS format_schema='$SCHEMADIR/string_or_string.proto:StringOrString' FORMAT ProtobufSingle;
SELECT * FROM string_or_string
   ┌─────────┬─────────┬──────────────┐
   │ string1 │ string2 │ string_oneof │
   ├─────────┼─────────┼──────────────┤
1. │         │ string2 │ world        │
   ├─────────┼─────────┼──────────────┤
2. │ string1 │         │ hello        │
   └─────────┴─────────┴──────────────┘

Name of the column that indicates presence must be the same as the name of oneof. Nested messages are supported (see basic-examples). Allowed types are Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64, Enum, Enum8 or Enum16. Enum (as well as Enum8 or Enum16) must contain all oneof' possible tags plus 0 to indicate absence, string representations does not matter.

The setting input_format_protobuf_oneof_presence is disabled by default

ClickHouse inputs and outputs protobuf messages in the length-delimited format. This means that before every message its length should be written as a variable width integer (varint).

See also: how to read/write length-delimited protobuf messages in popular languages.

Using autogenerated schema

If you don't have an external Protobuf schema for your data, you can still output/input data in the Protobuf format using an autogenerated schema.

For example:

SELECT * FROM test.hits format Protobuf SETTINGS format_protobuf_use_autogenerated_schema=1

In this case, ClickHouse will autogenerate the Protobuf schema according to the table structure using function structureToProtobufSchema. It will then use this schema to serialize data in the Protobuf format.

You can also read a Protobuf file with the autogenerated schema. In this case it is necessary for the file to be created using the same schema:

$ cat hits.bin | clickhouse-client --query "INSERT INTO test.hits SETTINGS format_protobuf_use_autogenerated_schema=1 FORMAT Protobuf"

The setting format_protobuf_use_autogenerated_schema is enabled by default and applies if format_schema is not set.

You can also save autogenerated schema in the file during input/output using setting output_format_schema. For example:

SELECT * FROM test.hits format Protobuf SETTINGS format_protobuf_use_autogenerated_schema=1, output_format_schema='path/to/schema/schema.proto'

In this case autogenerated Protobuf schema will be saved in file path/to/schema/schema.capnp.

Drop protobuf cache

To reload the Protobuf schema loaded from format_schema_path use the SYSTEM DROP ... FORMAT CACHE statement.

SYSTEM DROP FORMAT SCHEMA CACHE FOR Protobuf

Format settings