Protocol Buffer Schema Generation¶

pyrmute can generate Protocol Buffer schemas for all your model versions. Protocol Buffers (protobuf) are widely used for gRPC services, microservices communication, and efficient binary serialization. This guide covers protobuf schema generation, type mapping, and integration with common protobuf tooling.

Note

Protocol Buffers use .proto files to define message structures rather than separate schema files like JSON Schema or Avro. Throughout this guide, we use "schema" to refer to these .proto definitions for consistency with the rest of pyrmute's API and documentation.

Why Protocol Buffers?¶

Protocol Buffers are used for:

gRPC services - High-performance RPC framework
Microservices - Language-agnostic service communication
Binary serialization - Compact, efficient data representation
API definitions - Strongly-typed service contracts

Protocol Buffers vs JSON Schema:

Feature	JSON Schema	Protocol Buffers
Use case	API validation	Service contracts, serialization
Schema evolution	Manual	Built-in backward compatibility
Size	Larger (text)	Compact (binary)
Performance	Slower	Faster
Ecosystem	OpenAPI, REST APIs	gRPC, microservices
Type safety	Runtime validation	Compile-time generation

Basic Protocol Buffer Schema Generation¶

Generate a Protocol Buffer schema for any registered model:

from pydantic import BaseModel, Field
from pyrmute import ModelManager

manager = ModelManager()


@manager.model("User", "1.0.0")
class UserV1(BaseModel):
    """User account information."""
    name: str = Field(description="User's full name")
    email: str = Field(description="User's email address")
    age: int = Field(ge=0, le=150, description="User's age in years")


# Generate Protocol Buffer schema
proto_schema = manager.get_proto_schema("User", "1.0.0", package="com.myapp")
print(proto_schema)

Output:

syntax = "proto3";

package com.myapp;

// User account information.
message User {
  // User's full name
  string name = 1;
  // User's email address
  string email = 2;
  // User's age in years
  uint32 age = 3;
}

Note that age is emitted as uint32 rather than the default int32, because the Pydantic constraints (ge=0) rule out negative values.
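The exact rules pyrmute applies when narrowing integer types are not spelled out in this guide. As a rough sketch of how such a mapping could work, the hypothetical helper pick_int_type below looks only at the ge and le bounds. The thresholds it uses are assumptions for illustration, not pyrmute's implementation.

```python
from typing import Optional

UINT32_MAX = 2**32 - 1
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def pick_int_type(ge: Optional[int] = None, le: Optional[int] = None) -> str:
    """Hypothetical helper: choose a proto integer type from Pydantic-style bounds."""
    # A non-negative lower bound means an unsigned type is safe.
    if ge is not None and ge >= 0:
        if le is not None and le > UINT32_MAX:
            return "uint64"
        return "uint32"
    # Signed bounds outside the 32-bit range need a 64-bit type.
    if (ge is not None and ge < INT32_MIN) or (le is not None and le > INT32_MAX):
        return "int64"
    # No usable constraints: fall back to the documented default for int.
    return "int32"


print(pick_int_type(ge=0, le=150))  # uint32
print(pick_int_type())              # int32
```

Check the generated .proto for the type pyrmute actually picked before relying on a particular width.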

Using the Generated Schema¶

The schema is returned as a string, ready to use:

# Get schema as string
proto_schema = manager.get_proto_schema("User", "1.0.0", package="com.myapp")

# Write to file
from pathlib import Path
Path("user.proto").write_text(proto_schema)

# Or compile directly with protoc
import subprocess
subprocess.run(
    ["protoc", "--python_out=.", "user.proto"],
    check=True
)

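# protoc reads .proto files from disk rather than stdin, so for dynamic
# compilation write the schema to a temporary directory first
import tempfile
with tempfile.TemporaryDirectory() as tmp:
    proto_file = Path(tmp) / "user.proto"
    proto_file.write_text(proto_schema)
    subprocess.run(
        ["protoc", f"--proto_path={tmp}", "--python_out=.",
         "--descriptor_set_out=user.desc", str(proto_file)],
        check=True
    )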

Type Mapping¶

Basic Types¶

Python types are automatically mapped to Protocol Buffer types:

Python Type	Proto2 Type	Proto3 Type
`str`	`string`	`string`
`int`	`int32`	`int32`
`float`	`double`	`double`
`bool`	`bool`	`bool`
`bytes`	`bytes`	`bytes`

@manager.model("BasicTypes", "1.0.0")
class BasicTypesV1(BaseModel):
    name: str          # -> string
    count: int         # -> int32
    price: float       # -> double
    active: bool       # -> bool
    data: bytes        # -> bytes

Well-Known Types¶

Special Python types use Protocol Buffer well-known types:

Python Type	Protobuf Type	Import Required
`datetime`	`google.protobuf.Timestamp`	Yes
`UUID`	`string`	No
`Decimal`	`double`	No

from datetime import datetime
from uuid import UUID
from decimal import Decimal


@manager.model("Event", "1.0.0")
class EventV1(BaseModel):
    event_id: UUID          # -> string
    timestamp: datetime     # -> google.protobuf.Timestamp
    amount: Decimal         # -> double

Generated proto:

syntax = "proto3";

package com.myapp;

import "google/protobuf/timestamp.proto";

message Event {
  string event_id = 1;
  google.protobuf.Timestamp timestamp = 2;
  double amount = 3;
}
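Mapping Decimal to double trades exactness for simplicity, which matters for values such as money. The short sketch below shows the effect using only the standard library. The to_minor_units helper is a hypothetical workaround (carrying the amount as an integer field), not something pyrmute provides.

```python
from decimal import Decimal

# Exact decimal arithmetic, as the Pydantic model would hold the values.
exact = Decimal("0.10") + Decimal("0.20")

# What a consumer computes after each value has travelled as a proto double.
as_double = float(Decimal("0.10")) + float(Decimal("0.20"))

print(exact == Decimal("0.30"))  # True
print(as_double == 0.3)          # False: 0.1 and 0.2 have no exact binary form


def to_minor_units(amount: Decimal, exponent: int = 2) -> int:
    """Hypothetical helper: convert a currency amount to integer cents."""
    return int(amount.scaleb(exponent).to_integral_value())


print(to_minor_units(Decimal("19.99")))  # 1999
```

If exact values are required end to end, modelling the field as an int (minor units) or a str in the Pydantic model avoids the conversion entirely.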

Collection Types¶

Lists and maps are supported:

@manager.model("Collections", "1.0.0")
class CollectionsV1(BaseModel):
    tags: list[str]           # -> repeated string
    scores: list[int]         # -> repeated int32
    metadata: dict[str, str]  # -> map<string, string>
    counts: dict[str, int]    # -> map<string, int32>

Generated proto:

message Collections {
  repeated string tags = 1;
  repeated int32 scores = 2;
  map<string, string> metadata = 3;
  map<string, int32> counts = 4;
}

Optional Fields¶

Proto3 behavior:

Optional fields use the optional keyword for explicit presence tracking:

@manager.model("User", "1.0.0")
class UserV1(BaseModel):
    name: str                 # -> string (required in Python, no label in proto3)
    email: str | None = None  # -> optional string (optional in Python)
    age: int | None = None    # -> optional int32 (optional in Python)

Generated proto3:

syntax = "proto3";

message User {
  string name = 1;
  optional string email = 2;
  optional int32 age = 3;
}

Fields without optional are implicitly optional but lack presence tracking (can't distinguish between unset and default value).

Proto2 behavior:

Proto2 uses explicit required and optional labels:

schema = manager.get_proto_schema("User", "1.0.0", use_proto3=False)

Generated proto2:

syntax = "proto2";

message User {
  required string name = 1;
  optional string email = 2;
  optional int32 age = 3;
}

Presence tracking: The optional keyword in proto3 enables field presence detection, allowing you to distinguish between a field that was explicitly set to its default value versus one that was never set.
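To see why presence matters, it helps to look at what ends up on the wire. The following is a simplified sketch of the protobuf varint encoding for a single non-negative integer field. The function names are made up for this illustration, and a real application would use the code generated by protoc instead.

```python
from typing import Optional


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(number: int, value: Optional[int], explicit_presence: bool) -> bytes:
    """Illustrative encoder for one non-negative integer field (wire type 0)."""
    if explicit_presence:
        if value is None:  # never set: nothing is written
            return b""
    elif not value:        # implicit presence: 0 is treated the same as unset
        return b""
    tag = encode_varint((number << 3) | 0)
    return tag + encode_varint(value)


# `int32 age = 3;` set to 0 leaves no trace on the wire.
print(encode_int_field(3, 0, explicit_presence=False))  # b''
# `optional int32 age = 3;` set to 0 is written, so a reader can tell it was set.
print(encode_int_field(3, 0, explicit_presence=True))   # b'\x18\x00'
```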

Enum Types¶

Python Enums map to top-level Protocol Buffer enums:

from enum import StrEnum


class Status(StrEnum):
    PENDING = "pending"
    ACTIVE = "active"
    COMPLETED = "completed"


@manager.model("Task", "1.0.0")
class TaskV1(BaseModel):
    name: str
    status: Status

Generated proto:

syntax = "proto3";

package com.myapp;

// Status appears as a top-level enum
enum Status {
  PENDING = 0;
  ACTIVE = 1;
  COMPLETED = 2;
}

// Task references the top-level enum
message Task {
  string name = 1;
  Status status = 2;
}

Why top-level? Top-level enums can be shared across multiple messages and are easier to reference from other proto files, making them more reusable in larger service architectures.
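Judging from the output above, enum members appear to be numbered in definition order starting at 0 (proto3 requires the first enum value to be zero), and the Python string values are not carried over. A minimal sketch of that rendering, using a hypothetical enum_to_proto helper rather than pyrmute's own code:

```python
from enum import Enum
from typing import Type


class Status(str, Enum):
    PENDING = "pending"
    ACTIVE = "active"
    COMPLETED = "completed"


def enum_to_proto(enum_cls: Type[Enum]) -> str:
    """Hypothetical helper: render a Python Enum as a proto enum block."""
    lines = [f"enum {enum_cls.__name__} {{"]
    # Members are numbered by definition order; the string values are unused.
    for number, member in enumerate(enum_cls):
        lines.append(f"  {member.name} = {number};")
    lines.append("}")
    return "\n".join(lines)


print(enum_to_proto(Status))
```

If numbering does follow definition order, reordering or inserting members in the Python enum would change the numbers on the wire, so appending new members at the end is the safer habit.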

Union Types¶

Union types become oneof in Protocol Buffers:

@manager.model("Notification", "1.0.0")
class NotificationV1(BaseModel):
    notification_id: str
    content: str | int  # Union type

Generated proto:

message Notification {
  string notification_id = 1;
  oneof content_value {
    // content when type is str
    string content_string = 2;
    // content when type is int
    int32 content_int32 = 3;
  }
}

Optional unions:

@manager.model("Flexible", "1.0.0")
class FlexibleV1(BaseModel):
    value: str | int | None  # Optional union

Generated proto:

message Flexible {
  oneof value_value {
    // value when type is str
    string value_string = 1;
    // value when type is int
    int32 value_int32 = 2;
  }
}

Unions with nested models:

@manager.model("CardPayment", "1.0.0")
class CardPaymentV1(BaseModel):
    card_number: str
    cvv: str


@manager.model("BankPayment", "1.0.0")
class BankPaymentV1(BaseModel):
    account_number: str
    routing_number: str


@manager.model("Payment", "1.0.0")
class PaymentV1(BaseModel):
    payment_id: str
    method: CardPaymentV1 | BankPaymentV1  # Union of models

Generated proto:

syntax = "proto3";

package com.myapp;

// Top-level payment method messages
message CardPayment {
  string card_number = 1;
  string cvv = 2;
}

message BankPayment {
  string account_number = 1;
  string routing_number = 2;
}

// Payment with oneof referencing top-level messages
message Payment {
  string payment_id = 1;
  oneof method_value {
    // method when type is CardPayment
    CardPayment method_cardpayment = 2;
    // method when type is BankPayment
    BankPayment method_bankpayment = 3;
  }
}

Note that the oneof field names use the registry names (method_cardpayment, method_bankpayment) rather than the Python class names (CardPaymentV1, BankPaymentV1).
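When writing consumer code it can be handy to predict these names. The helpers below simply mirror the pattern visible in the generated examples (field name, underscore, lowercased proto type, and a _value suffix for the group). They are hypothetical and only as reliable as that observed pattern.

```python
def oneof_group_name(field: str) -> str:
    """Name of the oneof group, as seen in the generated examples."""
    return f"{field}_value"


def oneof_field_name(field: str, proto_type: str) -> str:
    """Name of one oneof member: the field name plus the lowercased proto type."""
    return f"{field}_{proto_type.lower()}"


print(oneof_group_name("method"))                 # method_value
print(oneof_field_name("content", "string"))      # content_string
print(oneof_field_name("method", "CardPayment"))  # method_cardpayment
```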

Nested Messages¶

Pydantic models that reference other models become top-level messages in the proto file:

@manager.model("Address", "1.0.0")
class AddressV1(BaseModel):
    street: str
    city: str
    zip_code: str


@manager.model("User", "1.0.0")
class UserV1(BaseModel):
    name: str
    address: AddressV1

Generated proto:

syntax = "proto3";

package com.myapp;

// Address appears as a top-level message
message Address {
  string street = 1;
  string city = 2;
  string zip_code = 3;
}

// User references Address
message User {
  string name = 1;
  Address address = 2;
}

Why top-level? This makes models independently referenceable and reusable across different schemas, which is ideal for schema registries and service-to-service communication. Each model can be versioned and evolved independently.

Protocol Buffer Packages¶

Protocol Buffers use packages to organize schemas, similar to namespaces in other languages. Pyrmute uses a consistent package name across all versions.

# Same package for all versions
schema_v1 = manager.get_proto_schema("User", "1.0.0", package="com.mycompany")
# package: "com.mycompany"

schema_v2 = manager.get_proto_schema("User", "2.0.0", package="com.mycompany")
# package: "com.mycompany"

Best practices:

Use reverse domain notation: com.company.service
Keep packages consistent across versions
Use subpackages for logical grouping
Examples: com.acme.users, com.acme.orders, org.example.api

Note: Unlike some other systems, protobuf packages should NOT include version numbers. Versioning is handled through message names or file organization.

Proto2 vs Proto3¶

Choose the Protocol Buffer syntax version:

# Proto3 (recommended for new projects)
schema = manager.get_proto_schema(
    "User", "1.0.0",
    package="com.myapp",
    use_proto3=True  # Default
)

# Proto2 (for legacy systems)
schema = manager.get_proto_schema(
    "User", "1.0.0",
    package="com.myapp",
    use_proto3=False
)

Key differences:

Feature	Proto2	Proto3
Required fields	Supported	Not supported
Default values	Custom defaults	Type defaults (0, "", false)
Presence tracking	All singular fields	Limited (use `optional` keyword)
Unknown fields	Preserved	Preserved
Recommendation	Legacy only	Modern projects

When to use proto2:

Maintaining existing proto2 services
Need explicit required fields
Custom default values required

When to use proto3:

New projects (recommended)
Simpler syntax
Better forward compatibility

Exporting Protocol Buffer Schemas¶

Export All Schemas¶

Export protobuf schemas for all registered models:

manager.dump_proto_schemas(
    "schemas/protos/",
    package="com.mycompany"
)

Creates files like:

schemas/protos/
├── User_v1_0_0.proto
├── User_v2_0_0.proto
├── Order_v1_0_0.proto
└── Product_v1_0_0.proto
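The file names in this listing follow a simple pattern: the model name, then the version with dots replaced by underscores. If a build script needs to locate a specific file, a small helper along these lines may be enough. Both functions are hypothetical and assume the naming shown above.

```python
from pathlib import Path


def proto_filename(name: str, version: str) -> str:
    """Hypothetical helper: reproduce the file naming seen in the listing."""
    return f"{name}_v{version.replace('.', '_')}.proto"


def proto_path(output_dir: str, name: str, version: str) -> Path:
    """Full path of the exported schema for one model version."""
    return Path(output_dir) / proto_filename(name, version)


print(proto_filename("User", "2.0.0"))  # User_v2_0_0.proto
print(proto_path("schemas/protos", "Order", "1.0.0").as_posix())
# schemas/protos/Order_v1_0_0.proto
```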

Export Options¶

Customize the export:

manager.dump_proto_schemas(
    "schemas/protos/",
    package="com.mycompany.events",
    include_comments=True,   # Include field descriptions
    use_proto3=True          # Use proto3 syntax
)

Export Without Documentation¶

For production schemas without documentation overhead:

manager.dump_proto_schemas(
    "schemas/protos/",
    package="com.mycompany",
    include_comments=False  # Omit comments
)

gRPC Integration¶

Define a Service¶

from pathlib import Path

from pydantic import BaseModel


@manager.model("GetUserRequest", "1.0.0")
class GetUserRequestV1(BaseModel):
    user_id: str


@manager.model("GetUserResponse", "1.0.0")
class GetUserResponseV1(BaseModel):
    user_id: str
    name: str
    email: str


# Export schemas
proto_schemas = manager.dump_proto_schemas(
    "protos/", package="com.myapp.users"
)

# Or get individual schema as string
request_schema = manager.get_proto_schema(
    "GetUserRequest", "1.0.0", package="com.myapp.users"
)

# Write to file
Path("protos/user_service.proto").write_text(request_schema)

Combine the generated messages into one file and add the service definition by hand:

syntax = "proto3";

package com.myapp.users;

message GetUserRequest {
  string user_id = 1;
}

message GetUserResponse {
  string user_id = 1;
  string name = 2;
  string email = 3;
}

// Add this service definition
service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
}

Compile Protocol Buffers¶

Use protoc to generate code:

# Generate Go code
protoc --go_out=. --go-grpc_out=. protos/*.proto

# Generate Java code
protoc --java_out=. protos/*.proto

Use Generated Code¶

package main

import (
    "context"
    "log"

    pb "github.com/myapp/protos"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Connect to gRPC server
    conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()

    // Create client
    client := pb.NewUserServiceClient(conn)

    // Create request
    request := &pb.GetUserRequest{
        UserId: "123",
    }

    // Call service
    response, err := client.GetUser(context.Background(), request)
    if err != nil {
        log.Fatalf("GetUser failed: %v", err)
    }

    log.Printf("User: %s (%s)", response.Name, response.Email)
}

Language Interoperability¶

Protocol Buffers work across languages:

Python to Go¶

Python (producer):

@manager.model("Event", "1.0.0")
class EventV1(BaseModel):
    event_id: str
    timestamp: datetime
    data: dict[str, str]

# Get schema as string
proto_schema = manager.get_proto_schema("Event", "1.0.0", package="com.events")

# Write to file for Go to compile
Path("protos/event.proto").write_text(proto_schema)

Go (consumer):

// Generate Go code
// protoc --go_out=. protos/event.proto

import (
    pb "github.com/myapp/protos"
    "google.golang.org/protobuf/types/known/timestamppb"
)

event := &pb.Event{
    EventId:   "evt_123",
    Timestamp: timestamppb.Now(),
    Data:      map[string]string{"key": "value"},
}

Python to Java¶

Python:

manager.dump_proto_schemas("protos/", package="com.myapp")

Java:

# Generate Java code
protoc --java_out=src/main/java protos/*.proto

import com.google.protobuf.Timestamp;
import com.myapp.EventProtos.Event;

Event event = Event.newBuilder()
    .setEventId("evt_123")
    .setTimestamp(Timestamp.newBuilder().setSeconds(System.currentTimeMillis() / 1000))
    .putData("key", "value")
    .build();

Schema Evolution Best Practices¶

Backward Compatible Changes¶

Add new fields with field numbers that don't conflict:

# Version 1
@manager.model("Product", "1.0.0")
class ProductV1(BaseModel):
    id: str
    name: str
    price: float


# Version 2: Add optional fields
@manager.model("Product", "2.0.0")
class ProductV2(BaseModel):
    id: str
    name: str
    price: float
    description: str | None = None  # Backward compatible
    category: str | None = None     # Backward compatible

Generated proto (v2):

message Product {
  string id = 1;
  string name = 2;
  double price = 3;
  optional string description = 4;  // New field
  optional string category = 5;     // New field
}

Old clients can still read new messages because protobuf parsers skip fields they do not recognize.
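The mechanism behind this is visible in the wire format: every field carries its number and wire type, so a reader can skip over anything it does not know. The sketch below hand-rolls just enough of the encoding to show this for string fields. It is an illustration with made-up function names (price is left out to keep it short), and real code would rely on the classes generated by protoc.

```python
from typing import Dict, Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string_field(number: int, text: str) -> bytes:
    """Encode one string field (wire type 2, length-delimited)."""
    data = text.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


def decode_known_strings(buf: bytes, known: Dict[int, str]) -> Dict[str, str]:
    """Decode string fields, skipping field numbers the reader does not know."""
    pos, out = 0, {}
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x7
        if wire_type != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        if number in known:
            out[known[number]] = buf[pos:pos + length].decode("utf-8")
        pos += length  # unknown fields are skipped, not rejected
    return out


# A v2 writer emits id (1), name (2) and the new description (4).
payload = (
    encode_string_field(1, "p-1")
    + encode_string_field(2, "Widget")
    + encode_string_field(4, "A useful widget")
)

# A v1 reader only knows fields 1 and 2.
print(decode_known_strings(payload, {1: "id", 2: "name"}))
# {'id': 'p-1', 'name': 'Widget'}
```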

Reserved Fields¶

When removing fields, reserve their numbers:

# Don't do this - reuses field number
@manager.model("Product", "2.0.0")
class ProductV2(BaseModel):
    id: str
    name: str
    new_field: str  # Don't reuse the field number!

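Instead, give the new field a fresh number and record the retired one. The example below notes it in a comment; in a hand-edited .proto file, the reserved statement (reserved 3;) makes protoc reject any later reuse: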

message Product {
  // reserved 3;  // Was: price (removed in v2)
  string id = 1;
  string name = 2;
  string new_field = 4;
}
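A check like this can be automated in CI. The sketch below compares two name-to-number mappings and reports numbers that changed owner. How you obtain those mappings (for example by parsing the exported .proto files) is left open, and find_reused_numbers is a hypothetical helper, not part of pyrmute.

```python
from typing import Dict, List


def find_reused_numbers(old: Dict[str, int], new: Dict[str, int]) -> List[str]:
    """Report field numbers that now belong to a differently named field."""
    old_by_number = {number: name for name, number in old.items()}
    problems = []
    for name, number in new.items():
        previous = old_by_number.get(number)
        if previous is not None and previous != name:
            problems.append(f"field {number}: was '{previous}', now '{name}'")
    return problems


v1 = {"id": 1, "name": 2, "price": 3}
bad_v2 = {"id": 1, "name": 2, "new_field": 3}   # reuses 3
good_v2 = {"id": 1, "name": 2, "new_field": 4}  # 3 stays retired

print(find_reused_numbers(v1, bad_v2))   # ["field 3: was 'price', now 'new_field'"]
print(find_reused_numbers(v1, good_v2))  # []
```

Note that this simple comparison also flags a pure rename, which is wire-compatible as long as the type stays the same, so treat its output as a prompt for review rather than a hard failure.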

Field Numbering¶

Best practices:

Never change field numbers
Don't reuse field numbers
Reserve 1-15 for frequently used fields (1-byte encoding)
Use 16+ for less frequent fields
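The 1-byte claim follows from how field tags are encoded: the tag is (field_number << 3) | wire_type, written as a varint that holds 7 bits per byte. A short calculation (tag_size is a name chosen for this example) shows where the boundaries fall:

```python
def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Number of bytes the varint-encoded field tag occupies."""
    tag = (field_number << 3) | wire_type
    size = 1
    while tag >= 0x80:  # each varint byte carries 7 bits of payload
        tag >>= 7
        size += 1
    return size


for number in (1, 15, 16, 2047, 2048):
    print(number, tag_size(number))
# 1 1
# 15 1
# 16 2
# 2047 2
# 2048 3
```

The saving is one byte per occurrence of the field, so it matters most for fields inside repeated or very frequently sent messages.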

Real-World Examples¶

Microservice API¶

from datetime import datetime
from enum import StrEnum
from uuid import UUID


class OrderStatus(StrEnum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


@manager.model("CreateOrderRequest", "1.0.0")
class CreateOrderRequestV1(BaseModel):
    """Request to create a new order."""
    customer_id: str
    items: list[str]
    total_amount: float


@manager.model("CreateOrderResponse", "1.0.0")
class CreateOrderResponseV1(BaseModel):
    """Response with created order details."""
    order_id: UUID
    status: OrderStatus
    created_at: datetime


# Export for gRPC service
manager.dump_proto_schemas("protos/", package="com.shop.orders")

Generated proto for CreateOrderResponse_v1_0_0.proto:

syntax = "proto3";

package com.shop.orders;

import "google/protobuf/timestamp.proto";

// OrderStatus enum is top-level
enum OrderStatus {
  PENDING = 0;
  CONFIRMED = 1;
  SHIPPED = 2;
  DELIVERED = 3;
}

// Response with created order details.
message CreateOrderResponse {
  string order_id = 1;
  OrderStatus status = 2;
  google.protobuf.Timestamp created_at = 3;
}

This self-contained schema can be registered in a schema registry as a single subject, with all dependencies (the enum) included at the top level.

API Gateway¶

@manager.model("ApiRequest", "1.0.0")
class ApiRequestV1(BaseModel):
    """Generic API request wrapper."""
    request_id: UUID
    timestamp: datetime
    endpoint: str
    method: str
    headers: dict[str, str]
    body: bytes | None = None


@manager.model("ApiResponse", "1.0.0")
class ApiResponseV1(BaseModel):
    """Generic API response wrapper."""
    request_id: UUID
    status_code: int
    headers: dict[str, str]
    body: bytes
    duration_ms: int

Common Patterns¶

Versioned Messages¶

# V1: Basic order
@manager.model("Order", "1.0.0")
class OrderV1(BaseModel):
    order_id: str
    total: float


# V2: Add customer info
@manager.model("Order", "2.0.0")
class OrderV2(BaseModel):
    order_id: str
    total: float
    customer_id: str | None = None
    customer_email: str | None = None


# Both versions coexist
# Clients choose which version to use

Polymorphic Messages¶

Use oneof for polymorphic data:

@manager.model("Notification", "1.0.0")
class NotificationV1(BaseModel):
    notification_id: str
    timestamp: datetime
    # Use union for polymorphic content
    content: str | dict[str, str]  # Text or structured

Pagination¶

@manager.model("ListUsersRequest", "1.0.0")
class ListUsersRequestV1(BaseModel):
    page_size: int = 50
    page_token: str | None = None


@manager.model("ListUsersResponse", "1.0.0")
class ListUsersResponseV1(BaseModel):
    users: list[dict[str, str]]  # Simplified for example
    next_page_token: str | None = None
    total_count: int

Schema Testing¶

Validate Generated Schemas¶

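import subprocess
from pathlib import Path


def test_proto_schema_validity() -> None:
    """Test that generated schemas are valid protobuf."""
    manager.dump_proto_schemas("test_protos/", package="com.test")

    # subprocess does not expand globs, so list the files explicitly and
    # check each one on its own (different versions of a model share a
    # message name). --descriptor_set_out parses the file without
    # generating language-specific code.
    for proto_file in sorted(Path("test_protos").glob("*.proto")):
        result = subprocess.run(
            ["protoc", "--descriptor_set_out=test_protos/check.desc", str(proto_file)],
            capture_output=True
        )
        assert result.returncode == 0, f"Invalid proto: {result.stderr}"


def test_proto_compilation() -> None:
    """Test that schemas can be compiled."""
    manager.dump_proto_schemas("test_protos/", package="com.test")

    # Compile each file to Python
    for proto_file in sorted(Path("test_protos").glob("*.proto")):
        result = subprocess.run(
            ["protoc", "--python_out=.", str(proto_file)],
            capture_output=True
        )
        assert result.returncode == 0, f"Compilation failed: {result.stderr}"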

Test Serialization¶

def test_protobuf_roundtrip() -> None:
    """Test data can be serialized and deserialized."""
    # Generate schema
    manager.dump_proto_schemas("test_protos/", package="com.test")

    # Compile (check=True fails the test early if protoc reports an error)
    subprocess.run(
        ["protoc", "--python_out=.", "test_protos/User_v1_0_0.proto"],
        check=True
    )

    # Import generated code (protoc keeps the case of the .proto file name)
    from test_protos import User_v1_0_0_pb2

    # Create message
    user = User_v1_0_0_pb2.User(
        name="Alice",
        email="alice@example.com",
        age=30
    )

    # Serialize
    serialized = user.SerializeToString()

    # Deserialize
    user2 = User_v1_0_0_pb2.User()
    user2.ParseFromString(serialized)

    assert user2.name == "Alice"

Troubleshooting¶

Import Errors¶

If well-known types aren't found:

# Ensure google protobuf is installed
# pip install protobuf

# When compiling, include proto path
# protoc -I=/usr/include --python_out=. protos/*.proto

Field Number Conflicts¶

Avoid reusing field numbers:

# Bad: Reusing field number 3
@manager.model("Product", "2.0.0")
class ProductV2(BaseModel):
    id: str        # field 1
    name: str      # field 2
    category: str  # field 3 (was price in v1) - BAD!

Compilation Errors¶

Check syntax version matches usage:

# If using proto3 features in proto2 file
protoc: syntax error

# Solution: Use consistent syntax
manager.get_proto_schema("Model", "1.0.0", use_proto3=True)

Comparison with JSON Schema¶

Feature	JSON Schema	Protocol Buffers
Generation	`get_schema()`	`get_proto_schema()`
Export	`dump_schemas()`	`dump_proto_schemas()`
Syntax	JSON	Protobuf DSL
Code generation	No	Yes (protoc)
Binary format	No	Yes
Service definitions	No	Yes (with manual editing)
Use case	REST APIs	gRPC, microservices

Best Practices¶

Use proto3 for new projects - Modern syntax with better compatibility
Keep packages consistent - Don't version package names
Add field descriptions - Enable include_comments=True
Never reuse field numbers - Reserve removed field numbers
Test compilation - Validate schemas with protoc
Version control schemas - Keep .proto files in Git
Document service contracts - Add comments to generated files
Use well-known types - Leverage google.protobuf.Timestamp, etc.

Next Steps¶

Related topics:

Schema Generation - JSON Schema generation
Avro Schema Generation - Apache Avro schemas

External resources:

API Reference: