Skip to content

Protocol Buffer Schema Generation

pyrmute can generate Protocol Buffer schemas for all your model versions. Protocol Buffers (protobuf) are widely used for gRPC services, microservices communication, and efficient binary serialization. This guide covers protobuf schema generation, type mapping, and integration with common protobuf tooling.

Note

Protocol Buffers use .proto files to define message structures rather than separate schema files like JSON Schema or Avro. Throughout this guide, we use "schema" to refer to these .proto definitions for consistency with the rest of pyrmute's API and documentation.

Why Protocol Buffers?

Protocol Buffers are used for:

  • gRPC services - High-performance RPC framework
  • Microservices - Language-agnostic service communication
  • Binary serialization - Compact, efficient data representation
  • API definitions - Strongly-typed service contracts

Protocol Buffers vs JSON Schema:

Feature JSON Schema Protocol Buffers
Use case API validation Service contracts, serialization
Schema evolution Manual Built-in backward compatibility
Size Larger (text) Compact (binary)
Performance Slower Faster
Ecosystem OpenAPI, REST APIs gRPC, microservices
Type safety Runtime validation Compile-time generation

Basic Protocol Buffer Schema Generation

Generate a Protocol Buffer schema for any registered model:

from pydantic import BaseModel, Field
from pyrmute import ModelManager

manager = ModelManager()


@manager.model("User", "1.0.0")
class UserV1(BaseModel):
    """User account information."""
    name: str = Field(description="User's full name")
    email: str = Field(description="User's email address")
    age: int = Field(ge=0, le=150, description="User's age in years")


# Generate Protocol Buffer schema
proto_schema = manager.get_proto_schema("User", "1.0.0", package="com.myapp")
print(proto_schema)

Output:

syntax = "proto3";

package com.myapp;

// User account information.
message User {
  // User's full name
  string name = 1;
  // User's email address
  string email = 2;
  // User's age in years
  uint32 age = 3;
}

Note that the integer field was optimized to uint32 based upon the Pydantic constraints.

Using the Generated Schema

The schema is returned as a string, ready to use:

# Get schema as string
proto_schema = manager.get_proto_schema("User", "1.0.0", package="com.myapp")

# Write to file
from pathlib import Path
Path("user.proto").write_text(proto_schema)

# Or compile directly with protoc
import subprocess
subprocess.run(
    ["protoc", "--python_out=.", "user.proto"],
    check=True
)

# Or pass to stdin for dynamic compilation
result = subprocess.run(
    ["protoc", "--python_out=.", "--descriptor_set_out=user.desc", "-"],
    input=proto_schema.encode(),
    capture_output=True
)

Type Mapping

Basic Types

Python types are automatically mapped to Protocol Buffer types:

Python Type Proto2 Type Proto3 Type
str string string
int int32 int32
float double double
bool bool bool
bytes bytes bytes
@manager.model("BasicTypes", "1.0.0")
class BasicTypesV1(BaseModel):
    name: str          # -> string
    count: int         # -> int32
    price: float       # -> double
    active: bool       # -> bool
    data: bytes        # -> bytes

Well-Known Types

Special Python types use Protocol Buffer well-known types:

Python Type Protobuf Type Import Required
datetime google.protobuf.Timestamp Yes
UUID string No
Decimal double No
from datetime import datetime
from uuid import UUID
from decimal import Decimal


@manager.model("Event", "1.0.0")
class EventV1(BaseModel):
    event_id: UUID          # -> string
    timestamp: datetime     # -> google.protobuf.Timestamp
    amount: Decimal         # -> double

Generated proto:

syntax = "proto3";

package com.myapp;

import "google/protobuf/timestamp.proto";

message Event {
  string event_id = 1;
  google.protobuf.Timestamp timestamp = 2;
  double amount = 3;
}

Collection Types

Lists and maps are supported:

@manager.model("Collections", "1.0.0")
class CollectionsV1(BaseModel):
    tags: list[str]           # -> repeated string
    scores: list[int]         # -> repeated int32
    metadata: dict[str, str]  # -> map<string, string>
    counts: dict[str, int]    # -> map<string, int32>

Generated proto:

message Collections {
  repeated string tags = 1;
  repeated int32 scores = 2;
  map<string, string> metadata = 3;
  map<string, int32> counts = 4;
}

Optional Fields

Proto3 behavior:

Optional fields use the optional keyword for explicit presence tracking:

@manager.model("User", "1.0.0")
class UserV1(BaseModel):
    name: str                 # -> string (required in Python, no label in proto3)
    email: str | None = None  # -> optional string (optional in Python)
    age: int | None = None    # -> optional int32 (optional in Python)

Generated proto3:

syntax = "proto3";

message User {
  string name = 1;
  optional string email = 2;
  optional int32 age = 3;
}

Fields without optional are implicitly optional but lack presence tracking (can't distinguish between unset and default value).

Proto2 behavior:

Proto2 uses explicit required and optional labels:

schema = manager.get_proto_schema("User", "1.0.0", use_proto3=False)

Generated proto2:

syntax = "proto2";

message User {
  required string name = 1;
  optional string email = 2;
  optional int32 age = 3;
}

Presence tracking: The optional keyword in proto3 enables field presence detection, allowing you to distinguish between a field that was explicitly set to its default value versus one that was never set.

Enum Types

Python Enums map to top-level Protocol Buffer enums:

from enum import StrEnum


class Status(StrEnum):
    PENDING = "pending"
    ACTIVE = "active"
    COMPLETED = "completed"


@manager.model("Task", "1.0.0")
class TaskV1(BaseModel):
    name: str
    status: Status

Generated proto:

syntax = "proto3";

package com.myapp;

// Status appears as a top-level enum
enum Status {
  PENDING = 0;
  ACTIVE = 1;
  COMPLETED = 2;
}

// Task references the top-level enum
message Task {
  string name = 1;
  Status status = 2;
}

Why top-level? Top-level enums can be shared across multiple messages and are easier to reference from other proto files, making them more reusable in larger service architectures.

Union Types

Union types become oneof in Protocol Buffers:

@manager.model("Notification", "1.0.0")
class NotificationV1(BaseModel):
    notification_id: str
    content: str | int  # Union type

Generated proto:

message Notification {
  string notification_id = 1;
  oneof content_value {
    // content when type is str
    string content_string = 2;
    // content when type is int
    int32 content_int32 = 3;
  }
}

Optional unions:

@manager.model("Flexible", "1.0.0")
class FlexibleV1(BaseModel):
    value: str | int | None  # Optional union

Generated proto:

message Flexible {
  oneof value_value {
    // value when type is str
    string value_string = 1;
    // value when type is int
    int32 value_int32 = 2;
  }
}

Unions with nested models:

@manager.model("CardPayment", "1.0.0")
class CardPaymentV1(BaseModel):
    card_number: str
    cvv: str


@manager.model("BankPayment", "1.0.0")
class BankPaymentV1(BaseModel):
    account_number: str
    routing_number: str


@manager.model("Payment", "1.0.0")
class PaymentV1(BaseModel):
    payment_id: str
    method: CardPaymentV1 | BankPaymentV1  # Union of models

Generated proto:

syntax = "proto3";

package com.myapp;

// Top-level payment method messages
message CardPayment {
  string card_number = 1;
  string cvv = 2;
}

message BankPayment {
  string account_number = 1;
  string routing_number = 2;
}

// Payment with oneof referencing top-level messages
message Payment {
  string payment_id = 1;
  oneof method_value {
    // method when type is CardPayment
    CardPayment method_cardpayment = 2;
    // method when type is BankPayment
    BankPayment method_bankpayment = 3;
  }
}

Note that the oneof field names use the registry names (method_cardpayment, method_bankpayment) rather than the Python class names (CardPaymentV1, BankPaymentV1).

Nested Messages

Pydantic models that reference other models become top-level messages in the proto file:

@manager.model("Address", "1.0.0")
class AddressV1(BaseModel):
    street: str
    city: str
    zip_code: str


@manager.model("User", "1.0.0")
class UserV1(BaseModel):
    name: str
    address: AddressV1

Generated proto:

syntax = "proto3";

package com.myapp;

// Address appears as a top-level message
message Address {
  string street = 1;
  string city = 2;
  string zip_code = 3;
}

// User references Address
message User {
  string name = 1;
  Address address = 2;
}

Why top-level? This makes models independently referenceable and reusable across different schemas, which is ideal for schema registries and service-to-service communication. Each model can be versioned and evolved independently.

Protocol Buffer Packages

Protocol Buffers use packages to organize schemas, similar to namespaces in other languages. Pyrmute uses a consistent package name across all versions.

# Same package for all versions
schema_v1 = manager.get_proto_schema("User", "1.0.0", package="com.mycompany")
# package: "com.mycompany"

schema_v2 = manager.get_proto_schema("User", "2.0.0", package="com.mycompany")
# package: "com.mycompany"

Best practices:

  • Use reverse domain notation: com.company.service
  • Keep packages consistent across versions
  • Use subpackages for logical grouping
  • Examples: com.acme.users, com.acme.orders, org.example.api

Note: Unlike some other systems, protobuf packages should NOT include version numbers. Versioning is handled through message names or file organization.

Proto2 vs Proto3

Choose the Protocol Buffer syntax version:

# Proto3 (recommended for new projects)
schema = manager.get_proto_schema(
    "User", "1.0.0",
    package="com.myapp",
    use_proto3=True  # Default
)

# Proto2 (for legacy systems)
schema = manager.get_proto_schema(
    "User", "1.0.0",
    package="com.myapp",
    use_proto3=False
)

Key differences:

Feature Proto2 Proto3
Required fields Supported Not supported
Default values Custom defaults Type defaults (0, "", false)
Presence tracking Optional Limited (use optional keyword)
Unknown fields Preserved Preserved
Recommendation Legacy only Modern projects

When to use proto2:

  • Maintaining existing proto2 services
  • Need explicit required fields
  • Custom default values required

When to use proto3:

  • New projects (recommended)
  • Simpler syntax
  • Better forward compatibility

Exporting Protocol Buffer Schemas

Export All Schemas

Export protobuf schemas for all registered models:

manager.dump_proto_schemas(
    "schemas/protos/",
    package="com.mycompany"
)

Creates files like:

schemas/protos/
├── User_v1_0_0.proto
├── User_v2_0_0.proto
├── Order_v1_0_0.proto
└── Product_v1_0_0.proto

Export Options

Customize the export:

manager.dump_proto_schemas(
    "schemas/protos/",
    package="com.mycompany.events",
    include_comments=True,   # Include field descriptions
    use_proto3=True          # Use proto3 syntax
)

Export Without Documentation

For production schemas without documentation overhead:

manager.dump_proto_schemas(
    "schemas/protos/",
    package="com.mycompany",
    include_comments=False  # Omit comments
)

gRPC Integration

Define a Service

from pydantic import BaseModel


@manager.model("GetUserRequest", "1.0.0")
class GetUserRequestV1(BaseModel):
    user_id: str


@manager.model("GetUserResponse", "1.0.0")
class GetUserResponseV1(BaseModel):
    user_id: str
    name: str
    email: str


# Export schemas
proto_schemas = manager.dump_proto_schemas(
    "protos/", package="com.myapp.users"
)

# Or get individual schema as string
request_schema = manager.get_proto_schema(
    "GetUserRequest", "1.0.0", package="com.myapp.users"
)

# Write to file
Path("protos/user_service.proto").write_text(request_schema)

Manually add service definition to the generated proto:

syntax = "proto3";

package com.myapp.users;

message GetUserRequest {
  string user_id = 1;
}

message GetUserResponse {
  string user_id = 1;
  string name = 2;
  string email = 3;
}

// Add this service definition
service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
}

Compile Protocol Buffers

Use protoc to generate code:

# Generate Go code
protoc --go_out=. --go-grpc_out=. protos/*.proto

# Generate Java code
protoc --java_out=. protos/*.proto

Use Generated Code

package main

import (
    "context"
    "log"

    pb "github.com/myapp/protos"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Connect to gRPC server
    conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()

    // Create client
    client := pb.NewUserServiceClient(conn)

    // Create request
    request := &pb.GetUserRequest{
        UserId: "123",
    }

    // Call service
    response, err := client.GetUser(context.Background(), request)
    if err != nil {
        log.Fatalf("GetUser failed: %v", err)
    }

    log.Printf("User: %s (%s)", response.Name, response.Email)
}

Language Interoperability

Protocol Buffers work across languages:

Python to Go

Python (producer):

@manager.model("Event", "1.0.0")
class EventV1(BaseModel):
    event_id: str
    timestamp: datetime
    data: dict[str, str]

# Get schema as string
proto_schema = manager.get_proto_schema("Event", "1.0.0", package="com.events")

# Write to file for Go to compile
Path("protos/event.proto").write_text(proto_schema)

Go (consumer):

// Generate Go code
// protoc --go_out=. protos/Event_v1_0_0.proto

import pb "github.com/myapp/protos"

event := &pb.Event{
    EventId: "evt_123",
    Timestamp: timestamppb.Now(),
    Data: map[string]string{"key": "value"},
}

Python to Java

Python:

manager.dump_proto_schemas("protos/", package="com.myapp")

Java:

# Generate Java code
protoc --java_out=src/main/java protos/*.proto

import com.myapp.EventProtos.Event;

Event event = Event.newBuilder()
    .setEventId("evt_123")
    .setTimestamp(Timestamp.newBuilder().setSeconds(System.currentTimeMillis() / 1000))
    .putData("key", "value")
    .build();

Schema Evolution Best Practices

Backward Compatible Changes

Add new fields with field numbers that don't conflict:

# Version 1
@manager.model("Product", "1.0.0")
class ProductV1(BaseModel):
    id: str
    name: str
    price: float


# Version 2: Add optional fields
@manager.model("Product", "2.0.0")
class ProductV2(BaseModel):
    id: str
    name: str
    price: float
    description: str | None = None  # Backward compatible
    category: str | None = None     # Backward compatible

Generated proto (v2):

message Product {
  string id = 1;
  string name = 2;
  double price = 3;
  optional string description = 4;  // New field
  optional string category = 5;     // New field
}

Old clients can read new messages (ignore unknown fields).

Reserved Fields

When removing fields, reserve their numbers:

# Don't do this - reuses field number
@manager.model("Product", "2.0.0")
class ProductV2(BaseModel):
    id: str
    name: str
    new_field: str  # Don't reuse the field number!

Instead, document reserved numbers in comments:

message Product {
  // reserved 3;  // Was: price (removed in v2)
  string id = 1;
  string name = 2;
  string new_field = 4;
}

Field Numbering

Best practices:

  • Never change field numbers
  • Don't reuse field numbers
  • Reserve 1-15 for frequently used fields (1-byte encoding)
  • Use 16+ for less frequent fields

Real-World Examples

Microservice API

from datetime import datetime
from enum import StrEnum
from uuid import UUID


class OrderStatus(StrEnum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


@manager.model("CreateOrderRequest", "1.0.0")
class CreateOrderRequestV1(BaseModel):
    """Request to create a new order."""
    customer_id: str
    items: list[str]
    total_amount: float


@manager.model("CreateOrderResponse", "1.0.0")
class CreateOrderResponseV1(BaseModel):
    """Response with created order details."""
    order_id: UUID
    status: OrderStatus
    created_at: datetime


# Export for gRPC service
manager.dump_proto_schemas("protos/", package="com.shop.orders")

Generated proto for CreateOrderResponse_v1_0_0.proto:

syntax = "proto3";

package com.shop.orders;

import "google/protobuf/timestamp.proto";

// OrderStatus enum is top-level
enum OrderStatus {
  PENDING = 0;
  CONFIRMED = 1;
  SHIPPED = 2;
  DELIVERED = 3;
}

// Response with created order details.
message CreateOrderResponse {
  string order_id = 1;
  OrderStatus status = 2;
  google.protobuf.Timestamp created_at = 3;
}

This self-contained schema can be registered in a schema registry as a single subject, with all dependencies (the enum) included at the top level.

API Gateway

@manager.model("ApiRequest", "1.0.0")
class ApiRequestV1(BaseModel):
    """Generic API request wrapper."""
    request_id: UUID
    timestamp: datetime
    endpoint: str
    method: str
    headers: dict[str, str]
    body: bytes | None = None


@manager.model("ApiResponse", "1.0.0")
class ApiResponseV1(BaseModel):
    """Generic API response wrapper."""
    request_id: UUID
    status_code: int
    headers: dict[str, str]
    body: bytes
    duration_ms: int

Common Patterns

Versioned Messages

# V1: Basic order
@manager.model("Order", "1.0.0")
class OrderV1(BaseModel):
    order_id: str
    total: float


# V2: Add customer info
@manager.model("Order", "2.0.0")
class OrderV2(BaseModel):
    order_id: str
    total: float
    customer_id: str | None = None
    customer_email: str | None = None


# Both versions coexist
# Clients choose which version to use

Polymorphic Messages

Use oneof for polymorphic data:

@manager.model("Notification", "1.0.0")
class NotificationV1(BaseModel):
    notification_id: str
    timestamp: datetime
    # Use union for polymorphic content
    content: str | dict[str, str]  # Text or structured

Pagination

@manager.model("ListUsersRequest", "1.0.0")
class ListUsersRequestV1(BaseModel):
    page_size: int = 50
    page_token: str | None = None


@manager.model("ListUsersResponse", "1.0.0")
class ListUsersResponseV1(BaseModel):
    users: list[dict[str, str]]  # Simplified for example
    next_page_token: str | None = None
    total_count: int

Schema Testing

Validate Generated Schemas

import subprocess


def test_proto_schema_validity() -> None:
    """Test that generated schemas are valid protobuf."""
    manager.dump_proto_schemas("test_protos/", package="com.test")

    # Validate with protoc
    result = subprocess.run(
        ["protoc", "--syntax_only", "test_protos/*.proto"],
        capture_output=True
    )

    assert result.returncode == 0, f"Invalid proto: {result.stderr}"


def test_proto_compilation() -> None:
    """Test that schemas can be compiled."""
    manager.dump_proto_schemas("test_protos/", package="com.test")

    # Compile to Python
    result = subprocess.run(
        ["protoc", "--python_out=.", "test_protos/*.proto"],
        capture_output=True
    )

    assert result.returncode == 0

Test Serialization

def test_protobuf_roundtrip() -> None:
    """Test data can be serialized and deserialized."""
    # Generate schema
    manager.dump_proto_schemas("test_protos/", package="com.test")

    # Compile
    subprocess.run(["protoc", "--python_out=.", "test_protos/User_v1_0_0.proto"])

    # Import generated code
    from test_protos import user_v1_0_0_pb2

    # Create message
    user = user_v1_0_0_pb2.User(
        name="Alice",
        email="alice@example.com",
        age=30
    )

    # Serialize
    serialized = user.SerializeToString()

    # Deserialize
    user2 = user_v1_0_0_pb2.User()
    user2.ParseFromString(serialized)

    assert user2.name == "Alice"

Troubleshooting

Import Errors

If well-known types aren't found:

# Ensure google protobuf is installed
# pip install protobuf

# When compiling, include proto path
# protoc -I=/usr/include --python_out=. protos/*.proto

Field Number Conflicts

Avoid reusing field numbers:

# Bad: Reusing field number 3
@manager.model("Product", "2.0.0")
class ProductV2(BaseModel):
    id: str        # field 1
    name: str      # field 2
    category: str  # field 3 (was price in v1) - BAD!

Compilation Errors

Check syntax version matches usage:

# If using proto3 features in proto2 file
protoc: syntax error

# Solution: Use consistent syntax
manager.get_proto_schema("Model", "1.0.0", use_proto3=True)

Comparison with JSON Schema

Feature JSON Schema Protocol Buffers
Generation get_schema() get_proto_schema()
Export dump_schemas() dump_proto_schemas()
Syntax JSON Protobuf DSL
Code generation No Yes (protoc)
Binary format No Yes
Service definitions No Yes (with manual editing)
Use case REST APIs gRPC, microservices

Best Practices

  1. Use proto3 for new projects - Modern syntax with better compatibility
  2. Keep packages consistent - Don't version package names
  3. Add field descriptions - Enable include_comments=True
  4. Never reuse field numbers - Reserve removed field numbers
  5. Test compilation - Validate schemas with protoc
  6. Version control schemas - Keep .proto files in Git
  7. Document service contracts - Add comments to generated files
  8. Use well-known types - Leverage google.protobuf.Timestamp, etc.

Next Steps

Related topics:

External resources:

API Reference: