I'm currently working on enhancing the robustness and clarity of communication between gRPC services and clients, especially around the concept of operation degradations. In many scenarios, operations can partially succeed or encounter non-critical issues that clients should be aware of, without necessarily treating these as outright failures.
Standard gRPC responses typically include either a success response or an error, but there's a gray area for operations that are successful with caveats (e.g., warnings about nearing quota limits, informational messages about optimization, or minor issues that don't halt the operation).
Is there an industry standard for structuring gRPC responses to include these "degradation" details in a way that's consistent across services and actionable for clients?
My Proposed Solution: I'm considering introducing a DegradationDetail message in our protobuf definitions that can encapsulate the severity and details of any degradations encountered during an operation. Here's a sketch of what that might look like:
message DegradationDetail {
  string message = 1;                    // Human-readable message describing the degradation.
  DegradationSeverity severity = 2;      // The severity of the degradation.
  repeated string affected_features = 3; // Specific features or components affected, if applicable.
}

enum DegradationSeverity {
  INFO = 0;    // Informational message; operation successful.
  WARNING = 1; // Operation successful but might require attention.
  ERROR = 2;   // Error in part of the operation; action required.
}

message OperationStatus {
  bool success = 1;                            // Overall success of the operation; when false, check the degradation details.
  repeated DegradationDetail degradations = 2; // Details of specific degradations.
}

message MyServiceResponse {
  // Other fields...
  OperationStatus operation_status_foo = N;   // Status (with any degradations) for operation "foo".
  OperationStatus operation_status_bar = N+1; // Status (with any degradations) for operation "bar".
}
This DegradationDetail could then be included in response messages as an optional field. For operations that fully succeed without any issues, the field would be absent. For operations with degradations, it provides a structured way to communicate what happened, its severity, and what parts of the operation were affected. The status would be reported per operation in the service response, only when needed.
Questions:
- Is this approach in line with best practices for gRPC services? I am, in effect, defining a company-wide policy, and I want to ensure that this pattern doesn't conflict with gRPC's design principles or error-handling mechanisms.
- How should clients best utilize this information? I'm interested in patterns or practices for clients to handle these degradation details effectively.
- Would a centralized DegradationDetail definition be beneficial? This could be defined in a common library to ensure consistency across services. However, I'm curious about the trade-offs in flexibility versus standardization.
I'm eager to hear feedback from the community on this approach, especially from those who have tackled similar challenges in their gRPC services.
Your approach does enhance the clarity of communication between gRPC services and clients. As a pseudo-code example (in Go, since it is the language I use most these days), here is roughly what populating those fields on the server side could look like.
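This is only a sketch: it assumes the proto above has been compiled into a Go package imported as pb, and the myService type, the Foo RPC, and its request type are hypothetical stand-ins for your own service.

import (
	"context"

	pb "example.com/myservice/gen" // hypothetical import path for the generated proto package
)

type myService struct{} // hypothetical server type implementing the generated service interface

// Foo performs the operation and reports non-fatal issues as degradations
// instead of failing the whole RPC.
func (s *myService) Foo(ctx context.Context, req *pb.FooRequest) (*pb.MyServiceResponse, error) {
	resp := &pb.MyServiceResponse{
		// ... populate the normal response fields ...
	}

	// The operation succeeded, but the caller is close to a quota limit,
	// so attach a WARNING-level degradation rather than returning an error.
	resp.OperationStatusFoo = &pb.OperationStatus{
		Success: true,
		Degradations: []*pb.DegradationDetail{{
			Message:          "quota is 90% consumed; requests may be throttled soon",
			Severity:         pb.DegradationSeverity_WARNING,
			AffectedFeatures: []string{"foo.writes"},
		}},
	}
	return resp, nil
}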
As for an industry standard: not that I know of.
The main alternative to your solution is the gRPC rich error model: it allows sending detailed error information using the Status message (which can carry arbitrary metadata). You could consider leveraging this for error conditions, reserving DegradationDetail for non-error conditions that still require attention.
There is also the concept of Graceful Degradation, where a component (such as a microservice) keeps working with reduced functionality when it is unable to function fully. That concept is usually discussed alongside the gRPC status codes, which go well beyond "success/failure".
I would still consider using gRPC errors, leveraging the built-in error-handling mechanism to attach detailed error information to standard error responses.
That would involve the google.rpc.Status message, which encapsulates a numeric code and a human-readable message, plus arbitrary custom detail messages. That approach could be useful when you want to communicate more nuanced error states, or additional context about failures or warnings, without modifying the primary response structure of your RPC calls.
You would define custom error detail messages using protobuf; they can look much like your DegradationDetail message, but they are carried as error metadata. In your gRPC service implementation, you can use the status package from the Go gRPC library to create and return rich errors that include DegradationDetail among the error's details. Clients receiving the error can then extract and handle the DegradationDetail from the Status details, as sketched below.
The main con of that approach: a gRPC error terminates the call with a non-OK status, so the client gets the error instead of the normal response message. That fits outright failures well, but not operations that succeed with caveats.
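A minimal sketch of that alternative, again assuming the hypothetical generated package pb; the chosen status code and messages are illustrative only.

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "example.com/myservice/gen" // hypothetical generated package
)

// Server side: wrap a DegradationDetail into a rich gRPC error.
func degradedError(detail *pb.DegradationDetail) error {
	st := status.New(codes.FailedPrecondition, "operation could not fully complete")
	withDetails, err := st.WithDetails(detail)
	if err != nil {
		// The detail could not be attached; return the bare status rather than nothing.
		return st.Err()
	}
	return withDetails.Err()
}

// Client side: pull any DegradationDetail messages out of a returned error.
func degradationsFromError(err error) []*pb.DegradationDetail {
	var details []*pb.DegradationDetail
	for _, d := range status.Convert(err).Details() {
		if dd, ok := d.(*pb.DegradationDetail); ok {
			details = append(details, dd)
		}
	}
	return details
}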
With your solution, clients can check the operation_status field of each response to determine if further attention is needed. For example, in Go:
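(Again a sketch: pb is the hypothetical generated package, and MyServiceClient/Foo stand in for your generated client and method.)

import (
	"context"
	"fmt"
	"log"

	pb "example.com/myservice/gen" // hypothetical generated package
)

func callFoo(ctx context.Context, client pb.MyServiceClient, req *pb.FooRequest) error {
	resp, err := client.Foo(ctx, req)
	if err != nil {
		// Hard failure: the usual gRPC error handling applies.
		return err
	}

	// Fully successful operations (and older servers) leave the field unset.
	st := resp.GetOperationStatusFoo()
	if st == nil {
		return nil
	}

	for _, d := range st.GetDegradations() {
		switch d.GetSeverity() {
		case pb.DegradationSeverity_INFO:
			log.Printf("foo: %s", d.GetMessage())
		case pb.DegradationSeverity_WARNING:
			// Surface it, back off, or schedule cleanup; the operation itself succeeded.
			log.Printf("foo warning: %s (affected: %v)", d.GetMessage(), d.GetAffectedFeatures())
		case pb.DegradationSeverity_ERROR:
			// Part of the operation failed: decide whether to retry only the affected features.
			return fmt.Errorf("foo partially failed: %s", d.GetMessage())
		}
	}
	return nil
}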
In theory, yes: the trade-off between flexibility and standardization usually leans towards standardization in large-scale systems. You want consistency in error handling and response formats, which significantly reduces complexity and the burden on client-side implementations.
However, make sure there is enough flexibility within the DegradationDetail message to cover the various scenarios services might encounter, without making the message too generic or cumbersome to use. And make sure the introduction of degradation details is backward compatible with existing clients. That may involve versioning your protobufs or providing default behavior for clients unaware of the DegradationDetail structure. You would also need to incorporate logging or monitoring around the use of degradation details to track and analyze their occurrence and impact.
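A small helper in that shared library could keep the logging uniform across teams (a sketch; the names are hypothetical, and a metrics counter could replace or complement the log line):

import (
	"log"

	pb "example.com/common/gen" // hypothetical shared package holding the common definitions
)

// RecordDegradations reports every degradation in one uniform, greppable format.
// Teams call it wherever they receive an OperationStatus; a per-operation,
// per-severity metrics counter could be incremented here as well.
func RecordDegradations(operation string, st *pb.OperationStatus) {
	if st == nil {
		return // fully successful operation, or a server that predates the field
	}
	for _, d := range st.GetDegradations() {
		log.Printf("degradation op=%s severity=%s affected=%v msg=%q",
			operation, d.GetSeverity(), d.GetAffectedFeatures(), d.GetMessage())
	}
}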