I have come to know about opentracing and is even working on a POC with Jaeger and Spring. We have around 25+ micro services in production. I have read about it but is a bit confused as how it can be really used.
I'm thinking to use it as a troubleshooting tool to identify the root cause of a failure in the application. For this, we can search for httpStatus codes, custom tags, traceIds and application logs in JaegerUI. Also, we can find areas of bottlenecks/slowness by monitoring the traces.
What are the other usages?
Jaeger has a request sampler and I think we should not sample every request in Prod as it may have adverse impact. Is this true?
If yes, then why and what can be the impact on the application? I guess it can't be really used for troubleshooting in this case as we won't have data on every request.
What sampling configuration is recommended for Prod?
Also, how a tool like Jaeger is different from APM tools and where does it fit in? I mean you can do something similar with APM tools as well. For e.g., one can drill through a service's transaction and jump to corresponding request to other service in AppDynamics. Alerts can be put on slow transactions. One can also capture request headers/body so that they can be searched upon, etc.
There's a lot of different questions here, and some of them don't have answers without more information about your specific setup, but I'll try to give you a good overview.
Why Tracing?
You've already intuited that there are a lot of similarities between "APM" and "tracing" - the differences are fairly minimal. Distributed Tracing is a superset of capabilities marketed as APM (application performance monitoring) and RUM (real user monitoring), as it allows you to capture performance information about the work being done in your services to handle a single, logical request both at a per-service level, and at the level of an entire request (or transaction) from client to DB and back.
Trace data, like other forms of telemetry, can be aggregated and analyzed in different ways - for example, unsampled trace data can be used to generate RED (rate, error, duration) metrics for a given API endpoint or function call. Conventionally, trace data is annotated (tagged) with properties about a request or the underlying infrastructure handling a request (things like a customer identifier, or the host name of the server handling a request, or the DB partition being accessed for a given query) that allows for powerful exploratory queries in a tool like Jaeger or a commercial tracing tool.
Sampling
The overall performance impact of generating traces varies. In general, tracing libraries are designed to be fairly lightweight - although there are a lot of factors that influence this overhead, such as the amount of attributes on a span, the log events attached to it, and the request rate of a service. Companies like Google will aggressively sample due to their scale, but to be honest, sampling is more beneficial to consider from a long-term storage perspective rather than an up-front overhead perspective.
While the additional overhead per-request to create a span and transmit it to your tracing backend might be small, the cost to store trace data over time can quickly become prohibitive. In addition, most traces from most systems aren't terribly interesting. This is why dynamic and tail-based sampling approaches have become more popular. These systems move the sampling decision from an individual service layer to some external process, such as the OpenTelemetry Collector, which can analyze an entire trace and determine if it should be sampled in or out based on user-defined criteria. You could, for example, ensure that any trace where an error occurred is sampled in, while 'baseline' traces are sampled at a rate of 1%, in order to preserve important error information while giving you an idea of steady-state performance.
Proprietary APM vs. OSS
One important distinction between something like AppDynamics or New Relic and tools like Jaeger is that Jaeger does not rely on proprietary instrumentation agents in order to generate trace data. Jaeger supports OpenTelemetry, allowing you to use open source tools like the OpenTelemetry Java Automatic Instrumentation libraries, which will automatically generate spans for many popular Java frameworks and libraries, such as Spring. In addition, since OpenTelemetry is available in multiple languages with a shared data format and trace context format, you can guarantee that your traces will work properly in a polyglot environment (so, if you have Node.JS or Golang services in addition to your Java services, you could use OpenTelemetry for each language, and trace context propagation would work seamlessly between all of them).
Even more advantageous, though, is that your instrumentation is decoupled from a specific vendor or tool. You can instrument your service with OpenTelemetry and then send data to one - or more - analysis tools, both commercial and open source. This frees you from vendor lock-in, and allows you to select the best tool for the job.
If you'd like to learn more about OpenTelemetry, observability, and other topics I wrote a longer series that you can find here (look for the other 'OpenTelemetry 101' posts).