Geeking Out with Adriana Villela

The One Where We Geek Out on the OTel Operator with Jacob Aronoff of SNCO

Episode Summary

This week, Adriana geeks out with fellow co-worker, Jacob Aronoff. Jacob highlights his experiences in leading an internal OpenTelemetry migration from OpenTracing. He also digs into the OpenTelemetry Operator's Target Allocator, highlighting how it can be used to supercharge Prometheus. Adriana and Jacob also reflect on the inclusiveness of the OpenTelemetry community, and how it encourages contributions and questions. Finally, Jacob talks about the Open Agent Management Protocol (OpAMP), and his recent KubeCon North America talk on the subject.

Episode Notes

About our guest:

Jacob Aronoff (he/him/his) is a Staff Engineer at ServiceNow Cloud Observability, formerly Lightstep, the tech lead for the Telemetry Pipeline team, and an OpenTelemetry maintainer for the OpenTelemetry Operator project. He's spent his career in a variety of backend roles acting as a distributed systems engineer, an SRE and a DevOps professional. Jacob's focus is enabling customers to reliably send telemetry data with a focus on Kubernetes and OpenTelemetry.

Find our guest on:

Find us on:

Show Links:

Additional Links:

Transcript:

ADRIANA: Hey, y'all. Welcome to Geeking Out, the podcast about all geeky aspects of software delivery, DevOps, Observability, reliability, and everything in between. I'm your host, Adriana Villela, coming to you from Toronto, Canada. And geeking out with me today is Jacob Aronoff, who is also one of my coworkers. Welcome, Jacob.

JACOB: Hello. Very happy to be here. I'm so happy that we get to do this. I feel like we talked about this in Amsterdam, and I'm so excited that we get to make it happen.

ADRIANA: I know, right? Yeah. This is awesome. So as we start out, I'm going to do some lightning round questions. They are totally painless. No wrong answers. So are you ready?

JACOB: I'm prepared. Let's do it.

ADRIANA: Okay, cool. All right. Are you a lefty or a righty?

JACOB: I am a righty. So I always thought I was supposed to be a lefty, and my parents forced me to be a righty.

ADRIANA: Interesting. Soul of a lefty. iPhone or Android?

JACOB: iPhone. I just got the new one. USB-C all the way.

ADRIANA: I'm so jealous. I think I'm going to wait one more year because I want the iPhone...I don't like the Pro Max. It's too big. But I want the Pro.

JACOB: It's way too big.

ADRIANA: I want to wait until they upgrade the optical zoom to whatever the Pro Max offers.

JACOB: Yeah, that makes sense.

ADRIANA: Yeah. Anywho, go on. Okay. Mac, Linux, or Windows?

JACOB: Mac for sure. Big Mac boy. Whole life.

ADRIANA: Feel you. I feel you. Okay. Favorite programming language?

JACOB: I feel like Go. I mean, I'm a huge fan of Go. It used to be Swift or Elixir. Those are my two a little bit more funky choices. I used to work in Elixir, and I really loved it. Definitely one of the most fun languages I've had the chance to do. Swift, I haven't done for a few years, but there are a lot of little Easter eggs around my socials that refer to Swift a lot.

ADRIANA: That's why your social handle is get_sw1fty.

JACOB: Exactly. Yeah.

ADRIANA: Okay, I get it.

JACOB: A lot of Easter eggs.

ADRIANA: Nice.

JACOB: Still, I was the first person to ever write a Datadog SDK in Swift, and it's still on their website.

ADRIANA: Wow. That is awesome. Very nice. Very nice. Cool. Okay, next question. Dev or Ops?

JACOB: That's a really hard one. Dev. I'm just going to say dev.

ADRIANA: All right.

JACOB: Ops is fun, but you're still doing Dev if you're doing Ops. You're still Deving. You're still Deving.

ADRIANA: I like it. Especially modern Ops. Right? I mean, maybe not...well, even Bash scripting back in the day, right? Ops was more bashy, less like Terraforming.

JACOB: Yeah. Back when Ops was mostly just Jenkins scripting with Bash. That's still Dev. There's still a lot of Dev stuff in there, so it's always been like that. It's just new abstractions.

ADRIANA: Yeah, fair enough. That's a really good point. I like it. Okay, next question. JSON or YAML?

JACOB: It's just...I'm a YAML engineer. I can't deny it.

ADRIANA: Yeah, I like YAML better. No disrespect to the JSON people out there, but I don't get it. YAML forces me to do indentations, but that's okay.

JACOB: Yeah, that's all right.

ADRIANA: Yeah, cool. Two more questions. Do you prefer to consume content through video or text?

JACOB: Probably text. I love to read really long form things, especially, I don't know, I save a bunch of articles whenever I see them and they'll be like, ten minute, 20 minutes reads, and whenever I have some real free time, then I'll go through one or two of them and that is like my favorite way to consume. I probably consume more video, realistically.

ADRIANA: Oh, really?

JACOB: Yeah, I watch a lot of YouTube videos, like "How To" type things.

ADRIANA: Yeah.

JACOB: But I love to read more than I love to watch. Watching is too passive.

ADRIANA: I get too...yeah, I agree. I think that's what I find annoying about watching videos. Like, someone sends me a video link, I'm like, it better be like some short video. So if it's like an Instagram video or YouTube short, it's fine, but send me a five minute video, I'm like, I'm never going to watch it. Even if you tell me it's like the most wonderful thing in the world, I'm not going to watch it. I'm so sorry.

JACOB: Or it's like, even if you watch it, you get so distracted by another thing. It's just like I don't know.

ADRIANA: Yeah, I think the only way I can consume, quote unquote, a YouTube video is if it's audio only. So I'm like just doing chores around the house and listening to it, then it's okay, right? My brain is like it helps me focus better.

JACOB: I feel that basically you're just podcasting at that point.

ADRIANA: Yeah, exactly. Which I love me a good podcast.

JACOB: Yeah.

ADRIANA: Okay, final question. What is your superpower?

JACOB: Superpower? I have a useless superpower. I can do a noise. I can make a noise that's really...I can click with my tongue really loudly.

ADRIANA: Okay, now you have to demonstrate.

JACOB: I will, but it might disturb some people in this office. Okay.

ADRIANA: Damn.

JACOB: I don't know if that came through.

ADRIANA: It came through okay over here.

JACOB: It's really loud.

JACOB: That was like a quieter one.

JACOB: It's useful when it's like, I need to get someone's attention who knows that I can do that. And then I'll do the click, and then they'll be like, oh, there he is.

ADRIANA: Nice. I like, that. Cool. All right, now we shall get to the meaty bits, which is sweet. Let's talk OTel.

JACOB: Let's do it. I'm ready.

ADRIANA: All right. Yeah. So I guess for starters, you're involved as part of your...so, we both work at Lightstep, which I guess is now ServiceNow Cloud Observability. I guess you and I met because we both work in the OTel space, although we work in different areas of the OTel space. Why don't you tell folks what you do specifically around OTel?

JACOB: Yeah, so I sort of got started with OTel two years ago when I joined the company working on the OTel Kubernetes story and what's going on there. Basically I came from a Prometheus shop that really heavily invested in Prometheus and I had sort of seen the great stuff with Prometheus and then some of the struggles with Prometheus and I came in and I was, you know, I now work on top of a metrics backend. What's the best way to get metrics there? OTel has the OTLP format and so I wanted to figure out the best way to get Prometheus metrics into the OTLP format and then into our backend, specifically in Kubernetes and what is the best way to do that. So sort of began this journey on the operator group, which is a SIG within OTel that works on a piece of OTel code that sits within your Kubernetes cluster, within your environment to make it really easy to deploy OTel Collectors and do auto instrumentation and things like that. And then the feature I was working on was to make it so that you could really easily scrape and scale metrics collection. So that was sort of my first foray into it. And then I started contributing a lot. I became a maintainer for the project and now I just sort of work on OTel Kubernetes stuff all the time. So thinking about new features, new ways to help users run their whole environment for telemetry collection in Kubernetes, that's really the focus.

JACOB: How do we make that as easy as possible for people? There's definitely a lot to be done, but it's a really great group of people that I think think pretty deeply about this stuff and are very good at sharing and caring and not very what's the word? Nobody's really holding on to legos. Have you heard that phrase? Is that like a known phrase? Yeah.

ADRIANA: I haven't heard that expression before, but I like it.

JACOB: Everybody's happy to share. There's not really someone who's particularly unwilling to accept something. Yeah, nothing like that. It's really based on the merit of the feature, not the fact that you don't get to do it. It's a good group as a result.

ADRIANA: I really like that, and I can vouch for that too, because I've bugged you with a bunch of questions around the Operator when I was trying to understand it better. And I've also posed questions to the Operator Slack channel, and people have just generally been really nice about answering my questions, which is awesome. Because tech definitely has...I'm sure it still exists...you see Stack Overflow threads where people ask questions and then you get some asshole who's putting you down because you're a novice to the subject and you're just trying to understand it. I get none of that from the OTel community, which I love, because then it makes me unafraid to ask questions, and so it makes it easier to learn.

JACOB: Yeah, and a thing that I try to make sure of, at least with our group, is for anybody who's, like, a new contributor, I try to go really out of my way to thank them for their contribution and make sure that they're sort of set up for success with what they're doing. Like, even today, someone was asking some questions on our GitHub about some Operator features. I gave them their answers and said, if you have more questions, reach out in our Slack, happy to follow up there. And so they followed up, asked some more questions. They asked for a feature that we didn't have. I was like, oh, if you make an issue for that, we can get that on the books.

JACOB: It's not that hard. And then I was like, hey, this is actually a really easy feature. If you wanted to contribute it, I can walk you through that process. I can show you an example from someone who did something similar in the past, and let me know if you have any questions. And that's what they're going to go do now. They're going to make their first contribution. So it's something that I'm really happy to see, not just with my group but with all the groups: people are really happy to walk you through contributions and make sure that you're supported. And if there's a feature that you want, people will actually take you seriously.

JACOB: They respond to you with sincerity, not...what's the other word? They respond with sincerity, not hostility. And so there are no questions that you could ask that I've seen where someone's going to really get angry at you for asking that question. And I think that that's, like, a really nice thing. It's good to see a humble bunch and not, like, a really egotistical bunch.

ADRIANA: Yeah, I completely agree. And I think that's why people keep contributing to OpenTelemetry, which is great. Now, as a follow-up question related to OpenTelemetry, we had you on for the OTel End User Working Group for, well, two sessions: first for our Q&A session and then for OTel in Practice; we host those two sessions on a monthly basis. And you had a really cool story, actually, about migrating to OTel within the context of an observability company migrating itself to OTel. Why don't you talk a little bit about that? I think it's so cool.

JACOB: Yeah. So previously our company was on...before we had a metrics platform...we were on statsd. Like, all of our metrics were recorded via statsd. Sometimes we would rewrite them as traces, which was pretty weird, or we would have them go through a proxy so that we could aggregate them in some way and get some information out of them. So we were previously on statsd, and then we were also on a really old version of OpenTracing. This was before the OpenTracing and OpenCensus projects merged into OpenTelemetry. And so we were on that old OpenTracing version.

JACOB: And so I took on this work to migrate us to OpenTelemetry for everything. Well, metrics and traces. Logs support is still in the works, but that's the next migration. So I started this project for migrating our metrics to OpenTelemetry, at which point the metrics API was still in beta and the SDK was in alpha. And so the goal was to really help the people working on it, you know, iterate on their designs, work on performance, and really tighten up that spec. So I did that, and then I actually found a bug in our...maybe not a bug, a performance issue in the metrics code, which was a result of us having to convert from the new OTel format for attributes into the old OpenTracing...sorry, other way around...to convert from the OpenTracing attributes format to the OpenTelemetry attributes format. The reason this was a problem was because we shared this implementation between our tracing and metrics, and it meant that every time we recorded a metric, we had to do this conversion on the fly. And it doesn't sound that bad on an individual basis, but when you're recording hundreds of thousands, millions of metric points, that's a lot of conversions, and that type of thing can really add up. And after I gave some of this performance feedback to the team, I actually realized that we could do this OpenTelemetry migration for tracing as well, which would then get rid of this performance concern.

JACOB: And so in the midst of the metrics migration, I took a pause and then we began the tracing migration. The tracing migration was much easier because it was a more mature format at the time. So that process was a bit smoother. There were a few weird things here and there. You can read about that, I think online somewhere that we have documented, maybe, I think there's some blog posts.

ADRIANA: We have the recording from your OTel in Practice, OTel Q&A discussion as well.

JACOB: Yeah, cool, thanks. So we finished that migration, and we went back to the metrics migration. We got to use that performance benefit. And the OTel people actually worked on a lot of the performance recommendations that we made. So we were able to finish the metrics migration as well. And it was really neat, because I love these types of migrations. You'll see the phrase a lot: replacing the engine of a flying plane. It's like doing that in place. And that's really what it feels like. When you're dealing with hundreds of thousands of data points per second, how do you replace the telemetry collection around that? That's a pretty challenging thing for any company, not just us.

JACOB: But then when you're the vendor serving the metrics. It's like, who's watching the watcher? That type of thing. Really the most difficult part is just reorienting your brain to think about the environments correctly to be sure that when you're talking about environment A, you are sure that that's where the data should be and not somewhere else, right?

ADRIANA: Yeah.

JACOB: Because for most of these telemetry vendors, whether it's us or Datadog or New Relic, it doesn't really matter. All of them have a meta telemetry environment that's sort of the secondary place that they send the telemetry of their main environment to. So that's the thing that you're monitoring. That's what lets you do these migrations effectively as well.

ADRIANA: Yeah. So here's a question, because this is actually a really cool use case. When we talk about bringing OpenTelemetry into an organization, if you're lucky and you're starting your application from scratch, you have the luxury of factoring observability into your architecture, right? And so you can start instrumenting in OpenTelemetry right off the bat, hopefully, right? One can dream. But then you also have the so-called brownfield scenarios, right? There's the brownfield of "I have zero instrumentation," and then there's the brownfield of "I have instrumentation, but it's out of date"...or not out of date, but it's not up to date with the standard, the standard now being OpenTelemetry. And those are two really interesting conversations to have, because I think a lot of the organizations that are adopting OpenTelemetry probably fall into one of those two categories. And from talking to a lot of folks, it's interesting too, because you have this conversation where you start telling them, oh yeah, I work in OpenTelemetry. Oh yeah, OpenTracing, we use that.

ADRIANA: And I'm like, no, not the same, not really. You're having to educate them on that. But also, even if you get them sold on, like, okay, OpenTelemetry is the thing, you've now got to talk about a strategy for bringing that into the organization. And that can be very tricky. I mean, where we're at, it was an easy sell because it's like, well...

JACOB: Yeah, this is what we do, this is what we work on. We should be doing it ourselves.

ADRIANA: Yeah, exactly. So that's not even the problem. But even with that easy...I'll say easy, right? Because you're not having to deal with that hurdle. You have the hurdle of like, well, I've got some existing stuff now that I have to migrate. So one thing I'm wondering is, as you mentioned, there was some old OpenTracing stuff in place. And one of the things about OpenTelemetry is that they say they're backwards compatible with OpenTracing and OpenCensus now, which from my understanding means that if you have that stuff in place, you don't have to gut it right away.

ADRIANA: However, you probably don't want it to stay that way forever. So what do you say to folks who are in that position?

JACOB: It's a real benefit that OTel provides these bridges to these legacy formats, so that you can start using OTel and then get all of that in place. The thing that I always think about whenever doing these migrations, whether it's a service or your telemetry, it doesn't really matter: the question is, how long do you want to be in a dual state? How long do you want to be in a state where you're potentially confusing someone on call? The real crux of the issue is: always imagine yourself on call for whatever service you're changing, and someone gets paged at, like, 3:00 A.M. Do you really want someone to have to reason about where your telemetry is coming from or how it's getting generated? You don't. You really want that to be consistent. You don't want to have to ask the question, oh, is this an OpenTracing thing? Is this an OTel thing? In the same way that if you're migrating a service and you have a legacy service and a new service, and you're in the dual state for a long time and you get a page for an upstream thing that's related to both of these downstream services, it's really frustrating to have to ask the question, which of these downstream things is affecting me? Right? It'd be much easier if I could just look at the single downstream and know that's the problem. Basically, it's shaving the decision tree for...

ADRIANA: ...this thing that you're doing.

JACOB: And so anything that you can do to reduce the amount of time that you're in that dual state, removing those branches, is going to do you better in the long run. It's good that the migration path lets you do this. There's another path, which I also think is a great option: the OTel Collector probably supports whatever format you have right now. I'd be surprised if it doesn't. So rather than installing a bridge into your code, you could just send your legacy format to the Collector and have the Collector output OTLP, and then you can change your application to use OTel in whatever time frame you want and have that sent to the Collector, which already accepts OTLP. And so that'll help you actually verify that the migration worked. You're already getting OTLP.

JACOB: You don't have to do anything with that. And then once you start sending OTLP from your application, you should see no difference in what comes out...yeah, and that's a pretty verifiable thing. You could actually even use the file exporter on the OTel Collector to dump the data that you get. And then for Service A, run it with Jaeger for ten minutes and dump that data from the OTLP output, then do Service A again, but with OTLP, and dump that data for ten minutes, and then just compare what it looks like. You should see, like, a pretty minimal difference between those.
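A rough sketch of what that side-by-side check could look like as a Collector config, assuming the legacy services emit Jaeger; the receiver choice, file paths, and pipeline names here are illustrative, not the exact setup described in the episode.

```yaml
# Hedged sketch: dump the legacy run and the OTLP run to separate files,
# then diff them after each ten-minute capture.
receivers:
  jaeger:
    protocols:
      thrift_http:        # legacy SDKs sending Jaeger over HTTP (assumption)
  otlp:
    protocols:
      grpc:               # the re-instrumented OTel SDKs

exporters:
  file/legacy:
    path: /tmp/service-a-legacy.json   # hypothetical path
  file/otel:
    path: /tmp/service-a-otel.json     # hypothetical path

service:
  pipelines:
    traces/legacy:
      receivers: [jaeger]
      exporters: [file/legacy]
    traces/otel:
      receivers: [otlp]
      exporters: [file/otel]
```

Comparing the two dumps is the "pretty minimal difference" check described above.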

ADRIANA: Right.

JACOB: And that type of thing can give you so much confidence. And you can do that probably from your local environment without even needing to push it up. And so that's something that we didn't really consider as an option at the time. But had we thought of that, I definitely would have done it that way. It would have been a great option.

ADRIANA: Yeah.

JACOB: Where we could have just moved to OTel instantly and then backfill. Right. That's like a much easier path.

ADRIANA: Yeah, I agree. I mean, it's a very low-friction approach. At my old company, they were using OpenTracing in a few spots, and so the mention of moving to OTel kind of sent people into a panic. Like, we have to re-instrument. Yes, we do. But hopefully never again after that. That idea sent people into a panic, and I had the same thought as you, which was, yeah, just pump it through the Collector. You don't have to change your code right away, but with the intention of eventually changing your code.

ADRIANA: Because now, correct me if I'm wrong, but if you continue on OpenTracing, you don't get to reap the benefits that you get with the whole OTel ecosystem, right? I mean, you don't end up with the traces and metrics correlation and the traces and logs correlation or any new updates to the API or SDK, right? You're kind of stuck with whatever OpenTracing was when it froze, when it was retired, basically.

JACOB: Yeah. Which means if there are any CVEs, you're kind of like, out of luck. Which is a bad state to be in.

ADRIANA: Totally.

JACOB: It's a really bad state to be in.

ADRIANA: Yeah. Awesome. Yeah, I definitely like that. Now, going back to the OTel Operator. So you said that you're doing mostly work around the metrics portion. It's the Target Allocator specifically, right?

JACOB: That's exactly right.

ADRIANA: Yeah.

JACOB: Now it's a bit more than that.

ADRIANA: Okay.

JACOB: But back then, like, last year was basically all target allocator stuff.

ADRIANA: Okay, cool.

JACOB: I can explain it. So basically, when we started this process, someone from AWS had designed this thing called the Target Allocator. The goal of it was that you could distribute Prometheus targets. Prometheus works in targets: targets are things like IP addresses...a pod, a node, your old EC2 instance, whatever it is. You then go and scrape that instance to generate metrics. Prometheus works as a single monolith, where you have a list of targets and it scrapes those and stores that data. You have to do it that way because if you have more than one instance of Prometheus, there's no way to tell which instance should scrape which thing, and so you're just going to be duplicating those scrapes. With OTel, we have the benefit that we don't need to store those metrics, because we're just handing them off to the next thing with OTLP.

JACOB: So the Target Allocator's goal is to allow you to distribute those targets amongst a pool of Collectors. So if you have 300 targets and you have three Collectors, the Target Allocator could say, I'm going to give each Collector 100 targets evenly.

ADRIANA: Right, but you need to have 100 Collectors then to send it to...is that what that means?

JACOB: No, you would just have to have...sorry...if you have 300 targets and you have three Collectors, then it's 100 targets per collector and then you would just forward that to your destination. So it'd be like if your destination is Prometheus actually, which now accepts OTLP, you could have OTel do all of your scraping and then just send the data to Prometheus as your backend store, right? And that would be like a totally viable option.

ADRIANA: Gotcha.

JACOB: If you really wanted the ability to shard your scraping and scale how you scrape targets, that would be a pretty viable approach.
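For readers who want to see the shape of this, enabling it through the Operator looks roughly like the custom resource below; the name, replica count, backend endpoint, and API version are illustrative assumptions rather than a verbatim example from the episode.

```yaml
# Sketch of an Operator-managed Collector pool with the Target Allocator enabled.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: prom-scraper                # hypothetical name
spec:
  mode: statefulset                 # the Target Allocator expects a stable, shardable pool
  replicas: 3                       # e.g. 300 targets spread roughly 100 per Collector
  targetAllocator:
    enabled: true
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: example-app
              kubernetes_sd_configs:
                - role: pod
    exporters:
      otlp:
        endpoint: metrics-backend.example.com:4317   # any OTLP-capable backend
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]
```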

ADRIANA: Right, which Prometheus doesn't support the sharding right now, right?

JACOB: So Prometheus has experimental sharding support, but it doesn't go all the way. It can shard your scraping, but it can't figure out your querying effectively, because Prometheus is also a database. If you have three instances of Prometheus that are each scraping different targets, you'll only be able to query...you'll have to query the right instance each time, because it doesn't know how to do that communication, to ask, "Who has this metric?" At least that's my understanding of it. Maybe they've changed that, but I don't think they have.
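The experimental sharding referred to here is usually done with hash-based relabeling, along these lines; the job name and shard count are made up. Each Prometheus instance keeps only its slice of targets, but nothing on the query side knows which instance holds which series.

```yaml
# Classic hashmod sharding in plain Prometheus: this instance is shard 0 of 3.
scrape_configs:
  - job_name: sharded-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3                  # total number of Prometheus shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"                  # keep only targets hashed to shard 0
        action: keep
```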

ADRIANA: Cool, okay. Yeah, that's super interesting. And so this allows you to scrape the Prometheus metrics...I mean, basically you're scraping from wherever your source of Prometheus metrics is, right? It can be whatever, it can be coming from your infrastructure or whatever. And then this thing basically does the sharding for you, and then it'll send your metrics to a destination. The destination could be Prometheus itself, or it could be any observability backend that supports metrics, essentially.

JACOB: Yeah, yeah, exactly. Cool. And that's the real benefit. I mean, by using the Target Allocator, we can also be a little bit smarter. So the thing about Prometheus, because it's all in one, is that most of the targets that you get, you're just going to drop. The way that the scrape configs work is you get a target, which has a bunch of metadata, and then your scrape config determines whether or not you should actually get the data from that target.

ADRIANA: Got it.

JACOB: Even prior to making the request. And so usually you have to keep all of those in memory, because you're constantly scraping them and constantly asking this question: does the metadata match my scrape config? Does the metadata match my scrape config? And so forth. Whereas because we have the Target Allocator, we can actually just drop, in advance, any targets that we know the Collector won't scrape. So we only tell the Collector to process targets that it will end up scraping.

ADRIANA: Okay, so it's like a filter.

JACOB: Exactly. That's what we call it. We call it a relabel filter.

ADRIANA: Okay.

JACOB: So the real reason this is really cool, and why we added it, is that then we can also distribute targets to Collectors really evenly. We use this strategy called consistent hashing, where you hash each target and its metadata to assign it to a Collector ID. And so if you have, let's say, 500 targets, but you're really only going to end up scraping 100 of them after this filter, it would be better if you only distribute the targets that you're going to end up scraping, because the distribution is going to be more even than trying to fit everything in. It's the pigeonhole principle, right?

ADRIANA: Yeah.

JACOB: If you have three boxes and you have 500 targets, you might evenly distribute it at first, but eventually, when you go to scrape them, it might be uneven once you figure out what you're actually going to scrape.
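Extending the custom-resource sketch from earlier, the filtering and consistent-hashing behavior described here is controlled by a couple of targetAllocator fields; the field names below reflect my reading of the Operator CRD and should be checked against the current docs.

```yaml
  # Fragment of the OpenTelemetryCollector spec shown above.
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing   # hash each target onto a Collector ID
    filterStrategy: relabel-config           # drop targets that relabeling would discard
```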

ADRIANA: Right. By the time the Collector is receiving them, you've already just gotten the ones that you want, and so it can give you an even distribution of those. So then there isn't an imbalance, basically.

JACOB: Yeah, exactly.

ADRIANA: Nice. That is super cool.

JACOB: It's very clever.

ADRIANA: Every day. Yeah, that's very awesome. So is the Target Allocator only part of the OTel Operator? Is that something that's available as part of the standalone collector?

JACOB: So the Target Allocator is its own image. Like, it runs separate from the Collector binary. You could theoretically run it without the Operator. There are definitely some people that do that, but we don't support that as, like, a first-class thing. The reason why is that we do a lot of rewriting logic: in order to make this work, you have to rewrite the Collector's configuration, and you also have to rewrite the Target Allocator's configuration. It's just a bit of, like, data munging that we don't want users to have to do, because it's a little bit complicated. So we do it in the Operator for you.
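Roughly, that rewriting means the Collector's prometheus receiver ends up pointing at the Target Allocator instead of holding static scrape jobs, something like the fragment below; the service name and interval are illustrative, and the exact shape varies by Operator version.

```yaml
# Approximation of the Operator-injected receiver config on each Collector replica.
receivers:
  prometheus:
    target_allocator:
      endpoint: http://prom-scraper-targetallocator:80   # hypothetical TA service name
      interval: 30s
      collector_id: ${POD_NAME}     # each replica fetches only its own share of targets
```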

ADRIANA: Yeah.

JACOB: There are people who will take what the Operator gives you, remove the Operator, and then just run it themselves.

ADRIANA: Right.

JACOB: And that's kind of a viable option. Yeah, but that's bespoke...you'd have to do that yourself. And if you ask me a bunch of questions, I'll try to help you, but there's a certain point at which I can't help you. I don't know what you're doing.

ADRIANA: That sounds like someone's idea of, like, a fun weekend project.

JACOB: So we have a bunch of requests from people to enable the Target Allocator as part of the Helm chart, the raw Collector Helm chart. And I tried to do it, and it was so hard. It just proved so difficult to do. The config rewriting was so challenging because Helm isn't really a language. It gives you some Go templating stuff, but at a certain point, it doesn't get you all the way there.

ADRIANA: Right.

JACOB: And so I wasn't able to make it work, and I eventually decided to give up because it was too much of a time...

ADRIANA: Yeah, that makes sense.

JACOB: Which is unfortunate because people ask for it a lot.

ADRIANA: Yeah, that's interesting.

JACOB: Yeah.

ADRIANA: Now, obviously there's an OTel Operator because obviously a lot of people run the Collector in Kubernetes. Do you know, is it common for people to run collectors outside of Kubernetes? I mean, obviously, if you're not a Kubernetes shop, I would imagine that would be the use case. But how common is it? Do you know?

JACOB: I don't know. I mean, I'm sure there are a bunch of people that do it, because I'm in my little Kubernetes world, I don't hear about it that often.

ADRIANA: Yeah, fair enough. Fair enough.

JACOB: I'm pretty isolated, but there are definitely people who just run Collectors as binaries on raw EC2 instances.

ADRIANA: Yeah.

JACOB: GCS instances. People are doing it, for sure.

ADRIANA: Yeah.

JACOB: I don't know. They probably have a whole different class of problems than the one.

ADRIANA: I know we're coming up on time, but I wanted to ask you quickly...well, by the time this episode comes out, I don't know if KubeCon will have passed, but all the same, do you have anything coming up at KubeCon that you want to talk about?

JACOB: I do indeed. So one of the main projects I'm doing for the Operator right now is adding support for the OpAMP protocol, which is a new part of OpenTelemetry that gives users the ability to do remote configuration management and agent configuration and observability, sort of, with superpowers. And I'll be giving a talk with Andy Keller from observIQ on OpAMP and how it's going to make your life a lot easier to manage these pools of Collectors that you have. So I am working on this project in the Operator group that will allow you to basically understand the topology of your Collectors in your Kubernetes cluster and also remotely configure them: add in new features, push out updates, everything that basically allows your cluster's observability to be on autopilot for you.

ADRIANA: Nice. Who doesn't love that? Very cool.

JACOB: Stop thinking about it.

ADRIANA: Is that part of Observability Day, or is that part of the KubeCon, like the main conference?

JACOB: Main conference.

ADRIANA: Nice. Very nice. Yeah, very cool.

JACOB: I don't know how many people can fit in the room that I'm in, though. I thought they'd tell you that, but I guess they don't.

ADRIANA: It'll be a surprise the day of.

JACOB: It will. It'll be anywhere from five people to 500 people.

ADRIANA: I'm always nervous for these types of things. On the KubeCon schedule, people can already sign up for your talk, and you start seeing them signing up to attend. And if it's, like, a small number, you're like, oh my God. And if it's a large number, you're also like, oh my God.

JACOB: Yeah, I'm very nervous. Yeah.

ADRIANA: It's like a very big deal. But yeah, this is awesome. Very excited for your talk. Oh, the other thing that I wanted to mention: I don't know if it's going to come out by the time this comes out, but I do want to promote it, because you were on the Maintainable podcast...you recorded an episode recently.

JACOB: I did indeed. I don't think that's out yet, but definitely something to look out for, though I have no idea when that'll be out.

ADRIANA: We will find out. Yeah, I think when I recorded an episode, I want to say like, in the spring and it came out a couple of months later.

JACOB: So probably there's a backlog of editing.

ADRIANA: Yeah, exactly.

JACOB: It's a whole process.

ADRIANA: I feel you. I have a backlog of editing for this too.

JACOB: Yeah, that's just how it happens.

ADRIANA: Yeah, totally. But anyway, something to look forward to as well, so you all keep an eye out for that. Now, before we part ways, do you have any interesting pieces of advice, be it like in tech or OTel or whatever, or any hot takes that you wanted to share with folks?

JACOB: I think the thing that I always say is just do something that you enjoy. If you're looking for a job, just, like, find something...work on a project that you enjoy. Find something that's weird and fun and doesn't really matter and just brings you some joy. I think that we all sort of forget that coding can be really fun and enjoyable, and there are so many things out there that are so cool right now especially, and so many things that I think have just been forgotten out of the consciousness. I used to do a lot of coding with JavaFX in Java to do UI building and games and stuff, and I haven't done that in so long, but I had so much fun doing that. So if you're looking for a job and you don't know how to do it, my best advice is to do a project that you find very fun and interesting, and not just one that you think will play well on a résumé. Because if I'm interviewing you and you tell me about a project that you were so happy to do and really excited about, that's going to be ten times better than a project that you didn't really care about.

JACOB: Yeah, just have fun is my advice.

ADRIANA: Yeah, that is really great advice, and I couldn't agree more. Coding should be fun. It definitely puts me in a happy place when I'm working on an exciting project, when I dream up some weird thing that I want to explore, and then you learn so much and, I don't know, you get a high. The programmer's high.

JACOB: Exactly.

ADRIANA: Totally down for that. Awesome. Cool. Well, thanks so much, Jacob, for joining today. So y'all, don't forget to subscribe. Be sure to check the show notes for additional resources and to connect with us and with our guests on social media. Until next time...

JACOB: Peace out and Geek out.

ADRIANA: Geeking Out is hosted and produced by me, Adriana Villela. I also compose and perform the theme music on my trusty clarinet. Geeking Out is also produced by my daughter, Hannah Maxwell, who, incidentally, designed all of the cool graphics. Be sure to follow us on all the socials by going to Bento Me slash Geeking Out.

Episode Transcription

ADRIANA: Hey, y'all. Welcome to Geeking Out, the podcast about all geeky aspects of software delivery, DevOps, Observability, reliability, and everything in between. I'm your host, Adriana Villela. Coming to you from Toronto, Canada. And geeking out. With me today is Jacob Aronoff, who is also one of my coworkers. Welcome, Jacob.

JACOB: Hello. Very happy to be here. I'm so happy that we get to do this. I feel like we talked about this in Amsterdam, and I'm so excited that we get to make it happen.

ADRIANA: I know, right? Yeah. This is awesome. So as we start out, I'm going to do some lightning round questions. They are totally painless. No wrong answers. So are you ready?

JACOB: I'm prepared. Let's do it.

ADRIANA: Okay, cool. All right. Are you a lefty or a righty?

JACOB: I am a righty. So I always thought I was supposed to be a lefty, and my parents forced me to be a righty.

ADRIANA: Interesting. Soul of a lefty. iPhone or Android?

JACOB: iPhone. I just got the new one. USB-C all the way.

ADRIANA: I'm so jealous. I think I'm going to wait one more year because I want the iPhone...I don't like the Pro Max. It's too big. But I want the Pro.

JACOB: It's way too big.

ADRIANA: I want to wait until they upgrade the optical zoom to whatever the Pro Max offers.

JACOB: Yeah, that makes sense.

ADRIANA: Yeah. Anywho, go on. Okay. Mac, Linux, or Windows?

JACOB: Mac for sure. Big Mac boy. Whole life.

ADRIANA: Feel you. I feel you. Okay. Favorite programming language?

JACOB: I feel like Go. I mean, I'm a huge fan of Go. It used to be Swift or Elixir. Those are my two a little bit more funky choices. I used to work in Elixir, and I really loved it. Definitely one of the most fun languages I've had the chance to do. Swift, I haven't done for a few years, but there are a lot of little Easter eggs around my socials that refer to Swift a lot.

ADRIANA: That's why your social handle is get_sw1fty.

JACOB: Exactly. Yeah.

ADRIANA: Okay, I get it.

JACOB: A lot of Easter eggs.

ADRIANA: Nice.

JACOB: Still, I was the first person to ever write a Datadog SDK in Swift, and it's still on their website.

ADRIANA: Wow. That is awesome. Very nice. Very nice. Cool. Okay, next question. Dev or Ops?

JACOB: That's a really hard one. Dev. I'm just going to say dev.

ADRIANA: All right.

JACOB: Ops is fun, but you're still doing Dev if you're doing Ops. You're still Deving. You're still Deving.

ADRIANA: I like it. Especially modern Ops. Right? I mean, maybe not...well, even Bash scripting back in the day, right? Ops was more bashy, less like Terraforming.

JACOB: Yeah. Back when Ops is mostly just like Jenkins scripting with Bash. That's still Dev. There's still a lot of Dev stuff in there, so it's always been like that. It's just new abstractions.

ADRIANA: Yeah, fair enough. That's a really good point. I like it. Okay, next question. JSON or YAML?

JACOB: It's just...I'm a YAML engineer. I can't deny it.

ADRIANA: Yeah, I like YAML better. No disrespect to the JSON people out there, but I don't get it. YAML forces me to do indentations, but that's okay.

JACOB: Yeah, that's all right.

ADRIANA: Yeah, cool. Two more questions. Do you prefer to consume content through video or text?

JACOB: Probably text. I love to read really long form things, especially, I don't know, I save a bunch of articles whenever I see them and they'll be like, ten minute, 20 minutes reads, and whenever I have some real free time, then I'll go through one or two of them and that is like my favorite way to consume. I probably consume more video, realistically.

ADRIANA: Oh, really?

JACOB: Yeah, I watch a lot of YouTube videos, like "How To" type things.

ADRIANA: Yeah.

JACOB: But I love to read more than I love to watch. Watching is too passive.

ADRIANA: I get too yeah, I agree. I think that's what I find annoying about watching videos. Like, someone sends me a video link, I'm like, it better be like some short video. So if it's like an Instagram video or YouTube short, it's fine, but send me a five minute video, I'm like, I'm never going to watch it. Even if you tell me it's like the most wonderful thing in the world, I'm not going to watch it. I'm so sorry.

JACOB: Or it's like, even if you watch it, you get so distracted by another thing. It's just like I don't know.

ADRIANA: Yeah, I think the only way I can consume, quote unquote, a YouTube video is if it's audio only. So I'm like just doing chores around the house and listening to it, then it's okay, right? My brain is like it helps me focus better.

JACOB: I feel that basically you're just podcasting at that point.

ADRIANA: Yeah, exactly. Which I love me a good podcast.

JACOB: Yeah.

ADRIANA: Okay, final question. What is your superpower?

JACOB: Superpower? I have a useless superpower. I can do a noise. I can make a noise that's really I can click with my tongue really loudly.

ADRIANA: Okay, now you have to demonstrate.

JACOB: I will, but it might disturb some people in this office. Okay.

ADRIANA: Damn.

JACOB: I don't know if that came through.

ADRIANA: It came through okay over here.

JACOB: It's really loud.

JACOB: That was like a quieter one.

JACOB: It's useful when it's like, I need to get someone's attention who knows that I can do that. And then I'll do the click, and then they'll be like, oh, there he is.

ADRIANA: Nice. I like, that. Cool. All right, now we shall get to the meaty bits, which is sweet. Let's talk OTel.

JACOB: Let's do it. I'm ready.

ADRIANA: All right. Yeah. So I guess for starters you're involved as part of your so we both work at Lightstep, which I guess is now ServiceNow Cloud Observability. I guess you and I met because we both work in the OTel space, although we work in different areas of the OTel space. Why don't you tell folks what you do specifically around OTel?

JACOB: Yeah, so I sort of got started with OTel two years ago when I joined the company working on the OTel Kubernetes story and what's going on there. Basically I came from a Prometheus shop that really heavily invested in Prometheus and I had sort of seen the great stuff with Prometheus and then some of the struggles with Prometheus and I came in and I was, you know, I now work on top of a metrics backend. What's the best way to get metrics there? OTel has the OTLP format and so I wanted to figure out the best way to get Prometheus metrics into the OTLP format and then into our backend, specifically in Kubernetes and what is the best way to do that. So sort of began this journey on the operator group, which is a SIG within OTel that works on a piece of OTel code that sits within your Kubernetes cluster, within your environment to make it really easy to deploy OTel Collectors and do auto instrumentation and things like that. And then the feature I was working on was to make it so that you could really easily scrape and scale metrics collection. So that was sort of my first foray into it. And then I started contributing a lot. I became a maintainer for the project and now I just sort of work on OTel Kubernetes stuff all the time. So thinking about new features, new ways to help users run their whole environment for telemetry collection in Kubernetes, that's really the focus.

JACOB: How do we make that as easy as possible for people? There's definitely a lot to be done, but it's a really great group of people that I think think pretty deeply about this stuff and are very good at sharing and caring and not very what's the word? Nobody's really holding on to legos. Have you heard that phrase? Is that like a known phrase? Yeah.

ADRIANA: I haven't heard that expression before, but I like it.

JACOB: Everybody's happy to share. There's not really someone who's particularly unwilling to accept something. Yeah, nothing like that. It's really based on the merit of the feature, not the fact that you don't get to do it nice. It's a good group as a result.

ADRIANA: I really like that and I can vouch for that too because I've bugged you with a bunch of questions around the operator when I was trying to understand it better. And I've also posed questions to the operator Slack Channel and people have just generally been really nice about answering my questions, which is awesome because I think definitely tech has, I would say. I'm sure it still exists. But you see stack overflows where people ask questions and then you get some asshole who's putting you down because you're a novice to the subject and you're just trying to understand it. I get none of that from the Otel community, which I love because then it makes me unafraid to ask questions and so it makes it easier to learn.

JACOB: Yeah, and a thing that I try to make sure of, at least with our group, is for anybody who's like a new contributor. I try to go really out of my way to thank them for their contribution and make sure that they're sort of set up for success with what they're doing. Like, even today, someone was asking some questions on our GitHub about some operator features. I gave them their answers and they said, if you have more questions, reach out in our slack. Happy to follow up there. And so they followed up, asked some more questions. They asked for a feature that we didn't have. I was like, oh, if you make an issue for that, we can get that on the books.

JACOB: It's not that hard. And then I was like, hey, this is actually really easy feature. If you wanted to contribute it, I can walk you through that process. I can show you an example of, like, here's an example that you can look at for someone who did something similar in the past and let me know if you have any questions. And that's what they're going to go do now. They're going to make their first contribution. So it's something that I'm really happy to see as not just with my group, but like, all the groups, people are really happy to walk you through contributions and make sure that you're supported. And if there's a feature that you want, people will actually take you seriously.

JACOB: They respond to you with sincerity, not what's the other word? They respond to you with sincerity, not hostility. And so there are no questions that you could ask that I've seen where someone's going to really get angry at you for asking that question. And I think that that's, like, a really nice thing. It's good to see a humble bunch and not like, a really egotistical bunch.

ADRIANA: Yeah, I completely agree. And I think that's why people keep contributing to OpenTelemetry, which is great. Now, as a follow up question related to OpenTelemetry, we had you on for the OTel End User Working Group for, well, two sessions. So first for our Q&A session and our OTel in Practice, which we host those two sessions on a monthly basis. And you had a really cool story, actually, about migrating to OTel within the context of an observability company migrating itself to OTel. And why don't you talk a little bit about that? I think it's so cool.

JACOB: Yeah. So previously our company was on...before we had a metrics platform...we were on stated. Like, all of our metrics were recorded via statsd. Sometimes we would rewrite them in traces, which was pretty weird, or we would have them go through a proxy so that we could aggregate them in some way and get some information out of them. So we were previously on the statsd, and then we were also on a really old version of OpenTracing. This was before the OpenTracing and OpenCensus projects merged into OpenTelemetry. And so we were on that old OpenTracing version.

JACOB: And so I took on this work to migrate us to OpenTelemetry for everything. Well, metrics and traces. Logs support is still in the works, but that's the next migration. But so I started this project for migrating our metrics to OpenTelemetry, at which point the metrics SDK was still in beta, or the metrics API was still in beta, the SDK was in alpha. And so the goal was to really help the people on the, you know, iterate on their designs, work on performance and really tighten up that spec. So I did that, and then I actually found a bug in our maybe not a bug, a performance issue in the metrics code, which was a result of us having to convert from the new OTel format for attributes into the old OpenTracing sorry, other way around to convert from the OpenTracing attributes format to the OpenTelemetry attributes format. The reason this was a problem was because we shared this implementation between our tracing and metrics, and it meant that every time we recorded a metric, we had to do this conversion on the fly. And it doesn't sound that bad on an individual basis, but when you're recording hundreds of thousands, millions of metric points, that's a lot of conversions and that type of thing can really add up totally. And after I gave some of this performance feedback to the team, I actually realized that we could do this OpenTelemetry migration for tracing as well, which would then get rid of this performance concern.

JACOB: And so in the midst of the metrics migration, I took a pause and then we began the tracing migration. The tracing migration was much easier because it was a more mature format at the time. So that process was a bit smoother. There were a few weird things here and there. You can read about that, I think online somewhere that we have documented, maybe, I think there's some blog posts.

ADRIANA: We have the recording from your OTel in Practice, OTel Q&A discussion as well.

JACOB: Yeah, cool, thanks. But so we finished that migration, we went back to the metrics migration. We got to use that performance benefit. And the OTel people actually worked on a lot of the performance recommendations that we made. So we were able to finish the metrics migration as well. And so it was really neat because I love these types of migrations, because you're really just like, you'll see the phrase a lot, replacing the engine of a flying plane. It's like doing that in place. And that's really what it feels like sometimes when you're dealing with hundreds of thousands of data points per second, how do you replace your telemetry collection about that? That's a pretty challenging thing for any company, not just us.

JACOB: But then when you're the vendor serving the metrics. It's like, who's watching the watcher? That type of thing. Really the most difficult part is just reorienting your brain to think about the environments correctly to be sure that when you're talking about environment A, you are sure that that's where the data should be and not somewhere else, right?

ADRIANA: Yeah.

JACOB: Because for most of these telemetry vendors, whether it's us or Datadog or New Relic, it doesn't really matter. All of them have a meta telemetry environment that's sort of the secondary place that they send the telemetry of their main environment to. So that's the thing that you're monitoring. That's what lets you do these migrations effectively as well.

ADRIANA: Yeah. So here's a question because this is actually like a really cool use case, because when we talk about bringing in OpenTelemetry to an organization, if you're lucky and you're starting out your application from scratch, you have the luxury of factoring observability into your architecture, right? And so you can start instrumenting in OpenTelemetry right off the bat, hopefully, right? One can dream. But then you also have the so called brown field scenarios, right, where it's brownfield. I have zero instrumentation and then there's the brownfield of like, I have instrumentation, but it's out of date. And I think that's something or not out of date, but it's not up to date with a standard, which now like the standard being OpenTelemetry. And so those are two really interesting conversations to have because I think a lot of the organizations that are adopting OpenTelemetry probably fall into one of those two categories. And from talking to a lot of folks, it's interesting too, because you have this conversation of like, you start telling them, oh yeah, I work in OpenTelemetry. Oh yeah, OpenTracing, we use that.

ADRIANA: And I'm like, no, not the same, not really. You're having to educate them on that. But folks are also like, even if you get them sold on, like, okay, OpenTelemetry is the thing you got to now talk about a strategy for bringing that into the organization. And that can be very tricky. I mean, where we're at, it was an easy sell because it's like, well.

JACOB: Yeah, this is what we do, this is what we work on. We should be doing it ourselves.

ADRIANA: Yeah, exactly. So that's not even the problem. But even with that easy...I'll say easy, right? Because you're not having to deal with that hurdle. You have the hurdle of like, well, I've got some existing stuff now that I have to migrate. So one thing I'm wondering is, as you mentioned, there was some old OpenTracing stuff in place. And one of the things about OpenTelemetry is that they say they're backwards compatible with OpenTracing, OpenCensus. Now, which from my understanding means that if you have that stuff in place, you don't have to gut it right away.

ADRIANA: However, you probably don't want it to stay that way forever. So what do you say to folks who are in that position?

JACOB: A real it's a benefit that OTel provides these bridges to these legacy formats so that you can start using OTel and then get all of that in place. The thing that I always think about whenever doing these migrations, whether it's like a service, your telemetry, it doesn't really matter. The question is, how long do you want to be in a dual state? How long do you want to be in a state where you're potentially confusing someone on call? It's like the real crux of the issue is it's like always imagine yourself on call for whatever service you're changing, and someone gets paged at, like, 3:00 A.m.. Do you really want someone to have to reason about where your telemetry is coming from or how it's getting generated? You don't you really want that to be consistent. You don't want to have to ask the question, oh, is this like an OpenTracing thing? Is this an OTel thing? In the same way that if you're migrating a service and you have legacy service and new service, if you're in the dual state for a long time and you get a page for an upstream thing that's related to both of these downstream services, it's really frustrating to have to ask the question, which of these downstream things is affecting me? Right? Yeah, it'd be much easier if it was just I look at the single downstream, and I know that's the problem. Basically, it's shaving the decision tree for.

ADRIANA: This that you're doing.

JACOB: And so anything that you can do to remove the amount of time that you're in that dual state, removing those branches is going to do you better in the long run. The migration path is good that you can do this. There's another path, which I also think is a great option, where the OTel Collector probably supports whatever format you have right now. I'd be surprised if it doesn't. What you could do is just send rather than installing a bridge into your code, you could just send your legacy format to the Collector and have the Collector output, and then you can change your application to use OTel in whatever time frame you want, and then just have that sent to the collector, which already accepts OTLP. Yeah, right. And so that'll help you actually verify that the migration worked. You're already getting OTLP.

JACOB: You don't have to do anything with that. And then once you start sending OTLP from your application, you should see no difference in what's yeah, and that's a pretty verifiable thing. You could actually even use the file exporter on the OTel Collector to actually dump the data that you get. And then for Service A, run it with Jaeger for ten minutes, dump that data with the OTLP out, and then do Service A again, but with OTLP, dump that data for ten minutes, and then just see what it looks like, understand that you should see, like, a pretty minimal difference between those.

ADRIANA: Right.

JACOB: And that type of thing can give you so much confidence. And you can do that probably from your local environment without even needing to push it up. And so that's something that we didn't really consider as an option at the time. But had we thought of that, I definitely would have done it that way. It would have been a great option.

ADRIANA: Yeah.

JACOB: Where we could have just moved to OTel instantly and then backfill. Right. That's like a much easier path.

ADRIANA: Yeah, I agree. I mean, it's a very low-friction approach. At my old company, they were using OpenTracing in a few spots, and so the mention of moving to OTel kind of sent people into a panic. Like, we have to re-instrument. Yes, we do. But hopefully never again after. But that idea sent people into a panic, and I had the same thought as you, which was like, yeah, just pump it through the Collector. You don't have to change your code right away, but with the intention of eventually changing your code.

ADRIANA: Because now, correct me if I'm wrong, but if you continue on OpenTracing, you don't get to reap the benefits that you get with the whole OTel ecosystem, right? I mean, you don't end up with the traces and metrics correlation and the traces and logs correlation or any new updates to the API or SDK, right? You're kind of stuck with whatever OpenTracing was when it froze, when it was retired, basically.

JACOB: Yeah. Which means if there are any CVEs, you're kind of like, out of luck. Which is a bad state to be in.

ADRIANA: Totally.

JACOB: It's a really bad state to be in.

ADRIANA: Yeah. Awesome. Yeah, I definitely like that. Now, going back to the OTel Operator. So you said that you're doing mostly work around the metrics portion. It's the Target Allocator specifically, right?

JACOB: That's exactly right.

ADRIANA: Yeah.

JACOB: Now it's a bit more than that.

ADRIANA: Okay.

JACOB: But back then, like, last year was basically all target allocator stuff.

ADRIANA: Okay, cool.

JACOB: I can explain it. So basically when we started this process, someone from AWS had designed this thing called the Target Allocator. The goal of it was that you could distribute Prometheus targets. Targets are things that have, like, an IP address: a pod, a node, your EC2 instance, whatever it is. You then go and scrape that instance to generate metrics. Prometheus works as a single monolith: you have a list of targets, and it scrapes those and stores that data. It has to work that way because if you have more than one instance of Prometheus, there's no way to tell which instance should scrape which thing, and so you're just going to be duplicating those scrapes. With OTel, we have the benefit that we don't need to store those metrics, because we're just handing them off to the next thing with OTLP.

JACOB: So the Target Allocator's goal is to allow you to distribute those targets amongst a pool of Collectors. So if you have 300 targets and you have three Collectors, the Target Allocator could say, I'm going to give each Collector 100 targets evenly.

ADRIANA: Right, but you need to have 100 Collectors then to send it to...is that what that means?

JACOB: No, you would just have to have...sorry...if you have 300 targets and you have three Collectors, then it's 100 targets per Collector, and then you would just forward that to your destination. So if your destination is Prometheus, actually, which now accepts OTLP, you could have OTel do all of your scraping and then just send the data to Prometheus as your backend store, right? And that would be a totally viable option.

ADRIANA: Gotcha.

JACOB: If you really wanted the ability to shard your scraping and scale how you scrape targets, that would be a pretty viable approach.
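(For illustration, a minimal sketch of what that setup could look like as the Operator's OpenTelemetryCollector resource, assuming the v1beta1 API; the job name, backend endpoint, and replica count are hypothetical.)

```yaml
# Hypothetical OpenTelemetryCollector resource: three Collectors scrape
# Prometheus targets handed out by the Target Allocator, then forward
# the metrics over OTLP to a backend.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: prom-scrapers
spec:
  mode: statefulset
  replicas: 3
  targetAllocator:
    enabled: true
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: example-app            # hypothetical scrape job
              kubernetes_sd_configs:
                - role: pod
    exporters:
      otlphttp:
        endpoint: https://metrics.example.com  # hypothetical backend
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlphttp]
```

The allocator splits the discovered targets across the Collector pods, and each pod forwards whatever it scrapes over OTLP to the configured backend.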

ADRIANA: Right, because Prometheus doesn't support that sharding right now, right?

JACOB: So Prometheus has experimental sharding support, but it doesn't go all the way: it can shard your scraping, but it can't figure out your querying effectively, because Prometheus is also a database. If you have three instances of Prometheus that are each scraping different targets, you'll have to query the right instance each time, because they don't know how to do that communication...to ask, "Who has this metric?" At least that's my understanding of it. Maybe they've changed that, but I don't think they have.

ADRIANA: Cool, okay. Yeah, that's super interesting. And so this allows you to scrape Prometheus metrics...I mean, basically you're scraping from wherever your source of Prometheus metrics is, right? It can be whatever; it can be coming from your infrastructure or whatever. And then this thing basically does the sharding for you, and then it'll send your metrics to a destination. The destination could be Prometheus itself, or it could be any observability backend that supports metrics, essentially.

JACOB: Yeah, yeah, exactly. Cool. And that's the real benefit. I mean, by using the Target Allocator, we also open up the ability to be a little bit smarter. The thing with Prometheus, because it's all in one, is that most of the targets that you get, you're just going to drop. The way that the scrape configs work is you get a target, which has a bunch of metadata, and then your scrape config determines whether or not you should actually get the data from that target.

ADRIANA: Got it.

JACOB: Even prior to making the request. And so usually you have to keep all of those in memory, because you're constantly scraping them and you're constantly asking this question: does the metadata match my scrape config? Does the metadata match my scrape config? And so forth. Whereas because we have the Target Allocator, we can actually just drop, in advance, any targets that we know the Collector won't scrape. So we only tell the Collector to process targets that it will end up scraping.

ADRIANA: Okay, so it's like a filter.

JACOB: Exactly. That's what we call it. We call it a relabel filter.
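(As a rough example of the kind of rule that filter evaluates, here's a hypothetical Prometheus scrape config; the job name and label are made up.)

```yaml
# Hypothetical scrape config: keep only pods labeled team=payments.
# With relabel filtering, targets this rule would drop are never
# handed to a Collector in the first place.
scrape_configs:
  - job_name: payments-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_team]
        regex: payments
        action: keep
```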

ADRIANA: Okay.

JACOB: So the real reason this is really cool, and why we added it in, is that we can then also distribute targets to Collectors really evenly. We use this strategy called consistent hashing, where you hash each target and its metadata to assign it to a Collector ID. And so if you have, let's say, 500 targets, but you're really only going to end up scraping 100 of them after this filter, it would be better if you only distribute the targets that you're going to end up scraping, because the distribution is going to be more even rather than trying to fit everything in. It's the pigeonhole principle, right?

ADRIANA: Yeah.

JACOB: If you have three boxes and you have 500 targets, you might evenly distribute it at first, but eventually, when you go to scrape them, it might be uneven once you figure out what you're actually going to scrape.

ADRIANA: Right. By the time the Collector is receiving them, you've already just gotten the ones that you want, and so it can give you an even distribution of those. So then there isn't an imbalance, basically.

JACOB: Yeah, exactly.
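(In the Operator's CR, those two behaviors map roughly to a couple of fields on the targetAllocator spec; the field names below reflect my reading of the current Operator and may vary by version.)

```yaml
# Hypothetical excerpt of an OpenTelemetryCollector spec: filter targets
# up front with relabel configs, then spread the remainder across
# Collectors using consistent hashing.
spec:
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
```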

ADRIANA: Nice. That is super cool.

JACOB: It's very clever.

ADRIANA: Yeah, that's very awesome. So is the Target Allocator only part of the OTel Operator? Or is it something that's available as part of the standalone Collector?

JACOB: So the Target Allocator is its own image. Like, it runs separate from the Collector binary. You could theoretically run it without the Operator. There are definitely some people that do that, but we don't support that as, like, a first-class thing. The reason why is that we do a lot of logic to rewrite configuration: in order to make this work, you have to rewrite the Collector's configuration, and you also have to rewrite the Target Allocator's configuration. It's just a bit of, like, data munging that we don't want users to have to do, just because it's a little bit complicated. So we do it in the Operator for you.

ADRIANA: Yeah.

JACOB: There are people who will take what the Operator gives you, remove the Operator, and then just run it themselves.

ADRIANA: Right.

JACOB: And that's kind of a viable option. Yeah, but that's bespoke; you'd have to do that yourself. And if you ask me a bunch of questions, I'll try to help you, but there's a certain point at which I can't help you. I don't know what you're doing.
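(To give a sense of the rewriting involved: when targets come from a Target Allocator, the Collector's prometheus receiver gets pointed at the allocator instead of carrying the scrape jobs itself. A sketch of what you'd wire up by hand if you ran it without the Operator; the service name and environment variable are hypothetical.)

```yaml
# Hypothetical Collector-side config when pulling jobs from a standalone
# Target Allocator (this is the piece the Operator normally generates).
receivers:
  prometheus:
    target_allocator:
      endpoint: http://target-allocator:80   # hypothetical allocator service name
      interval: 30s
      collector_id: ${env:POD_NAME}          # each Collector identifies itself
    config:
      scrape_configs: []                     # actual jobs come from the allocator
```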

ADRIANA: That sounds like someone's idea of, like, a fun weekend project.

JACOB: So we have a bunch of requests from people to enable the Target Allocator as part of the Helm chart, the raw Collector Helm chart. And I tried to do it, and it was so hard. It just proved so difficult to do. The config rewriting was so challenging, because Helm isn't really a language. It gives you some Go templating stuff, but at a certain point, it doesn't get you all the way there.

ADRIANA: Right.

JACOB: And so I wasn't able to make it work, and I eventually decided to give up because it was taking too much time.

ADRIANA: Yeah, that makes sense.

JACOB: Which is unfortunate because people ask for it a lot.

ADRIANA: Yeah, that's interesting.

JACOB: Yeah.

ADRIANA: Now, obviously there's an OTel Operator because obviously a lot of people run the Collector in Kubernetes. Do you know, is it common for people to run collectors outside of Kubernetes? I mean, obviously, if you're not a Kubernetes shop, I would imagine that would be the use case. But how common is it? Do you know?

JACOB: I don't know. I mean, I'm sure there are a bunch of people that do it, but because I'm in my little Kubernetes world, I don't hear about it that often.

ADRIANA: Yeah, fair enough. Fair enough.

JACOB: I'm pretty isolated, but there are definitely people who just run Collectors as binaries on raw EC2 instances.

ADRIANA: Yeah.

JACOB: GCE instances. People are doing it, for sure.

ADRIANA: Yeah.

JACOB: I don't know. They probably have a whole different class of problems than the ones I deal with.

ADRIANA: I know we're coming up on time, but I wanted to ask you quickly. Well, by the time this episode comes out, I don't know if KubeCon will have passed, but all the same, do you have anything coming up at KubeCon that you want to talk about?

JACOB: I do indeed. So one of the main projects I'm doing for the Operator right now is adding support for the OpAMP protocol, which is a new part of OpenTelemetry that gives users the ability to do remote configuration management and agent configuration and Observability, sort of, with superpowers. And I'll be giving a talk with Andy Keller from observIQ on OpAMP and how it's going to make it a lot easier to manage these pools of Collectors that you have. So I am working on this project in the Operator group that will allow you to basically understand the topology of your Collectors in your Kubernetes cluster and also remotely configure them: add in new features, push out updates, everything that basically allows your cluster's observability to be on autopilot for you.
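(As a very rough sketch of the shape this takes on the Collector side today: an OpAMP client can run as an extension that connects out to a management server. The endpoint below is hypothetical, and the exact fields depend on the Collector version.)

```yaml
# Hypothetical: a Collector running the opamp extension so it can report
# to, and eventually take configuration from, an OpAMP server.
extensions:
  opamp:
    server:
      ws:
        endpoint: wss://opamp.example.com/v1/opamp   # hypothetical management endpoint

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}

service:
  extensions: [opamp]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```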

ADRIANA: Nice. Who doesn't love that? Very cool.

JACOB: You get to stop thinking about it.

ADRIANA: Is that part of Observability Day, or is that part of the KubeCon, like the main conference?

JACOB: Main conference.

ADRIANA: Nice. Very nice. Yeah, very cool.

JACOB: I don't know how many people can fit in the room that I'm in, though. I thought they'd tell you that, but I guess they don't.

ADRIANA: It'll be a surprise the day of.

JACOB: It will. It'll be anywhere from five people to 500 people.

ADRIANA: I'm always nervous for these types of things. I think on the KubeCon schedule, people can already sign up for your talk, and you start seeing people signing up to attend. And if it's, like, a small number, you're like, oh my God. And if it's a large number, you're also like, oh my God.

JACOB: Yeah, I'm very nervous. Yeah.

ADRIANA: It is, like, a very big deal. But yeah, this is awesome. Very excited for your talk. Oh, the other thing that I wanted to mention: I don't know if it's going to come out by the time this episode comes out, but I do want to promote it, because you were on the Maintainable podcast; you recorded an episode recently.

JACOB: I did indeed. I don't think that's out yet, but definitely something to look out for, though I have no idea when that'll be out.

ADRIANA: We will find out. Yeah, I recorded an episode, I want to say, like, in the spring, and it came out a couple of months later.

JACOB: So probably there's a backlog of editing.

ADRIANA: Yeah, exactly.

JACOB: It's a whole process.

ADRIANA: I feel you. I have a backlog of editing for this too.

JACOB: Yeah, that's just how it happens.

ADRIANA: Yeah, totally. But anyway, something to look forward to as well, so you all keep an eye out for that. Now, before we part ways, do you have any interesting pieces of advice, be it like in tech or OTel or whatever, or any hot takes that you wanted to share with folks?

JACOB: I think the thing that I always say is just do something that you enjoy. If you're looking for a job, just, like, find something...work on a project that you enjoy. Find something that's weird and fun and doesn't really matter and just brings you some joy. I think that we all sort of forget that coding can be really fun and enjoyable, and there are so many things out there that are so cool right now, especially. And there are so many things that I think have been forgotten, just dropped out of the consciousness. I used to do a lot of coding with JavaFX and Java to do UI building and games and stuff, and I haven't done that in so long, but I had so much fun doing that. So if you're looking for a job and you don't know how to do it, my best advice is to do a project that you find very fun and interesting, and not just one that you think will play well on a résumé. Because if I'm interviewing you and you tell me about a project that you were so happy to do and really excited about, that's going to be ten times better than a project that you didn't really care about.

JACOB: Yeah, just have fun is my advice.

ADRIANA: Yeah, that is really great advice, and I couldn't agree more. Yeah, and coding should be fun. It definitely puts me in a happy place when I'm working on an exciting project that I dream up, some weird thing that I want to explore. And then you learn so much, and, I don't know, you get a high. The programmer's high.

JACOB: Exactly.

ADRIANA: Totally down for that. Awesome. Cool. Well, thanks so much, Jacob, for joining today. So y'all, don't forget to subscribe. Be sure to check the show notes for additional resources and to connect with us and with our guests on social media. Until next time...

JACOB: Peace out and Geek out.

ADRIANA: Geeking Out is hosted and produced by me, Adriana Villela. I also compose and perform the theme music on my trusty clarinet. Geeking Out is also brought to you by my daughter, Hannah Maxwell, who, incidentally, designed all of the cool graphics. Be sure to follow us on all the socials by going to Bento Me slash Geeking Out.