<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Atoms]]></title><description><![CDATA[Physical automation to transform industry and move the world.]]></description><link>https://techblog.atoms.co</link><image><url>https://substackcdn.com/image/fetch/$s_!cYKl!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4f44e9-2651-435c-8dd6-9ac2d0711786_624x624.png</url><title>Atoms</title><link>https://techblog.atoms.co</link></image><generator>Substack</generator><lastBuildDate>Wed, 29 Apr 2026 02:37:23 GMT</lastBuildDate><atom:link href="https://techblog.atoms.co/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[City Storage Systems]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[engineering4citystoragesystems@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[engineering4citystoragesystems@substack.com]]></itunes:email><itunes:name><![CDATA[Atoms Tech]]></itunes:name></itunes:owner><itunes:author><![CDATA[Atoms Tech]]></itunes:author><googleplay:owner><![CDATA[engineering4citystoragesystems@substack.com]]></googleplay:owner><googleplay:email><![CDATA[engineering4citystoragesystems@substack.com]]></googleplay:email><googleplay:author><![CDATA[Atoms Tech]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Announcement: Splitter, First Open Source Project]]></title><description><![CDATA[We&#8217;re announcing our first open source project on github.]]></description><link>https://techblog.atoms.co/p/announcement-splitter-first-open</link><guid 
isPermaLink="false">https://techblog.atoms.co/p/announcement-splitter-first-open</guid><pubDate>Tue, 14 Apr 2026 17:21:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!E4Tg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E4Tg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E4Tg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 424w, https://substackcdn.com/image/fetch/$s_!E4Tg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 848w, https://substackcdn.com/image/fetch/$s_!E4Tg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 1272w, https://substackcdn.com/image/fetch/$s_!E4Tg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E4Tg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png" 
width="980" height="631" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:980,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://techblog.atoms.co/i/193656746?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E4Tg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 424w, https://substackcdn.com/image/fetch/$s_!E4Tg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 848w, https://substackcdn.com/image/fetch/$s_!E4Tg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 1272w, https://substackcdn.com/image/fetch/$s_!E4Tg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe70f5860-af25-4d4a-ab40-0b13cdbf9266_980x631.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We&#8217;re announcing our first open source project on <a href="https://github.com/atoms-co/splitter">GitHub</a>.</p><p>It&#8217;s a sharding service in the same family as Google&#8217;s Slicer and Databricks&#8217; Dicer, except that it natively offers topology-aware routing, exclusive grants, and other great features.</p><p>Splitter is the foundation for some of our consistent multi-regional primitives. In the coming months, we&#8217;ll open source other projects that are built on top of Splitter.</p><p>When we released a <a href="https://techblog.atoms.co/p/easy-as-pie-stateful-services-at">blog post</a> discussing the architecture and philosophy behind Splitter, we got a number of requests to open source the project. 
So we&#8217;re excited to finally do it.</p>]]></content:encoded></item><item><title><![CDATA[Cloudless Blob: Scaling past cloud provider limits while saving 25%]]></title><description><![CDATA[Implementing virtually unlimited blob storage on top of a less scalable blob storage]]></description><link>https://techblog.atoms.co/p/cloudless-blob-scaling-past-cloud</link><guid isPermaLink="false">https://techblog.atoms.co/p/cloudless-blob-scaling-past-cloud</guid><pubDate>Wed, 18 Feb 2026 15:45:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!r66r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Written by <a href="https://www.linkedin.com/in/fmogensen/">Frederik Mogensen</a>, member of our storage team who led the development of the Bucket-Gateway infrastructure.</em></p><blockquote></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r66r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r66r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 424w, https://substackcdn.com/image/fetch/$s_!r66r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 848w, 
https://substackcdn.com/image/fetch/$s_!r66r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 1272w, https://substackcdn.com/image/fetch/$s_!r66r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r66r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png" width="1456" height="949" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:949,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:962841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://techblog.atoms.co/i/186861860?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r66r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 424w, 
https://substackcdn.com/image/fetch/$s_!r66r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 848w, https://substackcdn.com/image/fetch/$s_!r66r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 1272w, https://substackcdn.com/image/fetch/$s_!r66r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe77caac-9a3b-4327-9f85-cd66f968db2e_1600x1043.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Blob storage (S3, GCS, etc.) is an amazing technology and the gold standard for large-scale data storage. It is hard for us to imagine modern machine learning without blob storage feeding data into GPU fleets. We rely heavily on blob storage.</p><p>We moved all our blob storage onto Azure Blob Storage (see the first post about <a href="https://techblog.cloudkitchens.com/p/cloudless-portable-blob">Cloudless Blob</a>) and observed that our scale exceeded what Azure could provide out of the box. 
This post focuses on the technology that enabled us to scale past Azure&#8217;s bottlenecks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bWvN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bWvN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 424w, https://substackcdn.com/image/fetch/$s_!bWvN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 848w, https://substackcdn.com/image/fetch/$s_!bWvN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 1272w, https://substackcdn.com/image/fetch/$s_!bWvN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bWvN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png" width="325" height="262" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:325,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bWvN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 424w, https://substackcdn.com/image/fetch/$s_!bWvN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 848w, https://substackcdn.com/image/fetch/$s_!bWvN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 1272w, https://substackcdn.com/image/fetch/$s_!bWvN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4dc875-60ce-44d0-a181-87b84dd7d473_325x262.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Sticker popular on Reddit forums</em></figcaption></figure></div><h2>Hitting the limits of Azure</h2><p>After moving our data analytics stack to Azure, it became clear that Azure Blob Storage was not nearly as elastic and auto-scaling as the storage at our previous cloud provider. We started hitting the throughput limits on data in and out of our buckets, as well as limits on the number of reads and writes per second.</p><h3>Our initial design</h3><p>First, we will take a short look at our initial design for Blob Storage on Azure. Azure Blob Storage has inherent scaling limits and quotas that can impact performance and growth. Here are the most important limits we had to consider:</p><h4>1. Request Throughput per Storage Account</h4><ul><li><p>Default request rate: between <code>20,000</code> and <code>40,000 requests/sec</code>, depending on the region</p></li></ul><h4>2. 
Bandwidth Limits per Storage Account</h4><ul><li><p>Maximum Ingress (Data In to Blob): <code>60 Gbps (~7.5 GB/sec)</code></p></li><li><p>Maximum Egress (Data Out of Blob): <code>200 Gbps (~25 GB/sec)</code></p></li></ul><h4>3. Storage Capacity per Storage Account</h4><ul><li><p>Maximum Data Stored: <code>5 PiB (Pebibytes)</code></p></li></ul><h4>4. Number of Storage Accounts</h4><ul><li><p>Maximum Storage Accounts per Subscription: 200 accounts</p></li></ul><p>As many of the limits apply per Storage Account, we initially provisioned a handful of accounts and spread the buckets we were migrating across them as evenly as possible. But we soon started experiencing noisy neighbours on our Blob Storage Accounts. One example was an incident where writes and reads on a single Iceberg table saturated the limits for an entire Storage Account.</p><h3>Analytics batch jobs broke our frontends</h3><p>A large SQL job in Trino, loading many different Parquet files from an analytics bucket, would use all our bandwidth, and Azure would throttle all buckets on the same Storage Account. As a result, frontends did not load, because the requests for HTML and JavaScript files were being dropped by Azure.</p><h3>Unreliable Storage Accounts</h3><p>In general, the Azure Blob Storage SLAs look like this:</p><h4>Read requests SLA</h4><p>99.9%: For Locally-Redundant Storage (LRS), Zone-Redundant Storage (ZRS), and Geo-Redundant Storage (GRS), and for RA-GRS if retries aren&#8217;t used.</p><h4>Write requests SLA</h4><p>99.9%: For all standard redundancy options (LRS, ZRS, GRS, and RA-GRS).</p><p>In practice, this meant we often had failing requests towards Azure Blob Storage. A 99.9% SLA allows for up to roughly 45 minutes of unavailability every month, and living with that risk was completely unacceptable to us.</p><h2>Latencies</h2><p>We identified the latency of the ListObjects operation as a significant area of concern. 
This particular operation has notably poor performance both on average and at the 99th percentile. This mainly impacted the overall reliability and user experience of our analytics stack. The high latencies often led to cascading issues, including timeouts and degraded user interface responsiveness.</p><h2>Cost</h2><p>We pay a lot for Head and List requests because we make a lot of Head and List requests, and we would like this to be cheaper!</p><h1>Designing a scalable solution</h1><h2>Ask Azure to scale more horizontally?</h2><p>The first solution we explored was aimed at Azure itself: asking for higher limits, and trying to understand why the infrastructure did not scale the way we were used to and had hoped for.</p><p>This attempt worked a bit, and we got a few of the limits raised temporarily. But in the end we had to accept that the cloud is not magical and is in fact just someone else&#8217;s computer. With the temporary bump in limits stopping the bleeding, we started looking at how we could solve our scaling challenges ourselves.</p><h2>Just use less blob storage?</h2><p>Another straightforward solution was to just use less blob storage: ask our stakeholders to make fewer requests per second, store less data, and read and write fewer bytes. This solution is clearly the cheapest one, since less usage of blob storage means less money spent on blob storage, at least as long as no one decides to store the data somewhere else instead. Unfortunately, during periods of heavy rate limiting we saw multiple proposals to move use cases from cheap blob storage to much more expensive tech, such as SSDs, Redis key-value databases, or CockroachDB clusters.</p><h2>Create a new scalable infrastructure on top of Azure Blob Storage?</h2><p>We could not really provision a single Storage Account per bucket, as we have many more buckets than the Azure limit of 200 Storage Accounts allows. 
Even if we could allocate a new Storage Account per bucket, this would not solve the throughput or storage limit problems for our analytics buckets. It would also not help with the low SLA on Storage Accounts.</p><h2>Implement a logical sharding of our buckets</h2><p>We ended up proposing a transparent sharding layer on top of Azure Blob Storage that spreads blobs from different buckets across many Storage Accounts. This approach would:</p><ul><li><p>Remove the request rate limit for a single Storage Account (<code>20k-40k requests/sec</code>) by sharding requests across multiple Storage Accounts</p></li><li><p>Remove the throughput limit for a single Storage Account (<code>60-200 Gbps</code>) by sharding reads and writes across multiple Storage Accounts</p></li><li><p>Remove the size limit for a single Storage Account (<code>5 PiB</code>) by sharding data across multiple Storage Accounts</p></li></ul><p>This approach included a new in-house Metadata Layer to keep track of which blobs were stored on which Azure Storage Accounts. Using this Metadata Layer resulted in:</p><ul><li><p>Being able to respond to HeadObject and ListObjects requests without having to query Azure Blob Storage.</p><ul><li><p>Making both request types much cheaper.</p></li><li><p>Making ListObjects much faster.</p></li></ul></li><li><p>Being able to store multiple replicas of the same blob, consistently, on multiple Storage Accounts, in different regions.</p><ul><li><p>Removing the reliability problems we saw with single Storage Accounts.</p></li></ul></li></ul><p>As crazy as it sounds, we proposed implementing a more scalable blob storage on top of a less scalable blob storage.</p><h2>Architecture and Metadata</h2><p>To address the scaling and cost issues, we added the option for any bucket to use a simple metadata storage layer. This layer keeps track of which blobs are stored on which Storage Accounts, along with each blob&#8217;s current version, size, ETag, and custom metadata. 
Whenever someone downloads an object or lists blobs in the bucket, the Bucket Gateway checks the metadata database first. For HeadObject and ListObjects operations, as well as for requests for non-existent blobs, it can answer right away without calling Azure at all. This is a huge improvement, since these types of requests are frequent and expensive.</p><p>For storing data, we shard blobs across multiple Storage Accounts, spreading the load and sidestepping Azure&#8217;s per-account limits on requests, capacity, and bandwidth. The Metadata Layer records these assignments, which makes it easy to add new Storage Accounts when existing ones are getting close to their quotas, and helps minimize disruption when any given account is slow or unavailable.</p><h3>Read flow</h3><p>When reading blobs from buckets that use the sharding Metadata Layer, the Bucket Gateway performs the following steps:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P7Ug!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P7Ug!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 424w, https://substackcdn.com/image/fetch/$s_!P7Ug!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 848w, 
https://substackcdn.com/image/fetch/$s_!P7Ug!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 1272w, https://substackcdn.com/image/fetch/$s_!P7Ug!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P7Ug!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif" width="1456" height="701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:544752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/186861860?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P7Ug!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 424w, 
https://substackcdn.com/image/fetch/$s_!P7Ug!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 848w, https://substackcdn.com/image/fetch/$s_!P7Ug!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 1272w, https://substackcdn.com/image/fetch/$s_!P7Ug!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69b2412-da01-4edb-b455-cdc863e35f18_1856x894.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>Authorizes the request</p><ol><li><p>Can the given user read the requested blob in the bucket?</p></li></ol></li><li><p>Looks up where the blob lives</p><ol><li><p>The metadata for any given blob is stored in the Metadata Layer.</p></li><li><p>The actual blobs may be stored in any region and any Storage Account.</p></li></ol></li><li><p>If the blob is found in the Metadata Layer, the Bucket Gateway reads the blob content from the identified Storage Account, using the versioned blob name.</p></li></ol><h3>Write flow</h3><p>The write flow is a bit more involved. The process detailed below ensures that blobs are written atomically.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kT16!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kT16!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 424w, https://substackcdn.com/image/fetch/$s_!kT16!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 848w, https://substackcdn.com/image/fetch/$s_!kT16!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!kT16!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kT16!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif" width="1456" height="701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1031625,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/186861860?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kT16!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 424w, https://substackcdn.com/image/fetch/$s_!kT16!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 848w, 
https://substackcdn.com/image/fetch/$s_!kT16!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 1272w, https://substackcdn.com/image/fetch/$s_!kT16!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a239e25-b0fc-45d8-89cb-8b6de5ef8d61_1856x894.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>Authorizes the request</p><ol><li><p>Can the given user write the requested blob in the 
bucket?</p></li></ol></li><li><p>Validates the request against the current metadata.</p><ol><li><p>E.g., the If-None-Match header, which requires that the blob does not already exist.</p></li></ol></li><li><p>Uses the Placement Algorithm detailed below to find the target Storage Account shard for the new or updated blob content.</p></li><li><p>Uploads the content under a new, unique versioned name in the shard.</p><ol><li><p>If the upload fails due to a problem with the Storage Account (rate limiting, throughput limiting, unavailability, etc.), it uses the Placement Algorithm again to choose an alternative Storage Account and retries.</p></li></ol></li><li><p>Commits the new version of the blob to the Metadata Layer with the new backing location, size, custom S3 metadata, etc.</p><ol><li><p>If any step fails, we can safely return an error to the client without risking any externally visible inconsistency.</p></li><li><p>Any uploaded content that fails to be added to the metadata store is cleaned up asynchronously.</p></li></ol></li><li><p>Cleans up old versions asynchronously.</p></li></ol><h3>Placement algorithm</h3><p>The algorithm used to assign a blob to one of the many shared Storage Accounts is fairly simple.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NXqt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NXqt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 424w, 
https://substackcdn.com/image/fetch/$s_!NXqt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 848w, https://substackcdn.com/image/fetch/$s_!NXqt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 1272w, https://substackcdn.com/image/fetch/$s_!NXqt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NXqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif" width="1456" height="701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:463331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/186861860?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!NXqt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 424w, https://substackcdn.com/image/fetch/$s_!NXqt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 848w, https://substackcdn.com/image/fetch/$s_!NXqt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 1272w, https://substackcdn.com/image/fetch/$s_!NXqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25f03cf-c085-4080-a44a-4acfee428e71_1856x894.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>From all Storage Accounts, filter out those that are:</p><ol><li><p>Not marked as electable for new blobs (accounts that are too close to one of their limits)</p></li><li><p>Marked by the circuit breaker as currently unstable (multiple failed reads or writes within the last 5 minutes)</p><ol><li><p>If all circuit breakers are open, we pick a random account and try anyway.</p></li></ol></li></ol></li><li><p>We are now left with all accounts that are expected to work. We choose the destination by:</p><ol><li><p>Picking a random account in the current region, if one exists</p></li><li><p>Otherwise, picking any random account</p></li></ol></li></ol><p>This approach ensures that we shard blobs fairly across all active Storage Accounts.</p><h3>Scaling process</h3><p>When a single Storage Account gets close to full utilization, it&#8217;s removed from the set of electable accounts. When the current set of electable accounts gets too small, a handful of new ones are added. The Placement Algorithm then ensures that new blobs only land on accounts with free capacity.</p><h1>The results</h1><p><em>- Speed, Money, and Fame</em></p><h2>Sharding</h2><p>Looking back at the initial four pain points, the Metadata Layer has improved our blob storage on all dimensions. 
It has helped us scale beyond Azure&#8217;s built-in limits, slash costs for common operations, and keep our system responsive for users.</p><h3>Scaling and Reliability</h3><p>Scaling and reliability have improved for both reads and writes.</p><p>The main problems with hot buckets hitting Storage Account limits are gone, because the Bucket Gateway now shards each bucket across many Storage Accounts. Failed writes are much less frequent because the Bucket Gateway can retry a write on an alternative Storage Account when the initial one is degraded, and failed reads are gone because Azure no longer has to throttle the accounts.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xfPN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xfPN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 424w, https://substackcdn.com/image/fetch/$s_!xfPN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 848w, https://substackcdn.com/image/fetch/$s_!xfPN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xfPN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xfPN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png" width="1456" height="333" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:333,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xfPN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 424w, https://substackcdn.com/image/fetch/$s_!xfPN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 848w, https://substackcdn.com/image/fetch/$s_!xfPN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xfPN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f30dc5d-02ab-4cc3-b2fb-c26deeda7442_1600x366.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Throttling events on one of the original Storage Accounts</em></figcaption></figure></div><h3>Latencies</h3><p>Latency has improved because the Metadata Layer allows the Bucket Gateway to answer all Head and List requests without querying Azure Blob Storage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WtDN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WtDN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 424w, https://substackcdn.com/image/fetch/$s_!WtDN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 848w, https://substackcdn.com/image/fetch/$s_!WtDN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 1272w, https://substackcdn.com/image/fetch/$s_!WtDN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WtDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png" width="1456" height="557" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:557,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WtDN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 424w, https://substackcdn.com/image/fetch/$s_!WtDN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 848w, https://substackcdn.com/image/fetch/$s_!WtDN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 1272w, https://substackcdn.com/image/fetch/$s_!WtDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c0daa2e-0e54-4c19-b258-8282946fe045_1600x612.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>P99 latency for ListObjects</em></figcaption></figure></div><h3>Cost</h3><p>Keeping all metadata in-house allows Bucket-Gateway to answer all Head and List requests without querying Azure Blob Storage.</p><blockquote></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vOaT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!vOaT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 424w, https://substackcdn.com/image/fetch/$s_!vOaT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 848w, https://substackcdn.com/image/fetch/$s_!vOaT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 1272w, https://substackcdn.com/image/fetch/$s_!vOaT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vOaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png" width="1456" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!vOaT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 424w, https://substackcdn.com/image/fetch/$s_!vOaT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 848w, https://substackcdn.com/image/fetch/$s_!vOaT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 1272w, https://substackcdn.com/image/fetch/$s_!vOaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5197628d-0d3e-4072-88d7-5450dc3cf480_1600x327.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Cost for all Blob operations running through the Bucket Gateway per day. The decrease shows when the Metadata Layer started answering Head and List requests.</em></figcaption></figure></div><p>The Metadata Layer with versioned blob names also allows us to implement a consistent in-cluster cache to read from. But that&#8217;s for another blog post.</p><p><em>Cover photo by <a href="https://unsplash.com/@frankiefoto">https://unsplash.com/@frankiefoto</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Outperforming Industry Standard CRDT Implementations]]></title><description><![CDATA[90% less latency and 4x less memory overhead than when we used Ditto]]></description><link>https://techblog.atoms.co/p/protocol-buffer-crdts-outperforming</link><guid isPermaLink="false">https://techblog.atoms.co/p/protocol-buffer-crdts-outperforming</guid><pubDate>Wed, 07 Jan 2026 17:36:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QGay!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Written by <a href="https://www.linkedin.com/in/raperez/">Roberto Perez</a>, Adam Share, and <a href="https://www.linkedin.com/in/michael-olson/">Michael Olson</a>, engineers who helped build Otter POS.</em></p><p>When we set out to sync state across 10+ devices per restaurant without a leader, we started where most teams do: off-the-shelf conflict-free replicated data type (CRDT) libraries. They seemed to work as advertised, but then we measured performance on the low-end Android tablets our customers actually use and we were shocked: database operations were too slow, memory overhead was too high, and device performance tanked.</p><p>So we went back to first principles and built something different. 
By rethinking how version metadata relates to business data, we achieved a <strong>90% reduction in database latency</strong> and <strong>4x less memory overhead</strong> compared to the container-based approach that dominates the CRDT landscape.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QGay!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QGay!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 424w, https://substackcdn.com/image/fetch/$s_!QGay!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 848w, https://substackcdn.com/image/fetch/$s_!QGay!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!QGay!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QGay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png" width="1600" height="1164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1164,&quot;width&quot;:1600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:424037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QGay!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 424w, https://substackcdn.com/image/fetch/$s_!QGay!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 848w, https://substackcdn.com/image/fetch/$s_!QGay!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!QGay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69cae189-b2e1-4923-8d13-5b121ab319f4_1600x1164.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: DB operation latency plummeted once we dropped container-based CRDTs.</figcaption></figure></div><p>This post walks through the CRDT architecture that most off-the-shelf libraries use, where it breaks down for structured data, and the insight that lets us dramatically outperform it. If you&#8217;re syncing protobuf messages between distributed nodes, this approach may work for you too.</p><h2>Why CRDTs?</h2><p>A burger joint cannot stop selling burgers just because the internet goes down. The countertop point-of-sale, the kitchen display system, the self-ordering kiosk at the front&#8211;all of these devices need to be in sync, regardless of network connectivity. Requiring backend server coordination is not an option. 
We evaluated other approaches:</p><p><strong>On-prem server</strong>: Additional hardware at each location creates a single point of failure and operational complexity that doesn&#8217;t scale.</p><p><strong>Primary device coordination</strong>: Restaurant environments are hostile: devices drop from WiFi constantly, tablets overheat, someone spills liquid near a power supply. When your coordinator goes down during the dinner rush, everything stops.</p><p><strong>Consensus algorithms (Raft, Paxos)</strong>: Many restaurants run 2-3 devices. Two devices are clearly unworkable for quorum-based algorithms.</p><p>With CRDTs, we remove the need for coordination by embracing concurrent updates across replicas, leaning on an algorithm that deterministically resolves inconsistencies. Devices can update their local state independently, regardless of connectivity status, and then synchronize their data when they reconnect. There&#8217;s no single source of truth, no single point of failure, and data in the mesh is guaranteed to eventually converge.</p><h2>The Standard Approach: Container-Based CRDTs</h2><p>Most CRDT libraries follow a common architectural pattern: wrap every value in a container that implements a <a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs">commutative, associative, and idempotent &#8220;merge&#8221; function</a>.</p><p>For &#8220;last-write-wins&#8221; (LWW) data types, this means wrapping both the data and its version metadata in each container. This is the approach used by <a href="https://www.ditto.live/">Ditto</a>, <a href="https://automerge.org/">Automerge</a>, <a href="https://github.com/yjs/yjs">Y-CRDT</a>, and most open-source implementations.</p><pre><code>// The standard approach: wrap every value in a &#8216;mergeable&#8217; data type
interface Mergeable {
    /** Deterministically merges a remote state with this one. */
    fun merge(remote: Mergeable): MergeResult
}

enum class MergeResult {
    OURS, // Our data was unchanged (all remote data was ignored)
    THEIRS, // Our data is now equal to the remote data
    NEW, // The merged data is equal to neither ours nor the remote data
}

class LWWRegister&lt;T&gt;(
    val value: T,
    val version: Long
) : Mergeable

class LWWMap&lt;K, V&gt;(
    val entries: Map&lt;K, LWWRegister&lt;V&gt;&gt;
): MutableMap&lt;K, V&gt;, Mergeable

// Your data becomes deeply nested containers
class Order(
    val customerId: LWWRegister&lt;String&gt;,
    val status: LWWRegister&lt;OrderStatus&gt;,
    val items: LWWMap&lt;String, OrderItem&gt;,
    // ... every field wrapped
) : Mergeable</code></pre><p>This pattern is conceptually elegant: each value carries its own merge logic. But elegance doesn&#8217;t always equate to performance.</p><h3>The Hidden Costs</h3><p>When we deployed a container-based CRDT solution in production, the costs became clear.</p><p><strong>Memory explosion</strong>: A 1KB protobuf blob became 4-5KB with containers. Each field needed a wrapper with type metadata, a value container, and version info, even when that metadata was identical across multiple fields. Average orders were 4-5KB, but complex orders reached 100KB even before CRDT metadata was added, which led to memory spikes on low-end devices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mczR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mczR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 424w, https://substackcdn.com/image/fetch/$s_!mczR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 848w, https://substackcdn.com/image/fetch/$s_!mczR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 1272w, https://substackcdn.com/image/fetch/$s_!mczR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mczR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png" width="1456" height="840" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:840,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mczR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 424w, https://substackcdn.com/image/fetch/$s_!mczR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 848w, https://substackcdn.com/image/fetch/$s_!mczR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 1272w, https://substackcdn.com/image/fetch/$s_!mczR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42138db-30a4-4afb-ae9f-abddb517a431_1600x923.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: An order modeled using the container approach with field-level versioning</figcaption></figure></div><p><strong>Mapping overhead</strong>: For type safety and interoperability with other systems, our data models are defined using <a href="https://protobuf.dev/">Protocol Buffers</a>. This allows us to interact with the data in a natural way, provides clear schema update rules, and enables the reuse of several pieces of infrastructure already available around this technology.</p><p>However, this choice also means that we had to implement adapters to map domain object fields into generic container types. 
As a result, every read and write operation to the database required an O(F x D) transformation (where F is the number of fields and D is the depth).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_B1Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_B1Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 424w, https://substackcdn.com/image/fetch/$s_!_B1Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 848w, https://substackcdn.com/image/fetch/$s_!_B1Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 1272w, https://substackcdn.com/image/fetch/$s_!_B1Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_B1Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png" width="1456" height="848" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_B1Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 424w, https://substackcdn.com/image/fetch/$s_!_B1Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 848w, https://substackcdn.com/image/fetch/$s_!_B1Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 1272w, https://substackcdn.com/image/fetch/$s_!_B1Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8823229-bc5f-43a7-98db-9187ba2b3ffc_1600x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: Read and write operations using containers.</figcaption></figure></div><p><strong>Version redundancy</strong>: When an actor updates multiple related fields, they share the same version. Yet container-based systems store identical version info for each field.</p><p>For applications syncing small documents or text, these costs are acceptable. But our devices aren&#8217;t just syncing&#8212;they&#8217;re simultaneously generating bitmaps for receipt printing, fetching menus, processing payments, and rendering content across multiple screens. The CRDT layer competes for limited CPU and memory with everything else.</p><h2>Our Approach: Separate Version from Data</h2><p>Traditional CRDT libraries embed versions within values. Our key insight: separate them instead. Maintain a <strong>parallel version tree</strong> that mirrors your data structure. 
Remove redundant nodes in the version tree by only storing entries for fields that differ from the base version.</p><p>For example, in Figure 4 all fields were set at once, so they inherit the root version. <strong>No per-field version storage needed</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WmFH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WmFH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 424w, https://substackcdn.com/image/fetch/$s_!WmFH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 848w, https://substackcdn.com/image/fetch/$s_!WmFH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 1272w, https://substackcdn.com/image/fetch/$s_!WmFH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WmFH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WmFH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 424w, https://substackcdn.com/image/fetch/$s_!WmFH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 848w, https://substackcdn.com/image/fetch/$s_!WmFH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 1272w, https://substackcdn.com/image/fetch/$s_!WmFH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289082a2-710c-4e85-a023-4d5f6cc4fbd2_1600x914.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: A newly created order modeled using a parallel version tree with version inheritance.</figcaption></figure></div><p>When a device updates only the status field, then only that field gets a version entry. The other fields continue inheriting from the base version. 
This gives us <strong>O(m) space complexity</strong> where m = modified fields, instead of O(F) for all fields.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G6yB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G6yB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 424w, https://substackcdn.com/image/fetch/$s_!G6yB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 848w, https://substackcdn.com/image/fetch/$s_!G6yB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 1272w, https://substackcdn.com/image/fetch/$s_!G6yB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G6yB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png" width="1456" height="884" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G6yB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 424w, https://substackcdn.com/image/fetch/$s_!G6yB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 848w, https://substackcdn.com/image/fetch/$s_!G6yB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 1272w, https://substackcdn.com/image/fetch/$s_!G6yB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58db624-bdaa-4134-90ae-1e8cc8f06c6e_1600x971.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: The partial update to an order modeled using a parallel version tree.</figcaption></figure></div><p>The data layer remains a clean protobuf domain object, free of any version semantics. The parallel version layer mirrors the data layer, tracking field-level modifications in a sparse tree. 
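</p><p>To make the inheritance rule concrete, here is a minimal Kotlin sketch of a sparse version store. It is not Splitter&#8217;s implementation: the class name SparseVersionTree, the methods recordWrite and versionOf, the dotted field paths, and the use of a plain Long as the version are all assumptions made for illustration. Only fields written after creation get an entry; every other lookup walks up to the nearest ancestor that has one.</p><pre><code>// Illustrative sketch only; every name here is hypothetical.
class SparseVersionTree(baseVersion: Long) {
    // The empty path "" is the root. It always has an entry, so lookups terminate.
    private val explicit = mutableMapOf("" to baseVersion)

    // Number of versions actually stored: the root plus each modified field.
    val storedEntries: Int
        get() = explicit.size

    // Called when a single field (for example "status") is written on its own.
    fun recordWrite(path: String, version: Long) {
        explicit[path] = version
    }

    // Returns the field's own version, or the nearest ancestor's version if it has none.
    // "payment.method" falls back to "payment"; a path with no dot falls back to the root "".
    fun versionOf(path: String): Long =
        explicit[path] ?: versionOf(path.substringBeforeLast('.', ""))
}

fun main() {
    val versions = SparseVersionTree(baseVersion = 100L)
    versions.recordWrite("status", 250L)

    println(versions.versionOf("status"))         // 250 (explicit entry)
    println(versions.versionOf("customer_name"))  // 100 (inherited from the root)
    println(versions.versionOf("payment.method")) // 100 (inherited through "payment")
    println(versions.storedEntries)               // 2
}</code></pre><p>In this sketch a status-only update stores two entries (the root and status) no matter how many fields the order has, which is the O(m) behavior described above.</p><p>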
Fields without explicit entries inherit the base version from their parent node, which saves substantial memory in the typical case where a message has many fields but only a few of them change over time.</p><h3>Complexity Comparison</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lQzU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lQzU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 424w, https://substackcdn.com/image/fetch/$s_!lQzU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 848w, https://substackcdn.com/image/fetch/$s_!lQzU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 1272w, https://substackcdn.com/image/fetch/$s_!lQzU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lQzU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png" width="998" height="222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/182661230?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lQzU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 424w, https://substackcdn.com/image/fetch/$s_!lQzU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 848w, https://substackcdn.com/image/fetch/$s_!lQzU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 1272w, https://substackcdn.com/image/fetch/$s_!lQzU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45ed9062-ca32-453a-a5bc-f0788ef14669_998x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 6: Complexity comparison. F, fields. D, depth. 
m, modified fields</figcaption></figure></div><p>With the container approach, every merge and every field read goes through the CRDT container wrappers around each field, so the cost of an operation is proportional to the number of fields times the depth of the structure.</p><p>In our approach, version metadata is sparse and stored separately from the business data, so reads are constant-time and merging is proportional only to the number of modified fields.</p><h2>How It Works</h2><p>Let&#8217;s walk through concrete examples using a restaurant order schema. First, the protobuf definition with CRDT options:</p><pre><code>import "com/css/protobuf/crdt/data/options/message_options.proto";
import "com/css/protobuf/crdt/data/options/field_options.proto";

message Order {
    string id = 1;
    string customer_name = 2;
    OrderStatus status = 3;

    // Nested message: field-level merge by default
    PaymentInfo payment = 4;

    // Nested message with atomic replacement
    Receipt receipt = 5 [
        (com.css.protobuf.crdt.data.options.crdt_merge_strategy) = REPLACE
    ];

    // Map: per-key versioning with tombstone TTL
    map&lt;string, string&gt; metadata = 6 [
        (com.css.protobuf.crdt.data.options.crdt_tombstone_ttl) = 3600
    ];

    // Repeated with ID field: element-level merge
    repeated LineItem items = 7 [
        (com.css.protobuf.crdt.data.options.crdt_id_field) = 1
    ];

    // Counter: concurrent increments merge correctly
    int64 modification_count = 8 [
        (com.css.protobuf.crdt.data.options.crdt_merge_strategy) = COUNTER
    ];
}</code></pre><p>Last-write-wins doesn&#8217;t work for every data type, so we still need to support custom per-field merge strategies. We leaned into a schema-driven approach and declare those strategies as protobuf options. The scenarios below cover the default merge case and some of the custom ones.</p><h3>Scenario 1: Basic Field Merge</h3><p>Assume we have two devices making concurrent modifications to different fields of the order protobuf. Device A updates the customer_name field, while device B updates the status field.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qp1b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qp1b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 424w, https://substackcdn.com/image/fetch/$s_!Qp1b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 848w, https://substackcdn.com/image/fetch/$s_!Qp1b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp1b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Qp1b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png" width="1456" height="723" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:723,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qp1b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 424w, https://substackcdn.com/image/fetch/$s_!Qp1b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 848w, https://substackcdn.com/image/fetch/$s_!Qp1b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp1b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccd7f2de-0ad5-483b-9197-c09715ad527c_1600x795.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7: Two devices update different fields in the same order concurrently.</figcaption></figure></div><p>Both devices apply a local write to the order. When a device receives the other&#8217;s update, it performs a commutative merge that uses both version trees to reach a last-write-wins result.</p><pre><code>// Device A: Update customer name
val orderA = existingOrder.copy(customer_name = "Jane Doe")
val deltaA = orderResolver.applyLocalWrite(/* ... */, timestamp = 1000)

// Device B: Update status (at the same time, different field)
val orderB = existingOrder.copy(status = OrderStatus.CONFIRMED)
val deltaB = orderResolver.applyLocalWrite(/* ... */, timestamp = 1001)

// When B receives A's update, both changes merge using the version trees
val merged = orderResolver.resolveIncoming(
    deviceB.value,
    deviceB.versionTree,
    deviceA.value,
    deviceA.versionTree
)

// Result: customer_name="Jane Doe", status=CONFIRMED
// Both updates preserved - no data loss</code></pre><h3>Scenario 2: Map Field with Concurrent Updates</h3><p>Map fields require correct handling of deleted elements. Assume two devices update different keys in a map field, where one device also deletes a key.</p><pre><code>// Device A: Add a metadata entry
val orderA = existingOrder.copy(
    metadata = existingOrder.metadata + ("source" to "mobile_app")
)

// Device B: Add different entry and delete another
val orderB = existingOrder.copy(
    metadata = (existingOrder.metadata - "old_key") + ("priority" to "high")
)

// After merge: all additions preserved, deletion applied
// metadata = { "source": "mobile_app", "priority": "high", ... }
// "old_key" removed (tombstoned)
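// Why the tombstone matters, as a minimal sketch. Versioned and mergeMaps are
// hypothetical names for illustration, not Splitter's actual API. Each key keeps
// its latest write, and a delete is itself a write whose value is null.
data class Versioned(val value: String?, val timestamp: Long) // null value = tombstone

fun mergeMaps(a: Map&lt;String, Versioned&gt;, b: Map&lt;String, Versioned&gt;): Map&lt;String, Versioned&gt; =
    (a.keys + b.keys).associateWith { key -&gt;
        // Keep whichever side wrote last (a real resolver also needs a deterministic tie-break)
        listOfNotNull(a[key], b[key]).maxByOrNull { it.timestamp }!!
    }

// Device A still holds "old_key"; Device B deleted it later. Because the deletion
// is recorded as a tombstone, A's stale copy cannot resurrect the key on merge.
val sketchA = mapOf("old_key" to Versioned("legacy", 1), "source" to Versioned("mobile_app", 2))
val sketchB = mapOf("old_key" to Versioned(null, 3), "priority" to Versioned("high", 3))
val liveKeys = mergeMaps(sketchA, sketchB).filterValues { it.value != null }.keys
// liveKeys contains "source" and "priority", but not "old_key"
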
val merged = orderResolver.resolveIncoming(/* ... */)</code></pre><p>Deleted entries persist in the version tree as tombstones, allowing the resolvers to merge map changes correctly across devices. The crdt_tombstone_ttl option lets you define a time-to-live (TTL) for tombstone entries to bound memory growth.</p><h3>Scenario 3: Repeated Field with ID-Based Merge</h3><p>Lists present a unique challenge during merge resolution because the order of elements can differ across devices even when the data is the same. Assume two devices modify different items in the order&#8217;s line items list.</p><pre><code>// Existing order has items: [{ id: "item-1", quantity: 1 }, { id: "item-2", quantity: 2 }]

// Device A: Update quantity on item-1
val itemsA = existingOrder.items.map { item -&gt;
    if (item.id == "item-1") item.copy(quantity = 3) else item
}
val orderA = existingOrder.copy(items = itemsA)

// Device B: Update price on item-2
val itemsB = existingOrder.items.map { item -&gt;
    if (item.id == "item-2") item.copy(price_cents = 999) else item
}
val orderB = existingOrder.copy(items = itemsB)

// After merge: both item updates preserved
// items = [{ id: "item-1", quantity: 3 }, { id: "item-2", quantity: 2, price_cents: 999 }]
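// One way to picture ID-based merging, as a minimal sketch. VersionedItem and
// mergeById are hypothetical names, and resolving whole elements by timestamp is a
// simplification of the version-tree merge described in this post.
data class VersionedItem(val id: String, val quantity: Int, val priceCents: Int?, val timestamp: Long)

fun mergeById(a: List&lt;VersionedItem&gt;, b: List&lt;VersionedItem&gt;): List&lt;VersionedItem&gt; =
    (a + b).groupBy { it.id } // match elements by id, never by position
        .map { (_, copies) -&gt; copies.maxByOrNull { it.timestamp }!! } // latest copy of each element wins

// The same elements arrive in a different order on each device and still pair up
val listA = listOf(VersionedItem("item-1", 3, null, 1000), VersionedItem("item-2", 2, null, 900))
val listB = listOf(VersionedItem("item-2", 2, 999, 1001), VersionedItem("item-1", 1, null, 900))
val mergedItems = mergeById(listA, listB)
// mergedItems holds item-1 with quantity 3 and item-2 with priceCents 999
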
val merged = orderResolver.resolveIncoming(/* ... */)</code></pre><p>The crdt_id_field option tells the resolver which field identifies each element. Without it, the entire list would need to be replaced atomically (i.e. last-write-wins on the entire list).</p><h3>Scenario 4: Counter Field</h3><p>When an integer or long field is marked with the COUNTER merging strategy, a <a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#G-Counter_(Grow-only_Counter)">G-Counter structure</a> is stored in the version tree. Counter fields use per-actor tracking internally so that each device&#8217;s contribution is recorded separately. The counter value is the sum of all contributions, ensuring concurrent increments never lose updates, whereas a naive last-write-wins approach would lose all but one of them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TWLb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TWLb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 424w, https://substackcdn.com/image/fetch/$s_!TWLb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 848w, https://substackcdn.com/image/fetch/$s_!TWLb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TWLb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TWLb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png" width="1456" height="808" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:808,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TWLb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 424w, https://substackcdn.com/image/fetch/$s_!TWLb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 848w, https://substackcdn.com/image/fetch/$s_!TWLb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TWLb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7961fde8-8ae1-4b9c-81d5-31f8a542108e_1600x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8: G-Counter structure stored inside the VersionNode tree.</figcaption></figure></div><p>If two devices increment a counter concurrently, both increments are preserved.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!zjdG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjdG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 424w, https://substackcdn.com/image/fetch/$s_!zjdG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 848w, https://substackcdn.com/image/fetch/$s_!zjdG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 1272w, https://substackcdn.com/image/fetch/$s_!zjdG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjdG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png" width="1456" height="725" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:725,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zjdG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 424w, https://substackcdn.com/image/fetch/$s_!zjdG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 848w, https://substackcdn.com/image/fetch/$s_!zjdG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 1272w, https://substackcdn.com/image/fetch/$s_!zjdG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd2d60c-2e5d-4fde-b5df-6b4f845e2eeb_1600x797.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9: Concurrent counter updates</figcaption></figure></div><pre><code>// Both devices start with modification_count = 5

// Device A: Increment by 1
val orderA = existingOrder.copy(modification_count = 6) // 5 + 1
val deltaA = orderResolver.applyLocalWrite(/* ... */)

// Device B: Increment by 2
val orderB = existingOrder.copy(modification_count = 7) // 5 + 2
val deltaB = orderResolver.applyLocalWrite(/* ... */)

// After merge: modification_count = 8 (5 + 1 + 2)
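
// A minimal sketch of that bookkeeping. GCounter here is a hypothetical type for
// illustration, not Splitter's actual API. The shared starting value of 5 is modeled
// as an earlier contribution from an assumed "seed" actor.
data class GCounter(val contributions: Map&lt;String, Long&gt; = emptyMap()) {
    val value: Long get() = contributions.values.sum()

    // An increment only touches the calling actor's own slot
    fun increment(actor: String, by: Long): GCounter =
        copy(contributions = contributions + (actor to ((contributions[actor] ?: 0L) + by)))

    // Merge takes the per-actor maximum, so it is commutative and idempotent
    fun merge(other: GCounter): GCounter =
        GCounter((contributions.keys + other.contributions.keys).associateWith { actor -&gt;
            maxOf(contributions[actor] ?: 0L, other.contributions[actor] ?: 0L)
        })
}

val start = GCounter(mapOf("seed" to 5L))
val counterA = start.increment("device-a", 1) // A's view: 6
val counterB = start.increment("device-b", 2) // B's view: 7
val mergedCount = counterA.merge(counterB).value // 5 + 1 + 2 = 8
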
// Counter tracks per-actor contributions and sums them</code></pre><h3>Scenario 5: Atomic Replacement vs Field Merge</h3><p>Different fields might require different merge strategies. For example, you may want to REPLACE on updates to the receipt field, but MERGE on updates to the payment field.</p><pre><code>// Device A updates payment method
val orderA = existingOrder.copy(
    payment = existingOrder.payment.copy(method = "credit_card")
)

// Device B updates payment amount
val orderB = existingOrder.copy(
    payment = existingOrder.payment.copy(amount_cents = 5000)
)

// After merge: both payment fields preserved
// payment = { method: "credit_card", amount_cents: 5000, ... }

// But for receipt (REPLACE strategy):
// Device A generates new receipt
// (distinct names, since orderA and orderB are already declared above)
val receiptOrderA = existingOrder.copy(
    receipt = Receipt(pdf_data = newPdfA, generated_at = "10:00")
)

// Device B also generates receipt
val receiptOrderB = existingOrder.copy(
    receipt = Receipt(pdf_data = newPdfB, generated_at = "10:01")
)

// After merge: Device B's receipt wins entirely (higher timestamp)
// No partial merge of receipt fields&#8212;it's atomic</code></pre><p>Use REPLACE for fields where partial merges don&#8217;t make sense: binary data, generated content, or tightly coupled field groups.</p><h2>Results</h2><p>Our approach is significantly more efficient and adds only minimal memory overhead to your existing data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QqCG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QqCG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 424w, https://substackcdn.com/image/fetch/$s_!QqCG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 848w, https://substackcdn.com/image/fetch/$s_!QqCG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 1272w, https://substackcdn.com/image/fetch/$s_!QqCG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QqCG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png" width="1082" height="344" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/182661230?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QqCG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 424w, https://substackcdn.com/image/fetch/$s_!QqCG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 848w, https://substackcdn.com/image/fetch/$s_!QqCG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 1272w, https://substackcdn.com/image/fetch/$s_!QqCG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f82d79e-d520-4aae-8164-96e22615934c_1082x344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 10: F, number of fields. D, depth. m, number of modified fields.</figcaption></figure></div><p><strong>Why do we win?</strong> With a parallel version tree divorced from business data, we eliminate per-field CRDT container wrappers which lead to O(F x D) complexity during read and write operations. By only storing modified fields in our version tree, memory overhead plummets and we can achieve a massive reduction in DB operation latency.</p><p>But dropping the containers doesn&#8217;t come for free. You still need to support custom per-field merge strategies. 
We leaned into a schema-driven approach using protobuf options: MERGE for field-level resolution, REPLACE for atomic updates, COUNTER for commutative operations, and ID-based lists for element-level tracking.</p><h2>Takeaways</h2><p>Separating versions from data is less natural and requires additional information to be stored. Since the additional information was small in practice, our approach reduced CRDT memory usage by 4-5x when combined with various optimizations.</p><p>Our approach works for any application syncing structured protobuf data across distributed nodes. We use it in production and plan on scaling it to thousands of restaurants, allowing Otter POS customers to handle network partitions and concurrent updates seamlessly on low-end Android hardware.</p><p>If you know an even better approach, let us know.</p>]]></content:encoded></item><item><title><![CDATA[Deployment Confidence in Era of AI Coding]]></title><description><![CDATA[Significant reliability improvements through in-house canary]]></description><link>https://techblog.atoms.co/p/deployment-confidence-in-era-of-ai</link><guid isPermaLink="false">https://techblog.atoms.co/p/deployment-confidence-in-era-of-ai</guid><pubDate>Thu, 06 Nov 2025 17:54:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ac0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ac0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Ac0u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 424w, https://substackcdn.com/image/fetch/$s_!Ac0u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 848w, https://substackcdn.com/image/fetch/$s_!Ac0u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 1272w, https://substackcdn.com/image/fetch/$s_!Ac0u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ac0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png" width="1238" height="882" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:1238,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Ac0u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 424w, https://substackcdn.com/image/fetch/$s_!Ac0u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 848w, https://substackcdn.com/image/fetch/$s_!Ac0u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 1272w, https://substackcdn.com/image/fetch/$s_!Ac0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c23de8-c1cb-46e6-9573-8d62284a9587_1238x882.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by <a href="https://www.linkedin.com/in/dauginis">Steponas Dauginis</a> and Sean Chen</em></p><p>We see more code being written by GenAI over time. Due to careful human inspection, we aren&#8217;t yet seeing a regression in the quality of code (<a href="https://techblog.cloudkitchens.com/p/study-and-update-on-genai-devex">link</a>). But if trends continue, companies will want solid and lighter-weight mechanisms in place to prevent bad code from shipping to customers. Robust canarying is one such mechanism.</p><p>In a canary deployment, new software is rolled out to a subset of traffic before wider distribution. If problems are detected, the software can be rolled back without impacting everyone.</p><p>At CloudKitchens, over 80% of all our service releases are conducted via canary, and over 95% of releases for services in the critical path (<a href="https://techblog.citystoragesystems.com/p/reliable-order-processing">order fulfillment</a>) leverage canary. In Q3 2025, canary blocked more than 1100 bad releases across the company, many of which could have resulted in user-facing regressions, if not downright outages.</p><p>If you search for blog articles on this topic, you will find writeups describing an idealized version of it. 
Many of these posts are penned by companies advertising their canary solution, often as part of a broader CI/CD offering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOjR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOjR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 424w, https://substackcdn.com/image/fetch/$s_!xOjR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 848w, https://substackcdn.com/image/fetch/$s_!xOjR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 1272w, https://substackcdn.com/image/fetch/$s_!xOjR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xOjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png" width="392" height="900.9777777777778" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1448,&quot;width&quot;:630,&quot;resizeWidth&quot;:392,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xOjR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 424w, https://substackcdn.com/image/fetch/$s_!xOjR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 848w, https://substackcdn.com/image/fetch/$s_!xOjR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 1272w, https://substackcdn.com/image/fetch/$s_!xOjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908effd-9941-4e3f-995d-64e1cdd7a545_630x1448.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here we describe how canary adoption and usage evolved in practice at our company, including early missteps and later improvements.</p><p>We are not trying to sell you anything; instead, we will share insights into the essential properties of a canary system that you will want, regardless of whether you buy a solution or build your own (as we did!).</p><h1><strong>High-Stakes Deployments</strong></h1><p>At CloudKitchens, processing real-time food orders is at the heart of our business.</p><p>In the beginning, all order fulfillment ran in a single region on our cloud provider. When we deployed a new version of our service, the new software quickly handled 100% of customer traffic. If a bug was introduced, engineers would be paged to manually identify the faulty deployment and then roll it back &#8211; a process that could last over an hour. In fact, well over 75% of our outages were triggered by bad deployments. 
To make matters worse, not all problematic changes were even caught by our alerts in the first place.</p><p>To make ourselves resilient to datacenter outages, we began operating critical applications and supporting infrastructure across three separate regions. Conveniently, our deployment tools added a 60-minute delay between regional deployments. This delay allowed engineers to halt the deployment if issues were detected in the first region, effectively turning the first region into a canary that served a whopping 33% of traffic (and only if problems were detected in time). Still, many bad deployments slipped through undetected until they reached region #3.</p><h1><strong>Canary Basics</strong></h1><p>In summary, we faced two closely related challenges:</p><ul><li><p>Knowing whether something is broken.</p></li><li><p>Rolling back a deployment once there is enough signal that it is causing the breakage.</p></li></ul><h2><strong>Knowing Something Is Wrong</strong></h2><p>Even before we can make decisions at deployment time about whether a new software version is healthy to roll out fully, we need to answer: <em>&#8220;How do we know something is broken?&#8221;</em></p><p>Requiring engineers to manually create their own dashboards and alerts not only consumes an excessive amount of their time but also leads to inconsistent monitoring across the company.</p><p>Therefore, even before we invested in <em>deployment</em> automation, we built <em>observability</em> automation that:</p><ol><li><p>Periodically scans our Kubernetes clusters to discover running services.</p></li><li><p>Associates each running service with <a href="https://prometheus.io/">Prometheus</a> metrics that match labels derived from the service&#8217;s unique identifier.</p></li><li><p>Generates dashboard panels and alert specifications from the discovered metrics and pushes them to Grafana.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s3Ti!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s3Ti!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 424w, https://substackcdn.com/image/fetch/$s_!s3Ti!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 848w, https://substackcdn.com/image/fetch/$s_!s3Ti!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 1272w, https://substackcdn.com/image/fetch/$s_!s3Ti!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s3Ti!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png" width="1456" height="528" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:528,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!s3Ti!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 424w, https://substackcdn.com/image/fetch/$s_!s3Ti!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 848w, https://substackcdn.com/image/fetch/$s_!s3Ti!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 1272w, https://substackcdn.com/image/fetch/$s_!s3Ti!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bb3487-8a85-44bd-8ccd-1cdb77788e8c_1600x580.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>An oversimplified data flow</em></p><p>The importance of standardizing on middleware and client libraries that emit consistent metrics cannot be overstated. As a concrete example, a Java service that serves inbound gRPC traffic, reads from CockroachDB, and writes to Kafka would receive a tailored dashboard soon after its initial deployment.</p><p>Some of its critical alerts would cover:</p><ul><li><p><strong>Container health:</strong> CPU request utilization, container memory limits, and excessive container restarts. We leveraged <a href="https://github.com/kubernetes/kube-state-metrics">kube-state-metrics</a>.</p></li><li><p><strong>Application runtime health:</strong> JVM heap utilization, thread pool queue size, and GC pauses.</p></li><li><p><strong>RPCs:</strong> The availability and latency of each exposed gRPC endpoint. 
Users can also enable alerts on egress failures if appropriate.</p></li><li><p><strong>Database client performance:</strong> SQL statement success rate and latency.</p></li><li><p><strong>Kafka client performance:</strong> Enqueue success rate and latency.</p></li><li><p><strong>Excessive volume of error-severity logs</strong>.</p></li><li><p>. . .</p></li></ul><p>This automation meant that when an engineer deployed a service, they automatically received a standardized dashboard featuring highly relevant alerts &#8211; with no upfront investment in manual configuration. Updating a service &#8211; for example, adding a new RPC &#8211; would automatically result in new panels and alerts, so no ongoing maintenance was required either.</p><h2><strong>Rolling Forward Or Rolling Back</strong></h2><p>With the required metrics in place for each service, we can now stand up a rudimentary canary mechanism as follows.</p><p>For each service, we separate its instances into three groups, or partitions: baseline, canary, and main. 
We then manipulate these groups during the following phases of our deployment:</p><ul><li><p><strong>Pre-Deploy:</strong> Only the <em>main</em> partition has running instances &#8211; this is steady-state behavior.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AOsj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AOsj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!AOsj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!AOsj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!AOsj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AOsj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AOsj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!AOsj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!AOsj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!AOsj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb76feb-2dc8-4efa-aa07-b0de79fe7e85_1600x268.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p><strong>Create New Groups:</strong> We spin up additional <em>canary</em> instances running the new code and additional <em>baseline</em> instances running the previous code. The instances in these groups handle live traffic as soon as they are created. 
If there are 3 <em>main</em> instances at steady state, and we add 1 <em>baseline</em> instance and 1 <em>canary</em> instance during the deployment, then we are rolling out new software to 20% of traffic (1 of 5 instances), assuming a round-robin load-balancing policy for our services.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d5nt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d5nt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!d5nt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!d5nt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!d5nt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d5nt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66ac698e-2717-450e-921d-6d27199da73e_1600x268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!d5nt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!d5nt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!d5nt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!d5nt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66ac698e-2717-450e-921d-6d27199da73e_1600x268.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p><strong>Analyze Metrics:</strong> We compare the behavior of the <em>baseline</em> and <em>canary</em> partitions by checking the metrics we already configured for this service, pitting the matching metric from each partition against one another. 
As its name suggests, we use the <em>baseline</em> partition as the basis for comparison (rather than the <em>main</em> partition) so that we can monitor startup issues &#8211; such as unusual memory utilization or caching patterns.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F9ZQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 424w, https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 848w, https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 1272w, https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png" width="1456" height="477" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:477,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 424w, https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 848w, https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 1272w, https://substackcdn.com/image/fetch/$s_!F9ZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a16f7-e01b-4da6-a462-9b25f543f18d_1600x524.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Make The Call: </strong>The system decides whether to roll forward (promote all <em>main</em> partition instances to the new version) or roll back based on failing checks. Roll forward decisions wait for a longer analysis window. 
But we roll back more quickly at the onset of a failing check.</p></li></ul><p><em>Roll forward to:</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8XCO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8XCO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!8XCO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!8XCO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!8XCO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8XCO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8XCO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!8XCO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!8XCO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!8XCO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc7989d6-3e3b-497c-92f9-ae11ef443217_1600x268.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>or roll back to:</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lmDN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lmDN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!lmDN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!lmDN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!lmDN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lmDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png" width="1456" height="244" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!lmDN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 424w, https://substackcdn.com/image/fetch/$s_!lmDN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 848w, https://substackcdn.com/image/fetch/$s_!lmDN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 1272w, https://substackcdn.com/image/fetch/$s_!lmDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0aadd2-dfe5-4226-be1b-09d2bd4e4841_1600x268.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We use the <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U test</a> to determine whether metric fluctuations are significant or simply random noise. 
While we considered a <a href="https://en.wikipedia.org/wiki/Welch%27s_t-test">t-test</a> and <a href="https://en.wikipedia.org/wiki/Dynamic_time_warping">distance functions</a>, the Mann-Whitney U test produced the fewest false positives in our evaluation dataset.</p><p><em><strong>&#128161; Takeaway: </strong>To bootstrap our canary deployment mechanism, we leveraged the same metrics that engineers already relied on to monitor their services in steady-state conditions.</em></p><h2><strong>Initial Results</strong></h2><p>This first pass was effective at catching the most glaring issues. The two most frequent classes of issue were degraded availability or latency for a modified RPC, and crashes or out-of-memory errors.</p><p>Though still primitive, this approach, once adopted for critical services across the company, prevented several dozen production bugs each month. However, numerous gaps remained.</p><h1><strong>Towards Maturity</strong></h1><p>Below are the shortcomings of our original solution and how we addressed each of them over a span of several months (and thousands of deployments). Comparing our Q3 stats against Q2, these changes enabled us to catch 82% more bad releases, while the total number of canary releases increased by only 28%.</p><h2><strong>Granularity</strong></h2><p>Consider the same scheme applied to a larger service that runs on 18 instances at steady state. Adding 1 baseline instance and 1 canary instance during the deployment gives 20 instances in total, so the smallest possible rollout is 5% of traffic (1 of 20). At our scale, this is still higher than desired, exposing too many customers to new code. For example, if we process 1000 orders per minute, the canary handles about 50 orders per minute, so a bad deployment could affect up to 50 x 3 = 150 customers even if we rolled back within 3 minutes (though we can lessen the impact via aggressive retries). Regardless, that&#8217;s far too many unhappy customers.</p><p>Even more problematic were critical services requiring very few instances. 
For a service running only 2 instances, the smallest rollout amounted to 25% (1 out of 4). That is not much better than the 33% from our makeshift canary that waits between regions.</p><p>To address this, we integrated our canary mechanism with our load-balancing solution. Using <a href="https://istio.io/">Istio</a> as our service mesh, we adapted our process to create three sets of <a href="https://istio.io/latest/docs/reference/config/networking/virtual-service/">VirtualService</a> resources (for main, baseline, and canary) to drive proportional traffic, adjusting weights gradually as we progress through the deployment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xd_O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xd_O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 424w, https://substackcdn.com/image/fetch/$s_!xd_O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 848w, https://substackcdn.com/image/fetch/$s_!xd_O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!xd_O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xd_O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png" width="1456" height="1009" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1009,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xd_O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 424w, https://substackcdn.com/image/fetch/$s_!xd_O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 848w, https://substackcdn.com/image/fetch/$s_!xd_O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!xd_O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd92d49f-9e7a-4a61-aab9-5d9b58ad761c_1504x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This decouples our canary rollout granularity from the service&#8217;s horizontal scaling.</p><p><em><strong>&#128161; Takeaway: </strong>Canary deployments must be fully integrated with the chosen load-balancing solution so that the rollout process can be finely tuned, reducing the blast radius of potential failures.</em></p><h2><strong>Canarying for Indirect Failures</strong></h2><p>While guarding each service&#8217;s deployment with its own emitted metrics prevents many common issues, it doesn&#8217;t capture the whole picture.</p><p>Take, for example, a bug in which an important endpoint is inadvertently deleted. 
The canary mechanism would have no matching set of metrics with which to compare endpoint performance between the baseline and canary partitions. Furthermore, it cannot know whether the deletion was intentional (e.g. part of a planned migration). In practice, we have also seen outages in which a backward-incompatible schema change was made and canary instances actually exhibited lower error rates, because only they could process the new schema definitions.</p><p>To address such end-to-end issues, we canary on traces across all our most important services. By tracking availability, latency, and throughput for these traces, we can monitor overall business health.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WsG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WsG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 424w, https://substackcdn.com/image/fetch/$s_!WsG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 848w, https://substackcdn.com/image/fetch/$s_!WsG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WsG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WsG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png" width="1180" height="1600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1600,&quot;width&quot;:1180,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WsG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 424w, https://substackcdn.com/image/fetch/$s_!WsG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 848w, https://substackcdn.com/image/fetch/$s_!WsG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WsG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F207809b1-95f1-4472-9843-7b75944d46ca_1180x1600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>At deployment time, we check for regressions in end-to-end flows alongside service-specific metrics, helping us catch user-impacting outages.</p><p><em><strong>&#128161; Takeaway: </strong>Monitoring end-to-end business operations provides a comprehensive view that safeguards against subtle yet critical issues that are not visible in service-level metrics 
alone.</em></p><h2><strong>Statistical Analysis Algorithm</strong></h2><p>Originally, we used the Mann-Whitney U test as our sole metric evaluation strategy. Later, we collaborated with our data scientists to introduce the Proportional Check algorithm, which uses <a href="https://en.wikipedia.org/wiki/Fisher%27s_exact_test">Fisher&#8217;s exact test</a> under the hood. For our use case, it performs much better on small data sets and provides meaningful insight sooner. It is also easier to understand for checks that operate on proportions (e.g. failed requests as a percentage of total requests).</p><h2><strong>Rollback Criteria</strong></h2><p>The initial version of our canary mechanism would roll back a deployment if any metrics <em>check</em> failed, erring on the side of caution. However, flaky metrics led to many unnecessary rollbacks, much to our engineers&#8217; frustration.</p><p>To better understand our canary deployments, we instrumented:</p><ul><li><p><strong>True positive, false positive, and false negative rates</strong>: these correspond to deployments rolled back due to real bugs, deployments rolled back unnecessarily, and deployments that slipped through despite real bugs. A random sample of releases is labelled manually and the rest are labelled by an LLM-backed classifier, so that we can validate canary performance and track how it changes over time.</p></li><li><p><strong>Time spent in analysis</strong> in each of the above scenarios.</p></li></ul><p>Our aim is to drive up true positive rates and drive down false positive and false negative rates while minimizing analysis time (good deployments should not drag out, and bad deployments should be rolled back as soon as possible).</p><p>Over time, we refined our default checks to optimize the above performance metrics. 
We also made significant investments towards our user interface.</p><p><em>This UI displays failing checks per release:</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zp1U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zp1U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 424w, https://substackcdn.com/image/fetch/$s_!Zp1U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 848w, https://substackcdn.com/image/fetch/$s_!Zp1U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 1272w, https://substackcdn.com/image/fetch/$s_!Zp1U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zp1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png" width="1456" height="1006" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1006,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Zp1U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 424w, https://substackcdn.com/image/fetch/$s_!Zp1U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 848w, https://substackcdn.com/image/fetch/$s_!Zp1U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 1272w, https://substackcdn.com/image/fetch/$s_!Zp1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19185b6a-1092-4c83-9ce8-8b6829f2e656_1600x1105.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>And it empowers users to make further adjustments on their own:</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1JfT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1JfT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 424w, https://substackcdn.com/image/fetch/$s_!1JfT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 848w, 
https://substackcdn.com/image/fetch/$s_!1JfT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 1272w, https://substackcdn.com/image/fetch/$s_!1JfT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1JfT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png" width="1456" height="1275" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac236689-e208-4d97-8e33-b65513591848_1600x1401.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1275,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1JfT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 424w, https://substackcdn.com/image/fetch/$s_!1JfT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 848w, 
https://substackcdn.com/image/fetch/$s_!1JfT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 1272w, https://substackcdn.com/image/fetch/$s_!1JfT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac236689-e208-4d97-8e33-b65513591848_1600x1401.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>As we introduced new classes of metrics checks (e.g. 
error log volumes), and as users fine-tuned thresholds, it became necessary to distinguish mature metrics checks from those still in early development.</p><p>Ultimately, we categorized them into three tiers:</p><ul><li><p><strong>Critical: </strong>A failure triggers an immediate rollback to minimize impact.</p></li><li><p><strong>Warn: </strong>A failure pages the responsible engineer, who can then flag whether it is a real issue (prompting an immediate rollback). Well-performing checks are promoted to &#8220;Critical&#8221;.</p></li><li><p><strong>Debug: </strong>The check result is logged (metrics are collected), but no action is taken. We review these checks after the fact to decide how to modify them or whether to promote them to &#8220;Warn&#8221;.</p></li></ul><p>By granting users both visibility into failing checks and ultimate control over the default rollback parameters, we built user trust and drove wider adoption. At this point, nearly three-quarters of all service deployments across our company are guarded by canary.</p><p>Still to do: implementing backtesting capabilities that use historical data to experiment with new parameters without initiating real deployments.</p><p><em><strong>&#128161; Takeaways:</strong></em></p><ul><li><p><em>Invest in instrumentation upfront. To know where to adjust rollback criteria, it is vital to first track true positives, false positives, false negatives, and analysis time on a per-metric basis.</em></p></li><li><p><em>Tiering enables iteration by distinguishing mature checks from those needing far more refinement before they can override a deployment.</em></p></li><li><p><em>Canary requires a great UI. A good UI gives engineers visibility into why the system made any given decision and is a springboard for them to make adjustments for their own services. 
This garners trust and adoption.</em></p></li></ul><h1><strong>Conclusion</strong></h1><p>At this stage, we are far more confident in our ability to deploy new service code while minimizing adverse impact on our customers. There is still room for improvement. For example, we&#8217;re currently building more sophisticated rollout cohorts and tweaking AI techniques to further tune false positives and negatives.</p>]]></content:encoded></item><item><title><![CDATA[Our journey to affordable logging]]></title><description><![CDATA[The architecture of our in-house Rust based logging engine]]></description><link>https://techblog.atoms.co/p/our-journey-to-affordable-logging</link><guid isPermaLink="false">https://techblog.atoms.co/p/our-journey-to-affordable-logging</guid><pubDate>Tue, 21 Oct 2025 17:00:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!c9Su!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c9Su!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!c9Su!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 424w, https://substackcdn.com/image/fetch/$s_!c9Su!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 848w, https://substackcdn.com/image/fetch/$s_!c9Su!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 1272w, https://substackcdn.com/image/fetch/$s_!c9Su!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c9Su!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png" width="1024" height="939" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:939,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1457518,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/174284229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c9Su!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 424w, https://substackcdn.com/image/fetch/$s_!c9Su!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 848w, https://substackcdn.com/image/fetch/$s_!c9Su!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 1272w, https://substackcdn.com/image/fetch/$s_!c9Su!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41683779-fd85-47e4-806a-ffca0244f1a0_1024x939.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by <a href="https://www.linkedin.com/in/gaurav-nolkha-16416650/">Gaurav Nolkha</a> and <a href="https://www.linkedin.com/in/tomeksroka/">Tomek Sroka</a></em></p><p>When <a href="https://www.tryotter.com">Otter</a> first started to hyperscale, the cost of our logging stack grew out of control. Stackdriver (the GCP logging platform we used at the time) consumed 17% of our monthly cloud bill and was growing 19% month-over-month. To cut costs without giving up observability, we switched to self-managed OpenSearch. That cut costs by 80%, but we still paid far too much ($3.35 per 100GiB per month) and spent significant time on cluster operations.</p><p>Even with a dedicated team optimizing OpenSearch, our clusters frequently degraded, and our engineers started to lose trust in the observability platform. We were eventually forced to declare a code yellow and spent several months hardening the infrastructure, which remained expensive to run. We needed something radically different. Leveraging blob stores for reliable, scalable, cheap storage, we built LogProc: <strong>a logging engine optimized for low operational overhead and cost</strong>.</p><p>The results: LogProc now handles 750+ TiB of logs at 4.4x lower cost than self-hosted OpenSearch, with low operational overhead. 
We&#8217;re 50x cheaper than managed alternatives (like Elastic Cloud), and 70% of our engineers use it daily.</p><h3>Scaling Pains: Our Journey and Hurdles with OpenSearch</h3><p>OpenSearch offered powerful search capabilities, but at our scale its costs ballooned, making it one of our five most expensive services.</p><p>The real killer was operational overhead. OpenSearch demanded specialized expertise just to stay running. We implemented a hot-cold architecture (SSDs for recent data, cheaper storage for archives), which reduced costs but added operational complexity and failure modes. Our team burned countless hours on cluster maintenance, index tuning, and firefighting. And we were losing.</p><p>We eventually realized that OpenSearch&#8217;s unit cost wouldn&#8217;t scale with our growth. We needed a cheaper, simpler, and more reliable alternative. Most of our challenges stemmed from managing our own storage, so we turned to blob storage, the de facto standard for large-scale data infrastructure. This became the foundation for our new logging engine.</p><h2>LogProc: A New Logs Datastore</h2><p>LogProc&#8217;s design centers on five core decisions that directly address OpenSearch&#8217;s cost, reliability, and performance deficiencies.</p><p><strong>Cost Efficiency with Scalable Storage:</strong> With 70% of our costs in storage, we moved to blob storage (Azure Blob) for a sizable cost reduction. Unlike OpenSearch, which requires replicas, blob storage provides built-in durability, eliminating the overhead of managing redundant copies of the data. The trade-off: accessing blob storage is two orders of magnitude slower than fetching data from SSDs, so we must manage query performance through parallel processing and smart caching strategies. This works because most logs are written once and rarely read.</p><p><strong>Simplified Reliability with Stateless Query Service:</strong> Logging must work when everything else is broken. 
We built a stateless query engine that requires zero coordination and starts in under a second; unlike OpenSearch, with its complex shard management, our nodes deploy independently. The trade-off: parallelizing and distributing queries requires thoughtful design, such as using rendezvous hashing to maintain a high cache hit rate.</p><p><strong>Horizontal Scalability:</strong> LogProc scales for both ingestion and querying simply by adding more nodes, with no rebalancing or manual tuning required. With OpenSearch, adding nodes triggered massive shard rebalancing with enormous east-west traffic (intra-cluster data transfer), while removing nodes was a manual, hours-long operation to safely drain shards before scaling down.</p><p><strong>Effective Durability:</strong>&nbsp;We use RocksDB for high-performance local buffering before writing to blob storage. Batching reduces the number of blob API calls and creates fewer but larger blob objects. The trade-off: queries spanning the last ~15 minutes depend on the ingester; if an ingester node is down, results may be incomplete until it recovers.</p><p><strong>Choosing Rust for Performance and Safety:</strong> Rust delivers the performance and safety critical for data-intensive workloads. Despite the learning curve, our cluster now consumes ~10x fewer resources (CPU and memory) than OpenSearch, which suffered from Java&#8217;s GC pauses and memory bloat.</p><p>The result: cheap object storage, minimal indexing, and Rust&#8217;s efficiency created the most cost-effective logging solution we could build. 
Today, 70% of our engineers use LogProc daily via a custom Grafana plugin.</p><h2>Putting it All Together: LogProc Architecture</h2><p>LogProc&#8217;s architecture separates concerns into two independent paths: a stateful ingestion pipeline that batches logs into blob storage, and a stateless query engine that retrieves them on demand.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!20yk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!20yk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 424w, https://substackcdn.com/image/fetch/$s_!20yk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 848w, https://substackcdn.com/image/fetch/$s_!20yk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 1272w, https://substackcdn.com/image/fetch/$s_!20yk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!20yk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png" width="3584" height="2080" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2080,&quot;width&quot;:3584,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:398496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/174284229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e42206a-0f5f-4149-a0d4-28e19c528cdf_3584x2080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!20yk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 424w, https://substackcdn.com/image/fetch/$s_!20yk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 848w, https://substackcdn.com/image/fetch/$s_!20yk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 1272w, https://substackcdn.com/image/fetch/$s_!20yk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc37f470-f4c1-4a01-a297-1305874950ad_3584x2080.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Fig 1</strong>: LogProc architecture showing the ingestion pipeline (red) and the query path (green)</p><p>Before diving into how these components work, let&#8217;s establish the key concepts:</p><p><strong>Logstream:</strong> A logical grouping of logs from the same source, identified by region, namespace, app, and container (e.g., region=centralus, namespace=orders, app=order-central, container=oc). Think of it as a labeled bucket for related logs.</p><p><strong>Block:</strong> A unit of logs from a single logstream (up to 500MiB), uploaded to blob storage. Logs within a block are sorted by timestamp.</p><p><strong>Chunk:</strong> A ~1MiB subdivision within a block that enables partial downloads during queries. 
Chunks cover non-overlapping time ranges.</p><p><strong>Ingester:</strong> Stateful component that batches incoming logs using RocksDB, then uploads blocks to blob storage. It distributes logstreams across instances using <a href="https://en.wikipedia.org/wiki/Rendezvous_hashing">Rendezvous hashing</a>, a.k.a. Highest Random Weight (HRW) hashing, a scheme that minimizes remaps when the set of instances changes.</p><p><strong>Querier:</strong> Stateless component that processes queries by fetching blocks from blob storage (or cache) and merging results. It also queries Ingesters for the most recent logs.</p><p><strong>Note on scope</strong>: This post focuses on LogProc&#8217;s core architecture: ingestion and querying. For brevity, we&#8217;ve omitted two components: the Compactor (background process for merging blocks and enforcing retention) and our Grafana plugin (query interface).</p><h2>Writing Logs</h2><p>The ingestion pipeline balances two competing goals: minimizing storage costs and enabling fast queries. How we <strong>partition, batch, and index</strong> logs during writes directly determines query performance. In other words, storage decisions cannot be separated from query performance.</p><p><strong>Partitioning Strategy: Drastically Reducing Search Space</strong></p><p>Logs are partitioned into logstreams based on source identifiers: region, namespace, app, and container. This is a fundamental improvement over our OpenSearch setup, which was partitioned only by time.</p><p>While OpenSearch&#8217;s inverted indexes made field-based searches efficient, time-only partitioning created severe hotspots. Recent shards received the vast majority of queries since users overwhelmingly search recent logs. This concentrated query load on a few <em>indexes</em> degraded performance for everyone, regardless of cluster capacity.</p><p><strong>The solution:</strong> By partitioning on the app label, we can dramatically narrow the search space. 
When users specify app and namespace (which we encourage in the UI), we only scan logstreams matching those labels. Instead of searching through terabytes of logs from all services, we search only the relevant logstreams, reducing the <strong>data volume</strong> scanned by 10-100x (roughly proportional to the number of applications).</p><p><strong>Ingester Sharding: Sticky Logstreams for Efficiency</strong></p><p>Getting logstreams to the right Ingesters is critical for efficiency. We use <em>sticky logstreams</em>, ensuring logs from the same logstream go to the same Ingester. If the chosen Ingester fails to accept entries, the receiving Ingester stores them locally as a fallback, so no data is lost.</p><p>When an Ingester receives a batch of log entries from any source, it groups them by logstream and uses Rendezvous hashing to determine which Ingester should handle each logstream. It then forwards entries to the appropriate Ingesters via internal gRPC calls. Each Ingester maintains a live list of all active Ingesters in the cluster through service discovery.</p><p><strong>Why Rendezvous hashing?</strong> Unlike simple hash-mod approaches that remap nearly all logstreams when scaling, Rendezvous hashing minimizes disruption: only about L/N logstreams need remapping when adding or removing an Ingester (where L = number of logstreams, N = number of Ingesters).</p><p>Routing each logstream to a single Ingester:</p><ol><li><p>Builds denser blocks for better compression and fewer blob storage PUT API calls.</p></li><li><p>Avoids rebalancing storms: adding an Ingester remaps only &#8776;L/N streams.</p></li><li><p>Simplifies failure handling: if an Ingester is down, senders temporarily buffer or retry; once it is back up, it resumes from its local state.</p></li></ol><p><strong>Batching and Block Structure: Enabling Partial Downloads</strong></p><p>Each Ingester handles logs from multiple logstreams simultaneously. 
For each logstream, it accumulates logs in RocksDB (local persistent storage) until one of two conditions is met:</p><ol><li><p>Size threshold: 500MiB of logs have accumulated for that logstream.</p></li><li><p>Time threshold: 15 minutes have elapsed since the first log accumulated for that logstream.</p></li></ol><p><strong>Why batching matters for cost:</strong> Blob storage charges per PUT API call. Writing individual log lines would generate millions of expensive API calls per day. By batching into blocks of up to 500MiB, we reduce API calls by orders of magnitude: a single PUT operation replaces tens of thousands of individual writes.</p><p>Once a threshold is reached, the Ingester creates a block for that specific logstream, with logs organized into ~1MiB chunks. Each chunk covers a non-overlapping time range, and logs are chronologically sorted within chunks.</p><p><strong>Why chunks matter for queries:</strong> When a query targets a specific time range (or uses our free-text index), the query engine can download only the relevant chunks instead of the entire block. This dramatically reduces data transfer and speeds up queries.</p><p>Each block and its indexes are uploaded to blob storage, and metadata (logstream identifier, time range, block ID) is stored in PostgreSQL. This means an Ingester is continuously creating and uploading blocks for different logstreams as they hit their respective thresholds.</p><p><strong>The durability trade-off:</strong> RocksDB on persistent volumes acts as a write-ahead log (WAL), ensuring logs are durable even before blob upload. However, the most recent logs (up to the last 15 minutes) haven&#8217;t reached blob storage yet and exist only on Ingesters.</p><p>For queries covering recent time ranges, the query engine must fetch data from both blob storage and active Ingesters. 
Since we don&#8217;t run Ingester replicas, if an Ingester pod goes down, real-time queries will return incomplete results for that pod&#8217;s logstreams until it recovers. When the pod restarts, the Ingester recovers the buffered logs from RocksDB and resumes uploading to blob storage.</p><p>In practice, this hasn&#8217;t been a concern: Ingester nodes have been stable, and temporary query incompleteness during node restarts is acceptable for our use case. If needed, we could add HA capabilities, but the current trade-off keeps operational complexity minimal.</p><p><strong>Index Creation: Filtering Before Scanning</strong></p><p>During block creation, the Ingester builds several indexes that allow the query engine to skip irrelevant blocks entirely:</p><p><strong>Field Path Index:</strong> An XOR filter (a compact probabilistic membership filter, similar to a Bloom filter, for &#8220;might contain&#8221; tests) that stores all field paths present in the block (e.g., payload.fields.operation). If a query searches for a field that doesn&#8217;t exist in the block, we skip the block without downloading it.</p><p><strong>ID Index:</strong> An XOR filter, like the Field Path Index, that helps determine whether a block might contain a searched value. It is built from values extracted from log entries via predefined regular expressions. It significantly accelerates &#8220;needle in a haystack&#8221; queries, such as searching for UUIDs or trace IDs, where the desired value is expected in very few locations.</p><p><strong>Free Text Index:</strong> An FST (<a href="https://en.wikipedia.org/wiki/Finite-state_transducer">Finite State Transducer</a>: an automaton for fast token/prefix matching) that stores tokenized log content. It enables fast free-text searches by identifying which chunks within a block contain the search terms. 
Tokens are grouped into 2-token windows (e.g., &#8220;order not found&#8221; becomes &#8220;order not&#8221; + &#8220;not found&#8221;) to improve multi-word search accuracy.</p><p>These indexes answer the question: &#8220;Does this block contain relevant data?&#8221; before we spend time and bandwidth downloading it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BYLs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BYLs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 424w, https://substackcdn.com/image/fetch/$s_!BYLs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 848w, https://substackcdn.com/image/fetch/$s_!BYLs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 1272w, https://substackcdn.com/image/fetch/$s_!BYLs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BYLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif" width="1200" height="609" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1702428,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/174284229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BYLs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 424w, https://substackcdn.com/image/fetch/$s_!BYLs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 848w, https://substackcdn.com/image/fetch/$s_!BYLs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 1272w, https://substackcdn.com/image/fetch/$s_!BYLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2b5f0b-02a2-45f8-8354-2cdc1ec32fc1_1200x609.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Why This Beats OpenSearch</strong></p><p>Our OpenSearch setup also used size/time-based batching, but LogProc differs in critical ways:</p><p><strong>Replica overhead eliminated:</strong> Blob storage provides built-in durability. OpenSearch requires both storing replica copies of the data and running replica compute nodes, doubling storage costs and compute overhead.</p><p><strong>Decoupled storage and compute:</strong> Our Ingesters are purely computational; they buffer and batch, but don&#8217;t store data long-term. OpenSearch nodes were tightly coupled: losing a node meant both losing compute capacity and triggering complex shard rebalancing.</p><p><strong>Simpler failure handling:</strong> When an Ingester fails, Rendezvous hashing routes logstreams to other instances with minimal remapping. 
No shard rebalancing, no replica promotion, no cluster coordination. OpenSearch&#8217;s shard allocation and replica management under load often led to degraded cluster states.</p><p>The result: we achieve efficient batching and fast queries without the operational complexity and cost overhead of maintaining a replicated cluster. Our largest cluster has about 9000 logstreams; batching results in ~1 PUT API call every 15 minutes per logstream, and indexes filter out ~95% of blocks on average.</p><h2><strong>Querying Logs</strong></h2><p>The query engine leverages the partitioning, chunking, and indexing decisions made during ingestion to minimize data transfer and maximize cache effectiveness.</p><h3>Query Flow</h3><p>When a user submits a query (e.g., &#8220;show me errors from app=orders in the last 24 hours&#8221;):</p><p><strong>1. Metadata Lookup:</strong> The lead Querier (the one that received the request) queries PostgreSQL to identify relevant blocks based on:</p><ul><li><p>Logstream filters (app, namespace, region, container)</p></li><li><p>Time range</p></li></ul><p>The lookup returns block metadata (location, time range, block ID).</p><p><strong>2. Query Recent Logs:</strong> The query fans out to all Ingester pods to fetch logs from the last 15 minutes that haven&#8217;t been uploaded to blob storage yet.</p><p><strong>3. Work Distribution:</strong> Using Rendezvous hashing, the lead Querier assigns blocks to Querier instances for parallel processing. Each block&#8217;s hash determines which Querier node will handle it; the same block always routes to the same Querier to maximize the cache hit ratio.</p><p><strong>4. 
Block Filtering with Indexes:</strong> Before downloading any block, each Querier checks the block&#8217;s indexes (stored separately in blob storage):</p><ul><li><p><strong>Field Path Index:</strong> Does this block contain the queried fields?</p></li><li><p><strong>ID Index:</strong> Does this block contain the searched IDs?</p></li><li><p><strong>Free Text Index:</strong> Which chunks contain the search terms?</p></li></ul><p>Blocks that don&#8217;t match are skipped entirely. For matching blocks, only relevant chunks are downloaded.</p><p><strong>5. Local Processing:</strong> Each Querier:</p><ul><li><p>Downloads assigned chunks from blob storage (or retrieves them from the local cache)</p></li><li><p>Scans the chunk data for matching logs</p></li><li><p>Streams results back to the lead Querier</p></li></ul><p><strong>6. Result Merging:</strong> The lead Querier merges time-sorted streams from all worker Queriers and Ingesters, returning the final response to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KfKz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KfKz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 424w, https://substackcdn.com/image/fetch/$s_!KfKz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 848w, 
https://substackcdn.com/image/fetch/$s_!KfKz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 1272w, https://substackcdn.com/image/fetch/$s_!KfKz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KfKz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif" width="1200" height="563" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2094018,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/174284229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KfKz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 424w, 
https://substackcdn.com/image/fetch/$s_!KfKz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 848w, https://substackcdn.com/image/fetch/$s_!KfKz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 1272w, https://substackcdn.com/image/fetch/$s_!KfKz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecce1a4-22fc-4d52-b3ff-e0a925848ac3_1200x563.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Caching for Performance</strong></p><p>Frequently accessed blocks are cached locally on Querier pods. Rendezvous hashing ensures the same blocks always route to the same Queriers, dramatically improving cache hit rates. This means:</p><ul><li><p>Popular logstreams (actively debugged services) stay hot in cache</p></li><li><p>Repeated queries return results in milliseconds instead of seconds</p></li><li><p>Blob storage egress costs drop</p></li></ul><p><strong>Stateless Design Benefits</strong></p><p>Unlike OpenSearch&#8217;s stateful query coordinators, our Queriers are completely stateless:</p><ul><li><p>No cluster coordination or leader election needed</p></li><li><p>Queriers can be added or removed instantly; just scale the deployment</p></li><li><p>Node failures don&#8217;t require recovery; queries simply route elsewhere</p></li><li><p>No &#8220;yellow&#8221; state degradation under load</p></li></ul><p>The trade-off: performance depends on cache hit rates. Because LogProc does not build a massive inverted index, it needs to fetch and consult the indexes for each block, download the blocks that match, and scan them. In practice, we&#8217;ve seen that indexes filter out ~95% of blocks. In addition, parallelizing work and caching significantly reduce query times, even when searching across all logstreams. We keep index caches warm for the last 7 days, maintaining a nearly 99% cache hit ratio for queries over that period.</p><p>The result: a simple query engine that outperforms our OpenSearch clusters. For our biggest cluster, P50 query latency is ~200ms and P95 is ~6s.</p><h2><strong>Conclusion</strong></h2><p>Building LogProc required challenging fundamental assumptions about logging infrastructure. 
By separating compute from storage and designing around blob storage&#8217;s limitations, we achieved both dramatic cost reduction and operational simplicity.</p><p>The key architectural decisions that made this possible were app-based partitioning to reduce search space, sticky logstreams to minimize API calls, aggressive batching to optimize blob storage costs, strategic indexing to skip irrelevant data, and stateless Queriers for trivial scaling.</p><p>Today, LogProc handles 750 TiB of logs at $0.75 per 100GiB, which is 4.4x cheaper than our self-managed OpenSearch and 50x cheaper than managed alternatives. More importantly, it requires virtually zero operational overhead: no shard rebalancing, no replica management, no cluster coordination, no degraded states.</p><p><strong>The trade-off:</strong> We lost OpenSearch&#8217;s semantic search capabilities. For our use case of debugging production issues, this was acceptable. The cost savings and elimination of operational headaches far outweighed losing a feature we rarely used.</p><h3>What&#8217;s Next: Open Source LogProc</h3><p>We&#8217;re preparing to open source LogProc in the coming months. As we finalize licensing, packaging, and documentation, we&#8217;d love your input, particularly from teams running 100+ TiB of logs. 
Reach out to help shape the release.</p>]]></content:encoded></item><item><title><![CDATA[Study and Update on GenAI DevEx]]></title><description><![CDATA[Internal GenAI developer tools, and learnings about vendors]]></description><link>https://techblog.atoms.co/p/study-and-update-on-genai-devex</link><guid isPermaLink="false">https://techblog.atoms.co/p/study-and-update-on-genai-devex</guid><pubDate>Tue, 09 Sep 2025 13:03:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qM_e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qM_e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!qM_e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 424w, https://substackcdn.com/image/fetch/$s_!qM_e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 848w, https://substackcdn.com/image/fetch/$s_!qM_e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 1272w, https://substackcdn.com/image/fetch/$s_!qM_e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qM_e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png" width="1125" height="750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1125,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:983561,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qM_e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 424w, https://substackcdn.com/image/fetch/$s_!qM_e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 848w, https://substackcdn.com/image/fetch/$s_!qM_e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 1272w, https://substackcdn.com/image/fetch/$s_!qM_e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51053ef7-3721-4e75-90c9-465e528413dc_1125x750.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by <a href="https://www.linkedin.com/in/vishruthashok/">Vishruth Ashok</a> (Dev Platform) and <a href="https://www.linkedin.com/in/brian-attwell/">Brian Attwell</a>. Input from <a href="https://www.linkedin.com/in/alexander-filipchik-7946894/">Alex Filipchik</a>, <a href="https://www.linkedin.com/in/limichael/">Michael Li</a>, and many others</em></p><p>At Atoms, we&#8217;ve been proactively investing in ways to accelerate our productivity and improve our Developer Experience (DevEx) with internally and externally built GenAI tools (section: <a href="https://techblog.cloudkitchens.com/i/172980576/comparison-of-genai-vendors-cloudkitchens">Comparison of GenAI Vendors</a>), even though it is clear that many of the industry opinions on this tech are overblown hype (section: <a href="https://techblog.cloudkitchens.com/i/172980576/marketing-hype">Marketing Hype</a>).</p><p>Our internal studies have shown that coding assistants have had persistent categorical limitations over time. Yet subjectively, their usability and effectiveness are increasing. Engineers with high baseline productivity report sustained median savings of 3 hours per week, with bursts of larger savings, when using off-the-shelf GenAI tools (section: <a href="https://techblog.cloudkitchens.com/i/172980576/productivity-impact-by-use-case">Productivity Impact by Use-Case</a>). Deep dives into engineers&#8217; workflows corroborated this median uplift.</p><p>We were initially concerned that these tools would lead to decreased reliability. 
We&#8217;re finding little evidence of this downside as usage continues amongst seasoned engineers (section: <a href="https://techblog.cloudkitchens.com/i/172980576/adverse-quality-impact-not-found">Adverse Quality Impact Not Found</a>).</p><p>This blog post will cover how we are pragmatically vetting and evaluating these tools and driving widespread adoption to accelerate our developer base.</p><p>The space evolves rapidly, with new players and tools popping up seemingly every month. There is no clear winner in the market for GenAI coding assistants, and there appears to be room for disruption of the current market leaders. When we build GenAI into the core of our platforms to improve developer experience, we won&#8217;t do it in a way that couples us to any particular company.</p><h1>Our Journey So Far</h1><p>We&#8217;ve been early and consistent adopters of new GenAI tools and capabilities. This extended to Developer Experience, which we illustrate in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZHI3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZHI3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 424w, https://substackcdn.com/image/fetch/$s_!ZHI3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZHI3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!ZHI3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZHI3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png" width="1456" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:293321,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZHI3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZHI3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 848w, https://substackcdn.com/image/fetch/$s_!ZHI3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!ZHI3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f63d14-5399-4a7e-bec3-0d8f058c689e_4384x1038.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For example, early on, we built our LLM Gateway. Investing in this platform wrapper allowed all our engineers to interact with a variety of LLMs without spending time on the burdensome scaffolding of integrating with vendors while allowing us to manage rate limits and have fine-grained usage and cost reporting.</p><h1>Marketing Hype</h1><p>As we steadily integrate GenAI tools into our daily workflows, we&#8217;ve been noticing some other early adopters expressing extreme positions. 
This started raising questions for us.</p><p><strong>Massive amounts of useful code generated by &#8220;10x engineers&#8221; - <a href="https://x.com/paulg/status/1953289830982664236">Source</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bNqO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bNqO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 424w, https://substackcdn.com/image/fetch/$s_!bNqO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 848w, https://substackcdn.com/image/fetch/$s_!bNqO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 1272w, https://substackcdn.com/image/fetch/$s_!bNqO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bNqO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png" width="1174" height="664" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bNqO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 424w, https://substackcdn.com/image/fetch/$s_!bNqO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 848w, https://substackcdn.com/image/fetch/$s_!bNqO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 1272w, https://substackcdn.com/image/fetch/$s_!bNqO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb49fabe0-9941-4168-8e2d-30fadfd09709_1174x664.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Who is reviewing this generated code? 
How is the nameless hotshot&#8217;s code modified and maintained with confidence?</figcaption></figure></div><p><strong>The age of superbuilders has begun - <a href="https://x.com/mckaywrigley/status/1830753005416915153">Source</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d6sf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d6sf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 424w, https://substackcdn.com/image/fetch/$s_!d6sf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 848w, https://substackcdn.com/image/fetch/$s_!d6sf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 1272w, https://substackcdn.com/image/fetch/$s_!d6sf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d6sf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png" width="1178" height="770" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:770,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d6sf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 424w, https://substackcdn.com/image/fetch/$s_!d6sf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 848w, https://substackcdn.com/image/fetch/$s_!d6sf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 1272w, https://substackcdn.com/image/fetch/$s_!d6sf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae049ef-5451-48e5-85f4-e802bcb20d86_1178x770.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Why aren&#8217;t we seeing more case studies about these successful businesses? 
Odd that we only see posts without proof?</figcaption></figure></div><p><strong>Strong top-down AI-usage mandates - <a href="https://x.com/tobi/status/1909251946235437514?lang=en">Source</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OJaZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OJaZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 424w, https://substackcdn.com/image/fetch/$s_!OJaZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 848w, https://substackcdn.com/image/fetch/$s_!OJaZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 1272w, https://substackcdn.com/image/fetch/$s_!OJaZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OJaZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png" width="1186" height="1230" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1230,&quot;width&quot;:1186,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OJaZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 424w, https://substackcdn.com/image/fetch/$s_!OJaZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 848w, https://substackcdn.com/image/fetch/$s_!OJaZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 1272w, https://substackcdn.com/image/fetch/$s_!OJaZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccd909-d2c7-4c4a-835c-82cecaaf33d7_1186x1230.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Do these tools automatically make everyone a much stronger engineer in every situation? Does it need to be mandated?</figcaption></figure></div><p>With so much hype, we realized we needed to immerse ourselves and perform our own study to uncover the truth. We summarize some of our findings in the sections that follow.</p><h1>Comparison of GenAI Vendors @ Atoms</h1><p>We continuously reassess our tools and strategies to stay on top of newer capabilities and opportunities. We hold off on rolling tools out widely until they&#8217;re useful, so engineers aren&#8217;t put off by negative first impressions. 
Below is our latest assessment as of August 2025.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5FH-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5FH-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 424w, https://substackcdn.com/image/fetch/$s_!5FH-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 848w, https://substackcdn.com/image/fetch/$s_!5FH-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 1272w, https://substackcdn.com/image/fetch/$s_!5FH-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5FH-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png" width="1456" height="1706" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1706,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1499949,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5FH-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 424w, https://substackcdn.com/image/fetch/$s_!5FH-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 848w, https://substackcdn.com/image/fetch/$s_!5FH-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 1272w, https://substackcdn.com/image/fetch/$s_!5FH-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16212de9-7f1c-4639-bf67-0ac5e54ae664_2030x2378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <em>CK Grade&#8482;</em> is a relative score, with A being the maximum value. There is no A+ since &#8220;There&#8217;s always room for improvement&#8221; according to my high school English teacher.</p><p>Since developer satisfaction with these tools has been high, adoption has been fairly organic, with no need for strong top-down mandates. We&#8217;ve run a couple of lightweight AI enablement sessions, but mostly developers see others using the tools effectively and being more productive, get curious, and try them out themselves.</p><h1>Productivity Impact by Use-Case</h1><p>Our baseline engineering productivity is high. CK engineering candidates are mainly sourced from big tech or highly successful startups. And our interview pass rate is only 1-3% despite strong sourcing pedigree. 
Of these engineers, our study group included only those with 5+ years of coding experience.</p><p>We measure a basket of metrics and countermeasures to try to understand impact. No single measure is sufficient. For example, engineers who opted into using GenAI tools experienced a <strong>10-15%</strong> uplift in lines of code (LOC) shipped. But without the ability to control for participant enthusiasm, this single measure doesn&#8217;t clearly show causality. Nor is more LOC necessarily a good thing.</p><p>Therefore, we also tracked <a href="https://queue.acm.org/detail.cfm?id=3454124">SPACE</a> metrics, DAU, MAU, number of completions, and self-reported time savings, and did a number of deep dives into individual engineering workflows to try to overcome self-reporting bias (see <a href="https://arxiv.org/abs/2507.09089">study</a>).</p><p>Our conclusion from these composite signals is that our engineers are saving a median of 3 hours per week as a result of using GenAI DevEx tools like Cursor, compared to a baseline of conventional development + ChatGPT usage. And the time savings vary widely by activity. 
We corroborated this via individual deep-dives.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aDM2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aDM2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 424w, https://substackcdn.com/image/fetch/$s_!aDM2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 848w, https://substackcdn.com/image/fetch/$s_!aDM2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 1272w, https://substackcdn.com/image/fetch/$s_!aDM2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aDM2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png" width="1456" height="2062" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2062,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1217953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aDM2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 424w, https://substackcdn.com/image/fetch/$s_!aDM2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 848w, https://substackcdn.com/image/fetch/$s_!aDM2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 1272w, https://substackcdn.com/image/fetch/$s_!aDM2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8697c101-8d43-4436-abe0-d35dab04748d_1558x2206.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Notably, we have not been able to tie any of these improvements back to overall engineering project velocity. We believe this is because active coding typically takes &lt;25% of engineers&#8217; time (see <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2024/11/Time-Warp-Developer-Productivity-Study.pdf">study</a>). And even while actively writing code, only a small subset of the time is typically spent on tasks where these tools shine.</p><h2>Adverse Quality Impact Not Found</h2><p>GenAI-produced code can contain subtle bugs or shocking naivety that humans would not produce themselves. Engineers also find GenAI especially useful for generating code in domains they are unfamiliar with, for example backend engineers using GenAI to generate frontend code. 
As a result, it&#8217;s easy to imagine that issues in GenAI-produced code might slip past the engineers using GenAI and into our production codebase undetected.</p><p>At this early stage, we haven&#8217;t seen evidence of quality issues impacting production (incidents, bugs, etc.). To find other leading indicators of risk, we surveyed teammates of the heaviest GenAI users at the company.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vcDT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vcDT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 424w, https://substackcdn.com/image/fetch/$s_!vcDT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 848w, https://substackcdn.com/image/fetch/$s_!vcDT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 1272w, https://substackcdn.com/image/fetch/$s_!vcDT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vcDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png" width="728" 
height="189.34529147982062" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:1338,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:136907,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vcDT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 424w, https://substackcdn.com/image/fetch/$s_!vcDT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 848w, https://substackcdn.com/image/fetch/$s_!vcDT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 1272w, https://substackcdn.com/image/fetch/$s_!vcDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167c6853-a95b-4e6a-be09-54a1f901148d_1338x348.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We asked a number of related questions to try and uncover problems. 
The most common complaint: teammates more frequently see code whose style does not match our codebase. This seems minor and fixable.</p><p>We haven&#8217;t seen a serious trend of more bugs being discovered during review or of overall reviewer burden increasing.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zMO4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zMO4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 424w, https://substackcdn.com/image/fetch/$s_!zMO4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 848w, https://substackcdn.com/image/fetch/$s_!zMO4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 1272w, https://substackcdn.com/image/fetch/$s_!zMO4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zMO4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png" width="1334" height="276" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:276,&quot;width&quot;:1334,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119966,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zMO4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 424w, https://substackcdn.com/image/fetch/$s_!zMO4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 848w, https://substackcdn.com/image/fetch/$s_!zMO4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 1272w, https://substackcdn.com/image/fetch/$s_!zMO4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8648b96b-75ac-47e8-b76c-2e1b75512a09_1334x276.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!u23Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u23Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 424w, https://substackcdn.com/image/fetch/$s_!u23Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 848w, https://substackcdn.com/image/fetch/$s_!u23Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 1272w, https://substackcdn.com/image/fetch/$s_!u23Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u23Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png" width="1332" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1332,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u23Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 424w, https://substackcdn.com/image/fetch/$s_!u23Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 848w, https://substackcdn.com/image/fetch/$s_!u23Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 1272w, https://substackcdn.com/image/fetch/$s_!u23Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045d1bd9-65e5-4989-914d-6049c239e6f4_1332x342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Interesting GenAI Applications</h1><p>There are some basic GenAI applications involving summarization and contextual chatbots that index relevant data from Confluence, Slack, Google Docs and GitHub.</p><p>We wanted to explore avenues beyond that. Here are some use-cases that are already live.</p><h2>On-call AI</h2><p>Rather than restricting GenAI to code-writing, we built an agent that queries observability data, metrics, and logs, giving engineers the insights they need to quickly pinpoint and mitigate incidents. In early experiments the agent has been effective, identifying incident root causes in one or two shots. 
Future improvements in this space include evolving beyond the text-based chat interface, improving context management, and fine-tuning custom LLMs specifically for observability data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mJTQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mJTQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 424w, https://substackcdn.com/image/fetch/$s_!mJTQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 848w, https://substackcdn.com/image/fetch/$s_!mJTQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!mJTQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mJTQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png" width="1456" height="1276" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1276,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mJTQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 424w, https://substackcdn.com/image/fetch/$s_!mJTQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 848w, https://substackcdn.com/image/fetch/$s_!mJTQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!mJTQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e065cb-e25a-4aea-8686-32df97f3a721_1568x1374.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can pull something like this off meaningfully because of our past efforts to standardize our CI/CD processes and overall service observability.</p><h2>Monorepo Agent</h2><p>This is a GenAI tool equipped with MCP servers so that it understands the entire software development process at Atoms: branch creation, local iteration, PR creation and CI validation.</p><p>Our first usage of this tool was on toil-heavy maintenance tasks, such as cleaning up static analysis suppressions that had been lingering in our backlog.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bWVr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!bWVr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!bWVr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!bWVr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!bWVr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bWVr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!bWVr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!bWVr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!bWVr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!bWVr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47ecc3c-7593-4eb9-97cf-b615078c75a4_1600x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>AnalyticsGPT</h2><p>This is our take on the next generation of data analysts. It&#8217;s a tool designed to answer data queries in natural language while retaining the user&#8217;s access controls, and it shows its chain of thought so you can trust and verify the analysis.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q3IL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q3IL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 424w, https://substackcdn.com/image/fetch/$s_!q3IL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 848w, https://substackcdn.com/image/fetch/$s_!q3IL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 1272w, https://substackcdn.com/image/fetch/$s_!q3IL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!q3IL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png" width="1456" height="827" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2458690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q3IL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 424w, https://substackcdn.com/image/fetch/$s_!q3IL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 848w, https://substackcdn.com/image/fetch/$s_!q3IL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 1272w, https://substackcdn.com/image/fetch/$s_!q3IL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34839f3c-85e8-4a08-9fa8-014af683e121_3104x1764.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Superset Agent</h2><p>We find that impact and adoption are organically high when we add targeted features where developers already spend a lot of time. 
An example is our agentic Superset assistant, which comes with full knowledge of the dataset and the table schema built in.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iQL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iQL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 424w, https://substackcdn.com/image/fetch/$s_!2iQL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 848w, https://substackcdn.com/image/fetch/$s_!2iQL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 1272w, https://substackcdn.com/image/fetch/$s_!2iQL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2iQL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png" width="1416" height="766" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1416,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:215889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/172980576?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2iQL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 424w, https://substackcdn.com/image/fetch/$s_!2iQL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 848w, https://substackcdn.com/image/fetch/$s_!2iQL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 1272w, https://substackcdn.com/image/fetch/$s_!2iQL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e0e15d-88c9-445d-b58d-09c3b3ee6101_1416x766.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>Future DevEx Platform Explorations</h1><h2><strong>From Single Agent to Distributed Execution</strong></h2><p>As GenAI tools evolve beyond simple code completion, we're seeing a shift toward distributed agent systems that pass state and partial task completion between specialized components. This approach mirrors how software teams coordinate work, but compressed into a single developer's workflow. 
It&#8217;s almost like we&#8217;re reinventing how teams collaborate, with each of us playing the role of an Engineering Manager.</p><p>Investing in ephemeral development environments could pay off here to avoid being constrained by local resources.</p><h2><strong>Investing More in Evals and Observability</strong></h2><p>Because of the inherent non-determinism of today&#8217;s LLMs, we need to build in mechanisms that tell us how well an LLM use case is performing and let us monitor its quality and adoption over time.</p><h2><strong>E2E Agentic Campaign Management</strong></h2><p>Right now, we have an agent that understands our end-to-end developer lifecycle, but as always, we are hungry for more.</p><p>We want to make it easier to run larger campaigns that require several dependent changes, with more granular checkpoint validation. This will allow us to make the leap from smaller one-off changes, like fixing specific bugs, to full-fledged framework migrations - from an individual productivity boost to a collaborative workflow with org-wide impact.</p><p>One concrete test case where we&#8217;d like to prove this out is the white-glove migration of our ML and DS tech stack&#8217;s package manager from <a href="https://python-poetry.org/">Poetry</a> to <a href="https://github.com/astral-sh/uv">uv</a>, specifically to validate the claimed performance improvements.</p><h1>Conclusion</h1><p>We are very excited about this new technology, which makes developers feel like we have superpowers. We&#8217;re optimistic about the future, but we&#8217;re not seeing the &#8220;fall of software engineering&#8221; or that &#8220;SaaS is dead&#8221;, as tech social media would have you believe. We&#8217;re actually seeing even more reward for good engineering fundamentals.</p><p>We&#8217;re publishing this blog post at the risk of it being out-of-date within a few weeks (if we&#8217;re lucky). 
But then again, maybe we could just have AI generate the next iteration autonomously. Right?</p>]]></content:encoded></item><item><title><![CDATA[Cloudless Blob: Portable and Cloud Agnostic Blob Storage]]></title><description><![CDATA[Transparently migrating hundreds of buckets across clouds]]></description><link>https://techblog.atoms.co/p/cloudless-portable-blob</link><guid isPermaLink="false">https://techblog.atoms.co/p/cloudless-portable-blob</guid><pubDate>Tue, 05 Aug 2025 15:01:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fNfL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fNfL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fNfL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 424w, https://substackcdn.com/image/fetch/$s_!fNfL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 848w, https://substackcdn.com/image/fetch/$s_!fNfL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fNfL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fNfL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png" width="913" height="610" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:913,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1163997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://techblog.atoms.co/i/167728027?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fNfL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 424w, https://substackcdn.com/image/fetch/$s_!fNfL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 848w, https://substackcdn.com/image/fetch/$s_!fNfL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fNfL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c7e735-5e76-4286-bf4b-56ba6ce02bd3_913x610.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Written by <a href="https://www.linkedin.com/in/fmogensen/">Frederik Mogensen</a>, member of the storage team who led the development of the Bucket-Gateway infrastructure.</em></p><p>Two years ago, we started a major cloud migration project between two large cloud providers. 
We migrated our entire microservice stack, with hundreds of services, Kubernetes clusters, databases, low-level infra, and streaming/batch pipelines, in less than a year. And we did it without downtime or material involvement from product engineering teams.</p><p>In this post, we focus on our strategy for migrating hundreds of buckets and petabytes of data across cloud providers. Discover how a cloud-agnostic, portable, and secure blob storage solution made this possible while solving other major problems for us.</p><h1>Initial design requirements</h1><p>The main requirement for the cloud migration solution was to be able to move all data from one cloud provider to another without noticeable disruption for the clients.</p><p>To accomplish this, we identified the need for the following three key properties:</p><ul><li><p>A uniform data protocol on top of each cloud provider.</p></li><li><p>A solution to transparently move data between providers without disrupting data access.</p></li><li><p>A global authentication and authorization system for all requests, no matter which provider a client tries to access.</p></li></ul><h3>The uniform data protocol</h3><p>We wanted to ensure that our applications would run seamlessly on any cloud provider going forward. This meant that the blob storage protocol used in all our in-house and open source projects could stay the same, no matter the underlying cloud provider. There is currently no shared standard protocol that works on all providers. Each cloud provider has its own blob storage infrastructure with custom APIs, plus client libraries for each language, each with its own caveats and bugs.</p><h3>Transparent data portability</h3><p>The solution must allow data to move seamlessly behind the scenes, invisibly from the client&#8217;s point of view. It should behave like a modern cloud-native storage engine such as CockroachDB, where we define the desired data placement and the data moves automatically, 
rather than like old-school hard drives, where data stays where it was written unless it is manually copied around.</p><h3>Security and Audit</h3><p>Blob storage stores and manages sensitive data for all parts of our business. Therefore, our new blob service needed a strong authentication and authorization strategy. By leveraging the SPIFFE network identities that all our applications already get from Istio, we can implement a zero-trust blob storage system that works for all cloud providers. No more access key distribution and rotation, and no more provider-specific service accounts for each microservice.</p><p>Finally, our new design should also include uniform observability and audit logging for all backing cloud providers. The new blob storage implementation should know when any data was accessed, whether by applications or developers, and be able to generate cost insights down to the single-blob level.</p><h1>Architecting for portability and scalability</h1><p>To create a cloud-agnostic layer that works for all applications, we decided to implement (most of) the S3 protocol. 
This protocol is very well documented and supported by most modern applications.</p><p>Our new Blob Storage architecture consists of a set of management components, and a new horizontally scalable stateless gateway.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TBOl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TBOl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 424w, https://substackcdn.com/image/fetch/$s_!TBOl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 848w, https://substackcdn.com/image/fetch/$s_!TBOl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!TBOl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TBOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png" width="1456" height="963" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:963,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TBOl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 424w, https://substackcdn.com/image/fetch/$s_!TBOl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 848w, https://substackcdn.com/image/fetch/$s_!TBOl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!TBOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1208361c-68d5-42df-a16d-c4c2374f3476_1600x1058.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Request flow from Client Application to backing blob bucket</em></p><h2>The Bucket-Gateway</h2><p>The Bucket-Gateway is designed to be S3 compatible and capable of fronting multiple cloud providers, such as Azure Blob Storage, Google Cloud Storage, and Amazon S3. This new component allows us to serve logical buckets that can be routed to actual cloud buckets on different cloud providers. Because the Bucket-Gateway enforces Authentication (AuthN) and Authorization (AuthZ), it also gives us a single place to control access to our data, no matter which cloud provider the data is actually stored on.</p><h3>Routing</h3><p>The Bucket-Gateway keeps the complete list of logical buckets (e.g. <code>my-frontend</code> or <code>postgresql-backups</code>), and the mapping from each logical bucket to the actual backing cloud bucket that requests are routed to (e.g. 
<code>azure-cloudkitchens-frontend-xyz</code> or <code>gcp-cloudkitchens-pg-backups-123</code>).</p><p>This allows us to swap the backing bucket to another region or cloud at any time, without having to re-deploy, re-configure, or even inform our clients and stakeholders. It also removes the requirement, imposed by most cloud providers, that every bucket name be globally unique.</p><h3>Authentication and Authorization</h3><p>To make onboarding easy for all client applications, we decided to support two authentication methods: the standard S3 method using an Access Key ID and Secret Key (Blob Access Keys), and a new option using SPIFFE identities from Istio. The Blob Access Keys are used to authenticate any workload or person running outside of our service mesh, as well as some open source applications that do not allow anonymous authentication in their S3 blob implementations. The SPIFFE identity authentication is used by all applications inside our mesh. Because mTLS is enabled between all applications in the mesh, our transport layer gives us strong enough security and consistency guarantees that we can drop the request signing and checksumming required by the standard Blob Access Keys authentication. This saves a lot of CPU cycles otherwise spent on cryptography and hashing, and allows us to handle requests in a much more streaming fashion.</p><p>The SPIFFE network identity authentication model allows us to create a zero-trust architecture between all client applications and the Bucket-Gateway. Using SPIFFE network identities for AuthN removes the need to distribute and rotate keys, as this is already done by Istio. 
The Bucket-Gateway takes the SPIFFE identity from any incoming HTTP request and asks our in-house ABAC-style authorization service whether the calling service is allowed to perform the desired S3 action on the given bucket for the specific path.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aBzK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aBzK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 424w, https://substackcdn.com/image/fetch/$s_!aBzK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 848w, https://substackcdn.com/image/fetch/$s_!aBzK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 1272w, https://substackcdn.com/image/fetch/$s_!aBzK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aBzK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png" width="1456" height="605" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aBzK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 424w, https://substackcdn.com/image/fetch/$s_!aBzK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 848w, https://substackcdn.com/image/fetch/$s_!aBzK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 1272w, https://substackcdn.com/image/fetch/$s_!aBzK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2365bf-6d44-4cc6-8ec6-f21804b8b472_1600x665.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Authorization: The Bucket-Gateway calls the Authorization Service to get an authorization verdict for each incoming request.</em></p><p><em>Verdicts are cached for a short time to keep latencies down and guard against DDoS attacks.</em></p><pre><code># The following is an example of an ABAC-style access request

Principal = "postgresql-operator"
Attributes
    namespace = "BLOB"
    action = "GetObject"
    bucket = "postgresql-backups"
    object = "myPgCluster/wal/123.tar"</code></pre><h3>Observability</h3><p>Because the Bucket-Gateway handles all access to blob storage on all our cloud providers, it is a great place to add the required observability. The Bucket-Gateway exposes Prometheus metrics, audit logging, tracing, and all the information needed to attribute costs to any client application. Getting request/response logging at single-blob resolution from the cloud providers themselves is either impossible or extremely expensive, depending on the provider. With the Bucket-Gateway, we get cost attribution all the way down to a single blob: we know exactly which client read or wrote a given blob, when they did, and how often.</p><h3>Tradeoffs</h3><p>Implementing a proxy service like the Bucket-Gateway of course also introduces certain costs. First, some latency is inevitably added to each data request, as it must pass through an extra network hop and an extra service before reaching the actual cloud provider. We have spent a fair amount of time optimizing the Bucket-Gateway by keeping memory allocations low and ensuring that any metadata we need to handle a request is already available in an up-to-date in-memory cache. Compared to the time spent moving large amounts of data into and out of external storage, the extra latency from the Bucket-Gateway is negligible.</p><p>The Bucket-Gateway also introduces an additional point of failure. Apart from rigorous testing, we mitigate this by autoscaling the Bucket-Gateway horizontally across multiple cloud regions, and by isolating heavy data warehouse clients from latency-sensitive end-user traffic.</p><h1>Migration flow</h1><p>The migration flow consists of moving data and changing the source of truth. 
For the actual migration, we need two extra components.</p><h2>The Bucket-Migrator</h2><p>The Bucket-Migrator is responsible for moving the actual data from one backing bucket on any cloud provider to another backing bucket on any other cloud provider.</p><p>It handles the initial backfill (<em>Copy</em>) of the new backing bucket by copying all blobs from the source bucket to the sink bucket, and it synchronizes (<em>Sync</em>) the content of the two backing buckets on demand.</p><h2>The Bucket-Operator</h2><p>The Bucket-Operator looks at the desired specification for a bucket and tries to ensure that the real world matches the spec.</p><p>For a cloud migration, this is done by provisioning a new backing bucket in the right cloud and region, backfilling the new bucket by starting a job in the Bucket-Migrator, and setting up the correct authorization policies for any client on a given bucket.</p><p>During the migration from one backing bucket to another, the Bucket-Operator is responsible for changing permissions on the logical bucket (setting the bucket as read-only at critical times), triggering copy and sync jobs in the Bucket-Migrator, and updating routing information for the logical bucket, 
thereby setting the new bucket as the source of truth.</p><h2>Copy and Sync</h2><p>To keep write-unavailability to an absolute minimum, we implemented the following Copy-and-Sync design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IS8A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IS8A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 424w, https://substackcdn.com/image/fetch/$s_!IS8A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 848w, https://substackcdn.com/image/fetch/$s_!IS8A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 1272w, https://substackcdn.com/image/fetch/$s_!IS8A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IS8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png" width="1456" height="1268" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1268,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IS8A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 424w, https://substackcdn.com/image/fetch/$s_!IS8A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 848w, https://substackcdn.com/image/fetch/$s_!IS8A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 1272w, https://substackcdn.com/image/fetch/$s_!IS8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f815cec-62e5-446c-9422-9c837a688a41_1600x1393.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Migration flow: The Bucket-Operator manages the permissions on the bucket and instructs the Bucket-Migrator to copy and sync blobs.</em></p><h3>The Copy Phase</h3><p>The Copy part of the algorithm makes an initial bulk copy of all blobs from the source bucket to the sink. It lists the source bucket and starts a server-side upload for each blob by signing a URL for the source bucket and instructing the sink bucket to fetch the blob from the signed URL. This ensures that we do not have to pull all the data through our Kubernetes clusters and internal network.</p><p>The Copy phase does the bulk of the work, and its duration depends on the number of files in the bucket. 
The phase is completely asynchronous and has no impact on client workloads, even though it might run for multiple hours or days.</p><h3>The Sync Phase</h3><p>The Sync phase of the migration lists all files in both the source and destination buckets to check whether they are identical. It performs the following series of checks and actions:</p><ul><li><p>Check whether any blobs are missing in the sink</p><ul><li><p>If any blobs are missing in the sink, upload them from the source</p></li></ul></li><li><p>Check whether any blobs exist in the sink that are no longer in the source bucket</p><ul><li><p>If any such blobs exist, delete them</p></li></ul></li><li><p>Check whether the ETags of the sink and source blobs match</p><ul><li><p>If not, re-upload the new version of the file</p></li></ul></li><li><p>Check whether the metadata of the sink and source blobs matches</p><ul><li><p>If not, update the metadata</p></li></ul></li></ul><p>The duration of the Sync phase depends on the number of files in the bucket and the number of changes made since the one-time copy was started. 
During this phase the bucket must be read-only to ensure consistency.</p><p>The actual migration process consists of the following 6 simple steps:</p><ol><li><p>Provision a new backing bucket on the destination cloud provider</p></li><li><p>Initial one-time copy of all blobs from the Source bucket to the Sink bucket</p><ol><li><p>This part is completely asynchronous and has no impact on the availability of the bucket</p></li></ol></li><li><p>Disable writes on the bucket in the Bucket-Gateways</p><ol><li><p>After this the bucket is only read-available.</p></li></ol></li><li><p>Sync blobs that have changed or been deleted since the Copy phase</p><ol><li><p>The length of this step is proportional to the number of blobs changed and deleted since the Copy phase.</p></li><li><p>To keep this step as short as possible, we schedule it for a time of day/week when the given bucket has the least activity.</p></li><li><p>The sync can also be run multiple times without problems. This means that for high-throughput buckets, we can run the sync once, then disable writes, and then run the sync again. In this case the last sync only needs to handle the blobs updated while the first sync was running.</p></li></ol></li><li><p>Update the routing information for the logical bucket in the Bucket-Gateways to route all requests to the Sink bucket.</p></li><li><p>Re-enable writes in the Bucket-Gateway for the logical bucket.</p><ol><li><p>The bucket is now completely migrated and write-available again.</p></li></ol></li></ol><p>For some clients we were able to skip the Sync phase completely by pausing client ingestion pipelines, such as Flink and Spark jobs or frontend build pipelines.</p><h1>The End Result</h1><p>We created a cloud-agnostic interface and a new Blob Storage architecture with horizontally scalable components, including the Bucket-Gateway, Bucket-Operator, and Bucket-Migrator. 
The Bucket-Gateway handles S3-compatible requests, routing, authentication, authorization, and observability. The Bucket-Migrator orchestrates data migration using a Copy-and-Sync design, ensuring minimal write unavailability and no read unavailability.</p><p>We successfully onboarded all our stakeholders to the new Bucket-Gateway, replacing direct use of the cloud providers&#8217; proprietary APIs. Using the tooling described above, we migrated hundreds of buckets with petabytes of data, from multiple regions and environments, to our new home cloud. Most clients experienced only a few seconds or minutes of write unavailability, and no clients experienced read unavailability.</p><p>This new architecture also enables an interesting set of additional features. Later iterations of the Bucket-Gateway include:</p><ul><li><p>Sharding buckets across Storage Accounts in Azure to remove Azure throughput limits.</p></li><li><p>Implementing shared in-cluster caches to avoid reading the same file multiple times from different services.</p></li><li><p>Multi-region/multi-cloud buckets with replicated data for local reads and disaster recovery/business continuity.</p></li></ul><p>Read more about how we implemented those in the next Cloudless Blob post.</p><p><em>Cover photo by <a href="https://unsplash.com/@carloshorton?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Carlos Horton</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Otter Assistant: LLM Support Agent]]></title><description><![CDATA[How to build an LLM support agent that users appreciate]]></description><link>https://techblog.atoms.co/p/llm-support-agent</link><guid isPermaLink="false">https://techblog.atoms.co/p/llm-support-agent</guid><pubDate>Mon, 07 Jul 2025 15:30:56 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!124h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!124h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!124h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 424w, https://substackcdn.com/image/fetch/$s_!124h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 848w, https://substackcdn.com/image/fetch/$s_!124h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!124h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!124h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png" width="1456" height="1059" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1059,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!124h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 424w, https://substackcdn.com/image/fetch/$s_!124h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 848w, https://substackcdn.com/image/fetch/$s_!124h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!124h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774a6108-bb5a-4997-b029-52af1fdfcced_1600x1164.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by LV Lu (Data Science), Rob Harell (Product), and Brian Attwell.</em></p><p>An exceptional customer support experience is the cornerstone of a lasting customer relationship. This is why we built Otter Assistant, our in-house Gen-AI chatbot that currently handles ~50% of inbound customer requests quickly without human intervention. This post describes our journey over the past year building and scaling Otter Assistant.</p><h2>About Otter</h2><p>Otter is a delivery-native restaurant hardware and software suite used to manage restaurant operations and aggregate &amp; derive insights from restaurant data. Each of its product lines has many features, with many integrations. 
And restaurants want a lot of customizability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a-On!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a-On!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 424w, https://substackcdn.com/image/fetch/$s_!a-On!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 848w, https://substackcdn.com/image/fetch/$s_!a-On!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!a-On!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a-On!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png" width="1456" height="864" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/167407782?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a-On!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 424w, https://substackcdn.com/image/fetch/$s_!a-On!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 848w, https://substackcdn.com/image/fetch/$s_!a-On!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!a-On!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd09cc2c3-1653-4f57-817c-695cc4f6eb59_2434x1444.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Some of Otter&#8217;s features</figcaption></figure></div><p>A broad feature set creates customer demand for support. These customers appreciate the speed and reliability that Otter Assistant can deliver over traditional call center agents, as long as we continue to offer the option to contact a 24/7 human agent if preferred.</p><h2>Build vs Buy</h2><p>In Q1 2024, we started by analyzing the distribution of customer issues and found that resolving these tickets required deep integration with our systems. For example, a support agent has tightly controlled permission to review a customer&#8217;s menu, update their account, or tweak remote printer configuration. 
At the time, no vendor offered the level of integration flexibility we needed without hard-coded decision trees like the one shown in the image below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XWvH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XWvH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 424w, https://substackcdn.com/image/fetch/$s_!XWvH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 848w, https://substackcdn.com/image/fetch/$s_!XWvH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 1272w, https://substackcdn.com/image/fetch/$s_!XWvH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XWvH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png" width="1456" height="1223" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1223,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224580,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/167407782?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XWvH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 424w, https://substackcdn.com/image/fetch/$s_!XWvH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 848w, https://substackcdn.com/image/fetch/$s_!XWvH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 1272w, https://substackcdn.com/image/fetch/$s_!XWvH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f326b4e-9959-42e9-a7a9-b6b54bff1174_2320x1948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Small part of a Zendesk decision tree</figcaption></figure></div><p>When we began to experiment, we found that LLMs significantly reduced much of the historical value these vendors provided (Zendesk, for example, has great features for configuring workflows through a UI, NLP-based intent matching, etc.). With less need for this vendor infrastructure, we could largely focus on domain-specific problem solving when building our agent.</p><h2>Bot architecture</h2><p>The bot system and workflows span an online conversation flow and an offline management flow. 
The next section covers the primary components of each.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JbBz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JbBz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 424w, https://substackcdn.com/image/fetch/$s_!JbBz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 848w, https://substackcdn.com/image/fetch/$s_!JbBz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 1272w, https://substackcdn.com/image/fetch/$s_!JbBz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JbBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png" width="1456" height="821" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:368057,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/167407782?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JbBz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 424w, https://substackcdn.com/image/fetch/$s_!JbBz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 848w, https://substackcdn.com/image/fetch/$s_!JbBz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 1272w, https://substackcdn.com/image/fetch/$s_!JbBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b6d5228-7ead-44ec-9b37-8f2a840642ce_2218x1250.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Conversation flow</h2><p>When we began implementation in Q2 of &#8217;24, the term &#8220;<a href="https://www.deeplearning.ai/the-batch/llms-evolve-with-agentic-workflows-enabling-autonomous-reasoning-and-collaboration/">agentic</a>&#8221; had yet to catch on. However, by setting out to emulate the diagnostic and resolution steps taken by our human support agents, we naturally followed an agentic approach. 
Concretely, we designed the bot to use <a href="https://platform.openai.com/docs/guides/function-calling?api-mode=chat">function calling</a> and mimic how human support agents work:</p><ol><li><p>Based on the customer request, look up the corresponding predefined procedure</p></li><li><p>If there is one, follow its steps</p></li><li><p>If not, research the issue in the knowledge base</p></li><li><p>If it runs into issues or cannot find the information, escalate</p></li></ol><p>To accomplish this, we designed four main types of functions, one for each of the essential support tasks above: <strong>GetRunbook</strong>, <strong>API &amp; Widget Functions</strong>, <strong>Research</strong>, and <strong>EscalateToHuman</strong>.</p><h4>GetRunbook function - injects relevant issue resolution steps</h4><p>After evaluating support issues by volume and resolution complexity, we selected the highest-volume issues of low-to-medium complexity to translate into bot &#8220;runbooks&#8221;. These runbooks tell the bot which steps to take to diagnose the user&#8217;s issue statement and tie it to a known root cause, and then to either make the necessary API calls to resolve the issue or surface the corresponding solution &#8220;widget&#8221; (both are covered below). Conceptually, the runbooks function like a decision tree, but unlike the previous generation of bot tech, they can be written in plain text, making them 1) significantly easier to implement and maintain and 2) more modular and traversable during runtime diagnosis.</p><p>Mechanically, the <strong>GetRunbook</strong> function takes the user issue description as input and outputs a corresponding runbook if it can find one. Otherwise it returns &#8220;Not Found&#8221;. Under the hood, we use an LLM for intent matching and runbook retrieval. 
We embed the user issue description and retrieve relevant runbooks from our runbook repository (a vector DB) based on semantic similarity [<a href="https://medium.com/@vladris/embeddings-and-vector-databases-732f9927b377">Ref</a>]. A separate LLM call then picks the correct runbook from the candidates, or returns &#8220;Not Found&#8221; if there is no good match.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OK0-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OK0-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 424w, https://substackcdn.com/image/fetch/$s_!OK0-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 848w, https://substackcdn.com/image/fetch/$s_!OK0-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 1272w, https://substackcdn.com/image/fetch/$s_!OK0-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OK0-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png" width="1456" height="687" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OK0-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 424w, https://substackcdn.com/image/fetch/$s_!OK0-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 848w, https://substackcdn.com/image/fetch/$s_!OK0-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 1272w, https://substackcdn.com/image/fetch/$s_!OK0-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3aa5c4-8955-4206-8912-469b0d0d4354_1600x755.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If there&#8217;s a match, the LLM works through the listed runbook steps, gathering follow-up information from the user and/or executing API calls as needed until it reaches the end.</p><h4>API call functions - retrieves customer data for diagnosis</h4><p>As the bot works through a runbook, it can choose from a list of API wrapper functions to gather information (e.g. fetch store status) and/or modify a user&#8217;s account. Fortunately, we were able to largely reuse pre-existing API calls within the Otter ecosystem.</p><p>Critically, to avoid leaking user data to anyone who shouldn&#8217;t have access, we call internal backend APIs with the user token that is passed along with each Otter Assistant service request. This way, we reuse the existing permission control models and auth infrastructure.</p><h4>Widget functions - performs action with user confirmation</h4><p>After the bot has identified the root cause, it takes the appropriate action to address the issue. 
With the exception of simple account modifications, we present most write operations within &#8220;widgets&#8221;, i.e. embedded UI modules. For example, customers requesting to pause their store are presented with the following widget:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p5hf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p5hf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 424w, https://substackcdn.com/image/fetch/$s_!p5hf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 848w, https://substackcdn.com/image/fetch/$s_!p5hf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 1272w, https://substackcdn.com/image/fetch/$s_!p5hf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p5hf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png" width="233" height="453.7368421052632" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:399,&quot;resizeWidth&quot;:233,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p5hf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 424w, https://substackcdn.com/image/fetch/$s_!p5hf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 848w, https://substackcdn.com/image/fetch/$s_!p5hf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 1272w, https://substackcdn.com/image/fetch/$s_!p5hf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2ba6e94-3a9a-412f-930c-b28f21509e6c_399x777.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Otter Assistant Store Pause/Unpause Widget</figcaption></figure></div><p>Widgets provide the following benefits:</p><ol><li><p>Encapsulation/reuse</p></li><li><p>Distributed ownership</p></li><li><p>Information density</p></li><li><p>Easy for the user to confirm, mitigating the risk of hallucination</p></li></ol><p>When the bot decides it&#8217;s appropriate to surface a widget, it calls the corresponding widget function (informing the LLM that a widget is being displayed) and simultaneously emits a notification to the external chat UI. The chat UI then renders the widget UI within the message.</p><p>For any critical write operation (e.g. pause store, update price), we require user review and an explicit click to confirm before kicking off the operation. We strictly follow this approach to mitigate risk from LLM hallucination (e.g. incorrectly assuming that the user wants to pause a store). 
This approach also gives users a quick way to correct the inputs to the write operation if the LLM got details wrong.</p><h4>Research function - finds answers in knowledge base</h4><p>The research function retrieves and summarizes helpful answers to user questions that don&#8217;t match a runbook. We designed it to mimic how humans find answers in our online <a href="https://helpdesk.tryotter.com/hc/en-us">help articles</a>: run a search with a question, then read through the top results to arrive at a final answer.</p><p>To implement this flow, we first convert the help articles in Otter&#8217;s knowledge base to embeddings offline using an LLM and store them in a vector db. Then, when we receive a request, we convert the user&#8217;s question to an embedding and retrieve the most relevant articles from the vector db by semantic similarity. Next, we issue an LLM request for each top article to find relevant answers to the user&#8217;s question. The process stops once it has either found <em>n</em> answers or gone through <em>m</em> results (both configurable parameters). 
Lastly, we issue a separate LLM call to combine the answers into a final answer, which is returned as the function response.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TBij!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TBij!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 424w, https://substackcdn.com/image/fetch/$s_!TBij!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 848w, https://substackcdn.com/image/fetch/$s_!TBij!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 1272w, https://substackcdn.com/image/fetch/$s_!TBij!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TBij!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png" width="1456" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3f06122-a930-49b4-9630-12b157b2b164_1600x717.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TBij!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 424w, https://substackcdn.com/image/fetch/$s_!TBij!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 848w, https://substackcdn.com/image/fetch/$s_!TBij!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 1272w, https://substackcdn.com/image/fetch/$s_!TBij!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f06122-a930-49b4-9630-12b157b2b164_1600x717.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>EscalateToHuman function - hands off to human agent</h4><p>This function gives the LLM a way to signal that the conversation should be escalated to a human agent. When the LLM detects a user&#8217;s intent to escalate, we tell the chat message interface to pass conversation control to the assigned human agent, which in turn calls Zendesk to connect to a live agent.</p><h2>Bot management</h2><p>The aforementioned components cover Otter Assistant&#8217;s core conversational capabilities. However, as with any software, robust testing and management processes are needed to ensure the bot works at scale. 
Unlike traditional software, though, the inherent randomness and unpredictability of the LLM-powered conversational flow called for a bespoke set of tools to serve this need:</p><ol><li><p>Local development and playground</p></li><li><p>Bot validation testing</p></li><li><p>Bot conversation review &amp; analytics</p></li></ol><h3>Local development and playground</h3><p>Given the stochastic nature of LLMs and the multi-modal nature of Otter Assistant conversations (which encompass both text and bot actions/widgets), developers require a chat simulator for effective debugging. To facilitate this, we developed a <a href="https://docs.streamlit.io/develop/api-reference/chat">Streamlit</a>-based library. This library lets developers interact with the bot through a web UI that displays the input and output arguments of each function call, so they can verify that the bot&#8217;s end-to-end flow is correct.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JGQb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JGQb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 424w, https://substackcdn.com/image/fetch/$s_!JGQb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 848w, 
https://substackcdn.com/image/fetch/$s_!JGQb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!JGQb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JGQb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png" width="486" height="570.9251101321586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1600,&quot;width&quot;:1362,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JGQb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 424w, https://substackcdn.com/image/fetch/$s_!JGQb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 848w, 
https://substackcdn.com/image/fetch/$s_!JGQb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!JGQb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb78b10-4560-480b-b537-ac1544d0e682_1362x1600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Bot validation testing</h3><p>After confirming new capabilities work in the development environment, we pass the bot through a 
round of validation testing. Given the randomness inherent in LLM systems, it often takes multiple conversation runs to expose and verify specific bot behaviors, which is time-consuming if done manually. Moreover, changing prompt logic in one place can cause unanticipated behavior changes elsewhere that are difficult to detect. These challenges exceed what traditional software testing frameworks can handle, since those rely on deterministic execution and structured output.</p><p>Therefore, we developed a new test and evaluation framework for Otter Assistant, applicable to any other chatbot, which works as follows:</p><ol><li><p>Predefine a set of test scenarios, e.g. the customer&#8217;s store is paused</p></li><li><p>For each test scenario, define a list of expected behaviors, e.g. confirm which store, check status, then launch widget</p></li><li><p>Launch an LLM-powered chatbot that plays the customer and chats with our bot</p></li><li><p>Use an LLM as a judge to assert on the expected behaviors, based on the conversation transcript between the bot and the customer</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dXaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dXaU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 424w, https://substackcdn.com/image/fetch/$s_!dXaU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 848w, 
https://substackcdn.com/image/fetch/$s_!dXaU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 1272w, https://substackcdn.com/image/fetch/$s_!dXaU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dXaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png" width="1456" height="726" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:726,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dXaU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 424w, https://substackcdn.com/image/fetch/$s_!dXaU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 848w, 
https://substackcdn.com/image/fetch/$s_!dXaU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 1272w, https://substackcdn.com/image/fetch/$s_!dXaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322abcb0-5b40-4271-a72e-8ab5a0222c84_1600x798.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>With the above framework, we can evaluate chatbots through a mechanism that is similar to traditional unit tests, where we define a 
set of inputs and assert on the expected output for each.</p><h3>Bot conversation review &amp; analytics</h3><p>Once the bot has been validated, it can be deployed with a reasonable degree of certainty that it will behave as expected. But then comes the question: how is it performing? To answer this, we defined and instrumented a &#8220;resolution&#8221; metric. This metric tells us the bot&#8217;s overall performance and, in turn, the business impact it generates, and lets us identify issues and improvement opportunities.</p><p>Bot issue analysis presents challenges beyond those of error identification and resolution in traditional software development. Concretely, bots can err in many ways, at either the software layer or the model layer, and it&#8217;s impossible to know which without manual inspection. To streamline this conversation review process, we built a conversation inspector tool in Streamlit that lets reviewers load any past conversation and view the chat history and action logs, much like the local testing app:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!42kx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!42kx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 424w, https://substackcdn.com/image/fetch/$s_!42kx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 848w, 
https://substackcdn.com/image/fetch/$s_!42kx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!42kx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!42kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png" width="520" height="439.2857142857143" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/342a300b-6052-434b-b061-968661a8502e_1600x1352.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1230,&quot;width&quot;:1456,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!42kx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 424w, https://substackcdn.com/image/fetch/$s_!42kx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 848w, 
https://substackcdn.com/image/fetch/$s_!42kx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!42kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F342a300b-6052-434b-b061-968661a8502e_1600x1352.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This tool is available to both Otter developers and non-developers, which has helped scale our efforts to investigate issues and 
suggest improvements.</p><h2>Lessons from running a bot initiative</h2><p>When we began implementing Otter Assistant last year, there were no established bot guidelines or frameworks. Though frameworks have since begun to emerge (e.g. OpenAI&#8217;s <a href="https://openai.github.io/openai-agents-python/">Agents SDK</a>) and solution providers continue to enhance their offerings, we still feel that building in-house has proved to be the right decision. Other organizations should weigh their build-vs-buy decision according to their capabilities and the degree of control and customization they foresee needing for their use cases.</p><p>Beyond build-vs-buy, the most important takeaway from the development and launch of Otter Assistant has been how much defensible, actionable success metrics matter to the project&#8217;s overall success. These metrics have proved instrumental in convincing ourselves of the bot&#8217;s value to Otter and its users, and in establishing a feedback loop to improve the bot over time.</p><p>Lastly, Otter Assistant (specifically, the high-fidelity conversational feedback it generates) has exposed multiple product and platform issues previously lurking undetected in Otter systems. We&#8217;ve thus incorporated bot-derived feedback into our product strategy alongside traditional sources such as user interviews and competitive analysis.</p><h2>Next steps</h2><p>After close to one year of development, Otter Assistant resolves roughly half of support requests autonomously without compromising customer satisfaction. 
In future posts, we will share more of the lessons we learned about prompt engineering, as well as the best practices we found for designing and structuring functions.</p><p>While it is great to see existing LLM frameworks already delivering value to our customers and unlocking use cases that weren't possible before, in certain scenarios we have started to hit limits on how much we can improve without more fundamental improvements to the LLMs themselves. We are therefore exploring how to establish a more efficient feedback loop so the bot can become smarter over time on its own.</p><p>Looking ahead, we think this is just the beginning of a new era for product design and development. At CSS, we believe agentic chatbots can hugely elevate customer experience. Handling customer support requests is just a starting point!</p><h1>Appendix: Q1 2024 Vendor Comparison</h1><p>After categorizing support tickets, we identified the following key requirements for our chatbot.</p><ol><li><p>LLM-native: no reliance on hard-coded decision trees to define bot logic</p></li><li><p>Ability to choose the underlying model(s) and control prompt text</p></li><li><p>Ability to update user accounts (stores, menus, orders, printers, etc.) via API function calls, while adhering to Otter&#8217;s access controls and permissions as a guardrail on those updates</p></li><li><p>Ability to seamlessly escalate from bot to human within a single chat window</p></li></ol><p>With these key requirements in mind, we evaluated third-party solutions while simultaneously developing an internal prototype Q&amp;A bot that performed RAG on our existing support knowledge base. 
Below was our comparison.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zayO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zayO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 424w, https://substackcdn.com/image/fetch/$s_!zayO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 848w, https://substackcdn.com/image/fetch/$s_!zayO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 1272w, https://substackcdn.com/image/fetch/$s_!zayO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zayO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png" width="576" height="246.50428816466552" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1166,&quot;resizeWidth&quot;:576,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zayO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 424w, https://substackcdn.com/image/fetch/$s_!zayO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 848w, https://substackcdn.com/image/fetch/$s_!zayO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 1272w, https://substackcdn.com/image/fetch/$s_!zayO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9fe3f1-c0f5-4aa3-9ad7-f6eabf1b5c8d_1166x499.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Vendor options as of Q1 &#8217;24</em></figcaption></figure></div><p>Established vendors&#8217; products primarily featured hard-coded decision trees, and those vendors were still working out their LLM product strategy. At the other end of the spectrum, we spoke with several startups building LLM-native chatbots for support, but didn&#8217;t find one we believed could manage the complexity of the resolution steps our top issues require. We thus decided to build our own in-house bot back end as our MVP solution, pairing it with Zendesk&#8217;s Sunco Web SDK front end to minimize time to market (we have since replaced that with our own custom front end).</p>]]></content:encoded></item><item><title><![CDATA[AKS Spot Nodes Harm Nearby Workloads]]></title><description><![CDATA[Why adding Spot nodes to your Azure Kubernetes Service (AKS) cluster will cause user-facing timeouts, even if your services don&#8217;t run on them. 
And what you can do to fix it.]]></description><link>https://techblog.atoms.co/p/aks-spot-nodes-harm-nearby-workloads</link><guid isPermaLink="false">https://techblog.atoms.co/p/aks-spot-nodes-harm-nearby-workloads</guid><pubDate>Tue, 03 Jun 2025 17:00:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7Xol!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Xol!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Xol!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7Xol!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7Xol!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7Xol!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7Xol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7Xol!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7Xol!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7Xol!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7Xol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f97bd-cd80-4147-8c79-6d4849bcabdb_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by Wenpin Cui, Shao Liu, and Alex Filipchik, members of the engineering teams that work on infrastructure.</em></p><p>For months, small spikes of unexplained 520/524 errors sat quietly on our dashboards. They never exceeded 0.03% and were easy to ignore. Nobody complained, so the Infrastructure team put them in the <em>we-will-get-it-done-eventually</em> (read: never) bucket. However, even minor issues can signal deeper problems. 
When one of these elusive errors unexpectedly disrupted a key customer experience, it quickly escalated from a rare anomaly to a critical infrastructure investigation, leading to very intriguing discoveries.</p><h1>Chasing Ghosts in the Infrastructure</h1><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EY1h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EY1h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 424w, https://substackcdn.com/image/fetch/$s_!EY1h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 848w, https://substackcdn.com/image/fetch/$s_!EY1h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 1272w, https://substackcdn.com/image/fetch/$s_!EY1h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EY1h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EY1h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 424w, https://substackcdn.com/image/fetch/$s_!EY1h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 848w, https://substackcdn.com/image/fetch/$s_!EY1h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 1272w, https://substackcdn.com/image/fetch/$s_!EY1h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F739bc02e-741d-49cc-bab0-e99f622da685_1600x400.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>Once we knew that those disruptions were indeed a real problem, we jumped straight to our logs, expecting clear answers. 
At CloudKitchens, we run most of our web applications on Azure AKS, fronted by Cloudflare (a fairly solid stack) and a very simple proxy (Ingress Gateway), as illustrated in the diagram above. Our first instinct was therefore to look for possible issues in either vendor&#8217;s product (bugs) or in our own infrastructure (like missing health checks or improper shutdown sequences). Surprisingly, we found nothing suspicious. The next suspect was our service mesh. We inspected all the logs and metrics we collected, but again came up empty.</p><p>We realized we needed deeper instrumentation, so we started capturing Ray IDs, the per-request identifiers that Cloudflare attaches to each request in the CF-Ray header. The next time the errors hit, we&#8217;d have a fingerprint we could trace through every layer.</p><p>When the next spike came, we caught it immediately, only to find something even stranger. The Ray IDs of the failed requests were missing from every log inside our cluster. The requests weren&#8217;t just failing; they never even made it into Kubernetes.</p><p>The mystery deepened: our infrastructure seemed innocent. 
Was the problem actually upstream, somewhere in Azure&#8217;s managed load balancing?</p><h2>Ray ID: How to</h2><p>If you are wondering how to capture Ray IDs yourself (assuming you are using Istio as well), here is the configuration example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p2nl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p2nl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 424w, https://substackcdn.com/image/fetch/$s_!p2nl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 848w, https://substackcdn.com/image/fetch/$s_!p2nl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 1272w, https://substackcdn.com/image/fetch/$s_!p2nl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p2nl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png" width="1456" height="487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122782,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/164257754?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p2nl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 424w, https://substackcdn.com/image/fetch/$s_!p2nl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 848w, https://substackcdn.com/image/fetch/$s_!p2nl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 1272w, https://substackcdn.com/image/fetch/$s_!p2nl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318e4571-5dd9-4846-8cd7-a104b20b3f56_1497x501.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y7s8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y7s8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 424w, 
https://substackcdn.com/image/fetch/$s_!y7s8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 848w, https://substackcdn.com/image/fetch/$s_!y7s8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 1272w, https://substackcdn.com/image/fetch/$s_!y7s8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y7s8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png" width="1456" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65379,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/164257754?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!y7s8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 424w, https://substackcdn.com/image/fetch/$s_!y7s8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 848w, https://substackcdn.com/image/fetch/$s_!y7s8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 1272w, https://substackcdn.com/image/fetch/$s_!y7s8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0420632d-616f-4365-ac96-29ec38d425f7_1502x516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Diving Deep Into Azure AKS Internals</h1><p>At this point, we knew the packets were disappearing somewhere upstream, but where? We had to figure out exactly how Azure&#8217;s Load Balancer interacts with AKS services. Initially, we assumed that the Azure Load Balancer picked a healthy pod to route each request to, perhaps randomly or in round-robin fashion. That seemed logical enough, but as it turned out, it isn't how it works at all.</p><p>Confused? We were, too. So we dug deeper and realized that Azure Load Balancer doesn't route directly to pods. In fact, it has no knowledge of Kubernetes pods at all: it doesn't know which node hosts which pod, or which pod belongs to which service.</p><p>In reality, each AKS cluster automatically receives two Azure Load Balancers: one for external and one for internal traffic. Every Kubernetes Service of type LoadBalancer that we create is mapped to the external Azure LB, which comprises several key components:</p><ul><li><p>Frontend IP: The public IP address for ingress traffic. Each frontend IP maps to a Kubernetes LoadBalancer service.</p></li><li><p>Backend pool: The group of VMs or Virtual Machine Scale Set instances serving requests. AKS creates a backend pool containing all Kubernetes nodes. Crucially, this means traffic is routed to nodes rather than pods, and not just to the nodes running the service, but to every node in the AKS cluster.</p></li><li><p>Load-balancing rules: Define how incoming traffic is distributed across backend pool instances. 
Each rule maps a frontend IP configuration and port to multiple backend IP addresses and ports.</p></li></ul><p>Thus, every request makes two hops to reach our pods:</p><ol><li><p>Azure LB &#8594; Node: Azure LB picks a random healthy node from the entire cluster.</p></li><li><p>Node &#8594; Pod: The node relies on iptables rules configured by kube-proxy to forward traffic to one of the ready endpoints associated with the LoadBalancer service, which could be on a different node. iptables also relies on the conntrack module to remember this mapping, so every packet of a given TCP connection is routed to the same pod.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Msmp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Msmp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 424w, https://substackcdn.com/image/fetch/$s_!Msmp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 848w, https://substackcdn.com/image/fetch/$s_!Msmp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 1272w, https://substackcdn.com/image/fetch/$s_!Msmp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Msmp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png" width="1456" height="881" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Msmp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 424w, https://substackcdn.com/image/fetch/$s_!Msmp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 848w, https://substackcdn.com/image/fetch/$s_!Msmp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 1272w, https://substackcdn.com/image/fetch/$s_!Msmp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904c121a-8658-4f46-9e2f-124c8b594334_1600x968.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Here is a detailed example of the configuration in question:</p><p>Given a service that we want to expose externally:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F44A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F44A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 424w, 
https://substackcdn.com/image/fetch/$s_!F44A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 848w, https://substackcdn.com/image/fetch/$s_!F44A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 1272w, https://substackcdn.com/image/fetch/$s_!F44A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F44A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png" width="1456" height="667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/164257754?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!F44A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 424w, https://substackcdn.com/image/fetch/$s_!F44A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 848w, https://substackcdn.com/image/fetch/$s_!F44A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 1272w, https://substackcdn.com/image/fetch/$s_!F44A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22e90c6-3c40-4f49-93c1-62e7604edf1e_1497x686.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The Azure Load Balancer holds a corresponding frontend IP configuration whose public IP matches .status.loadBalancer.ingress[0].ip of the Kubernetes service.</p><p>A load balancing rule then directs traffic from that frontend IP to the backend pool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UE5q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UE5q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 424w, https://substackcdn.com/image/fetch/$s_!UE5q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 848w, https://substackcdn.com/image/fetch/$s_!UE5q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!UE5q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UE5q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png" width="1310" height="1600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1600,&quot;width&quot;:1310,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UE5q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 424w, https://substackcdn.com/image/fetch/$s_!UE5q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 848w, https://substackcdn.com/image/fetch/$s_!UE5q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!UE5q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897405a-29eb-47dc-a680-ba3574f90acf_1310x1600.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>In the example rule above, Azure Load Balancer configures a frontend IP, a backend pool, and a health probe. The load balancer only directs traffic to healthy nodes, and the health probe configured in this rule determines node health. In this example, by default, Azure checks for a successful TCP connection on port 31260 on each node in the cluster. 
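</p><p>For readers who cannot view the screenshots, a Service of this shape might look roughly like the sketch below. Only port 31260 comes from the example above. The name, namespace, selector, and the ports 443 and 8443 are hypothetical placeholders, and <strong>externalTrafficPolicy: Cluster</strong> is written out even though it is the Kubernetes default.</p><pre><code>apiVersion: v1
kind: Service
metadata:
  name: example-gateway           # hypothetical name
  namespace: example-namespace    # hypothetical namespace
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster  # Kubernetes default, shown for clarity
  selector:
    app: example-gateway          # hypothetical pod label
  ports:
    - name: https
      protocol: TCP
      port: 443                   # hypothetical port exposed on the frontend IP
      targetPort: 8443            # hypothetical container port
      nodePort: 31260             # the port probed in the example above
</code></pre><p>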
This port is a <a href="https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport">NodePort</a>, meaning it's open on all AKS cluster nodes, ensuring successful health probes for healthy nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oin6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oin6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 424w, https://substackcdn.com/image/fetch/$s_!oin6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 848w, https://substackcdn.com/image/fetch/$s_!oin6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 1272w, https://substackcdn.com/image/fetch/$s_!oin6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oin6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png" width="1456" height="642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oin6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 424w, https://substackcdn.com/image/fetch/$s_!oin6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 848w, https://substackcdn.com/image/fetch/$s_!oin6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 1272w, https://substackcdn.com/image/fetch/$s_!oin6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f5dbe2-f122-4749-98b6-48a69233c9db_1600x706.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h1>Why do requests vanish?</h1><p>You might be asking: <em>What actually happens when a Kubernetes node suddenly goes away?<br></em>The answer is simpler, and worse, than you might think: timeouts.</p><p>To understand why, let's look closely at Azure Load Balancer&#8217;s behavior. Azure determines node health through periodic health probes (TCP checks) every five seconds. 
If a node abruptly fails, such as during a spot node eviction, the Azure Load Balancer may still send traffic to it for several seconds, until the VM is eventually removed from the serving pool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xp9S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xp9S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 424w, https://substackcdn.com/image/fetch/$s_!Xp9S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 848w, https://substackcdn.com/image/fetch/$s_!Xp9S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!Xp9S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xp9S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png" width="1456" height="914" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:914,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xp9S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 424w, https://substackcdn.com/image/fetch/$s_!Xp9S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 848w, https://substackcdn.com/image/fetch/$s_!Xp9S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!Xp9S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c46f76-deda-40c2-962e-0c0d6da5b2a1_1600x1004.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>During that brief window, packets sent to these unhealthy nodes vanish silently, never reaching your pods. From our customers' perspective, this manifests as unexplained 520/524 errors from Cloudflare, depending on the exact TCP connection state when the node disappeared.</p><p>But why doesn't AKS handle this better? Ideally, nodes should gracefully terminate, actively closing connections and immediately signaling to Azure Load Balancer to stop routing traffic their way. Unfortunately, AKS doesn't currently handle node termination gracefully. 
We suspect that flaws in this termination process prevent AKS from promptly deregistering nodes from the load balancer backend pool.</p><p>As a result, even healthy services running exclusively on stable nodes are vulnerable to intermittent errors caused by unrelated node failures, especially when spot VMs are churning.</p><h1>Mitigation: a workaround with trade-offs</h1><p>With clarity on the root cause, the next logical step was to find a solution. We quickly realized that Azure Load Balancer doesn't allow us to explicitly exclude spot nodes, but Kubernetes itself provides an important configuration option: <strong>externalTrafficPolicy</strong>. This policy controls how incoming external traffic is distributed across Kubernetes nodes.</p><h2>Default mode: Cluster</h2><p>By default, Kubernetes sets <strong>externalTrafficPolicy</strong> to <strong>Cluster</strong>. In this mode, the Azure Load Balancer forwards traffic evenly across <strong>all nodes</strong> in your AKS cluster, regardless of whether a given node hosts pods for the service.</p><p>This design aims to spread load evenly, but we discovered a critical drawback: when unrelated nodes (such as spot instances) fail abruptly, Azure continues to route traffic to these failing nodes until their health checks fail, leading to intermittent packet drops and 520/524 errors.</p><h2>Alternative: Local mode</h2><p>The alternative is to set <strong>externalTrafficPolicy</strong> to <strong>Local</strong>. 
This option addresses the problem by instructing the Azure Load Balancer to forward traffic only to nodes that are actively running pods associated with the LoadBalancer service.</p><p>As explained in <a href="https://learn.microsoft.com/en-us/azure/aks/load-balancer-standard#local-traffic-policy">Azure's official AKS documentation</a>:</p><blockquote><p><em>"With the Local traffic policy enabled, the load balancer health probes automatically detect which nodes are running the pod for a given service and only send traffic to those nodes."</em></p></blockquote><p>This means spot nodes, or any other unrelated nodes, never receive traffic for services whose pods they do not host.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uasj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uasj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 424w, https://substackcdn.com/image/fetch/$s_!uasj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 848w, https://substackcdn.com/image/fetch/$s_!uasj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 1272w, 
https://substackcdn.com/image/fetch/$s_!uasj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uasj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png" width="1456" height="724" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uasj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 424w, https://substackcdn.com/image/fetch/$s_!uasj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 848w, https://substackcdn.com/image/fetch/$s_!uasj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 1272w, 
https://substackcdn.com/image/fetch/$s_!uasj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1019cb-315e-47c1-985d-bac79b54bce9_1600x796.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here's how we configured our Kubernetes service in Local mode:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!VAL9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VAL9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 424w, https://substackcdn.com/image/fetch/$s_!VAL9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 848w, https://substackcdn.com/image/fetch/$s_!VAL9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 1272w, https://substackcdn.com/image/fetch/$s_!VAL9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VAL9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png" width="1337" height="569" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:1337,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74625,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.cloudkitchens.com/i/164257754?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VAL9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 424w, https://substackcdn.com/image/fetch/$s_!VAL9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 848w, https://substackcdn.com/image/fetch/$s_!VAL9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 1272w, https://substackcdn.com/image/fetch/$s_!VAL9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35aaf1e2-32a3-4a3b-a246-97953cf1a38e_1337x569.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Most Azure Load Balancer configurations remain unchanged, with one key difference: the health probe switches from TCP to HTTP. Kubernetes manages this transition, ensuring only nodes hosting Istio Ingress Gateway pods pass the health checks. 
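</p><p>As a rough illustration of why the probe change matters, here is a minimal Python sketch that models which nodes stay in the load balancer rotation under each policy. The node names, the <code>pods_per_node</code> mapping, and the <code>nodes_in_rotation</code> helper are all hypothetical, and the model deliberately ignores probe intervals and timing:</p><pre><code>
```python
def nodes_in_rotation(policy, pods_per_node):
    """Return the nodes that would receive traffic from the load balancer.

    policy: "Cluster" or "Local" (the Service's externalTrafficPolicy).
    pods_per_node: hypothetical mapping of node name to the number of
    ingress gateway pods currently running on that node.
    """
    if policy == "Cluster":
        # Every node is a backend; a node may forward to a pod elsewhere.
        return sorted(pods_per_node)
    if policy == "Local":
        # Only nodes with at least one local gateway pod pass the HTTP probe.
        return sorted(node for node, pods in pods_per_node.items() if pods != 0)
    raise ValueError("unknown externalTrafficPolicy: " + policy)


# Hypothetical cluster: two on-demand nodes run gateways, two spot nodes do not.
cluster = {"ondemand-1": 1, "ondemand-2": 1, "spot-1": 0, "spot-2": 0}

print(nodes_in_rotation("Cluster", cluster))
print(nodes_in_rotation("Local", cluster))
```
</code></pre><p>In this toy model, the spot nodes drop out of the rotation under Local simply because they host no gateway pods, so evicting them no longer touches the ingress path.</p><p>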
Because the second hop (node to pod) occurs locally in Local mode, latency is also reduced compared to Cluster mode.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!akia!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!akia!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 424w, https://substackcdn.com/image/fetch/$s_!akia!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 848w, https://substackcdn.com/image/fetch/$s_!akia!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 1272w, https://substackcdn.com/image/fetch/$s_!akia!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!akia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png" width="1456" height="703" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!akia!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 424w, https://substackcdn.com/image/fetch/$s_!akia!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 848w, https://substackcdn.com/image/fetch/$s_!akia!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 1272w, https://substackcdn.com/image/fetch/$s_!akia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a7923-de64-4b5f-aa2d-069a391cee3b_1600x773.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Switching our ingress gateway to <strong>Local</strong> mode immediately solved our 520/524 error problem, reducing errors to nearly zero. 
It also had a latency benefit: fewer hops mean less delay in the traffic path.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Mcc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Mcc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 424w, https://substackcdn.com/image/fetch/$s_!1Mcc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 848w, https://substackcdn.com/image/fetch/$s_!1Mcc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 1272w, https://substackcdn.com/image/fetch/$s_!1Mcc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Mcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png" width="1456" height="925" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Mcc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 424w, https://substackcdn.com/image/fetch/$s_!1Mcc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 848w, https://substackcdn.com/image/fetch/$s_!1Mcc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 1272w, https://substackcdn.com/image/fetch/$s_!1Mcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf76cd39-0282-49e3-b96e-49ccc2b3fb6e_1600x1017.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, the <strong>Local</strong> mode comes with trade-offs, also documented by Azure:</p><ul><li><p><strong>Uneven load balancing:</strong> Traffic only flows to nodes hosting pods, which might create load hotspots.</p></li><li><p><strong>Pod scheduling considerations:</strong> If your ingress pods relocate frequently, traffic may be temporarily disrupted as Azure&#8217;s Load Balancer re-adjusts.</p></li></ul><p>Due to these trade-offs, <strong>Local</strong> isn't Kubernetes&#8217; default choice. While our adoption of Local mode proved stable, we still consider it a mitigation rather than a complete solution. 
Ideally, Azure would provide a more robust fix, such as better node eviction handling or configurable backend pools.</p><p>Until then, Local mode keeps our traffic stable and helps us sleep better at night.</p><h1>Summary</h1><p>We operate several large AKS clusters that experience regular node churn because of spot instances and our autoscaling policies. While investigating persistent but elusive 520/524 errors, we uncovered a fundamental limitation in Azure Load Balancer's default <strong>Cluster</strong> traffic policy.</p><p>The key problems we found:</p><ul><li><p><strong>Insufficient Documentation: </strong>The documentation doesn&#8217;t clearly warn users about the reliability implications of the default <strong>Cluster</strong> policy.</p></li><li><p><strong>Default Mode Pitfalls: </strong>AKS defaults to <strong>Cluster</strong> mode, which may suit smaller or stable clusters but is problematic at scale, especially when spot nodes are present.</p></li><li><p><strong>Limited Flexibility: </strong>AKS currently offers no way to selectively exclude unstable nodes or node pools from the Load Balancer backend pool, limiting our ability to control availability.</p></li><li><p><strong>Suboptimal Node Termination: </strong>AKS does not gracefully deregister terminating nodes from Azure Load Balancer backends, leaving windows of vulnerability during node evictions.</p></li></ul><p>Switching to the <strong>Local</strong> external traffic policy significantly reduced our ingress availability issues. But <strong>Local</strong> comes with trade-offs, such as uneven load distribution and increased scheduling complexity. 
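</p><p>To make the uneven-distribution trade-off concrete, here is a small sketch. It assumes the load balancer spreads traffic evenly across healthy nodes and that each node spreads its share evenly across its local pods; the node names and the <code>per_pod_share</code> helper are hypothetical:</p><pre><code>
```python
from fractions import Fraction


def per_pod_share(pods_per_node):
    """Return each pod's fraction of total traffic under the Local policy.

    Assumes traffic is split evenly across nodes that host at least one
    pod, and that each node splits its share evenly across its own pods.
    pods_per_node is a hypothetical mapping of node name to local pod count.
    """
    serving = {node: pods for node, pods in pods_per_node.items() if pods != 0}
    if not serving:
        return {}
    node_share = Fraction(1, len(serving))
    shares = {}
    for node, pods in serving.items():
        for index in range(pods):
            shares[f"{node}/pod-{index}"] = node_share / pods
    return shares


# Three pods packed onto two nodes: the lone pod carries half the traffic.
skewed = per_pod_share({"node-a": 2, "node-b": 1})
# One pod per node (what anti-affinity aims for): every pod carries a third.
spread = per_pod_share({"node-a": 1, "node-b": 1, "node-c": 1})

for name, share in skewed.items():
    print(name, share)
```
</code></pre><p>Under those assumptions, packing two pods onto one node leaves the lone pod on the other node with twice the traffic of its peers, while one pod per node evens the shares out.</p><p>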
To mitigate this, we use pod anti-affinity rules to ensure even distribution.</p>]]></content:encoded></item><item><title><![CDATA[ML Infrastructure Doesn’t Have To Suck]]></title><description><![CDATA[Our journey to build a simple but effective user-centric ML stack]]></description><link>https://techblog.atoms.co/p/ml-infrastructure-doesnt-have-to</link><guid isPermaLink="false">https://techblog.atoms.co/p/ml-infrastructure-doesnt-have-to</guid><pubDate>Thu, 06 Mar 2025 00:50:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FP5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FP5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!FP5j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 424w, https://substackcdn.com/image/fetch/$s_!FP5j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 848w, https://substackcdn.com/image/fetch/$s_!FP5j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 1272w, https://substackcdn.com/image/fetch/$s_!FP5j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FP5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png" width="1729" height="970" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1729,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2001486,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://techblog.citystoragesystems.com/i/158400481?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f276641-daa8-43c6-aa36-79a1be39cdd3_1862x1030.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FP5j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 424w, https://substackcdn.com/image/fetch/$s_!FP5j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 848w, https://substackcdn.com/image/fetch/$s_!FP5j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 1272w, https://substackcdn.com/image/fetch/$s_!FP5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2bbd38-4719-417a-b508-5092d72112e1_1729x970.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><em>Written by <a href="https://www.linkedin.com/in/kmanamcheri/">Karthik Manamcheri</a> and <a href="https://www.linkedin.com/in/alexander-filipchik-7946894/">Alex Filipchik</a>, who lead the data and the infrastructure teams.</em></p><h1>Introduction</h1><p>In a perfect world, ML infrastructure would work like a well-oiled machine, balancing competing needs for flexibility, usability, maintainability, and cost effectiveness. Time from idea to production would be mere minutes. But let's be honest: many companies, including ours, often fall short. Users face a jigsaw puzzle of systems cobbled together with digital duct tape. "Synergy" isn&#8217;t exactly the word that comes to mind.</p><p>Over the last year, we&#8217;ve focused first and foremost on the productivity of our data scientists. We swapped out some conventional tools and patterns. Based on our initial results, our data science teams estimate that iteration speed has already increased severalfold.</p><h1>Lessons from Our First Attempt</h1><p>Our initial ML infrastructure was ambitious but flawed. We aimed to build an industry-leading, future-proof system that would support not-yet-defined use cases and avoid vendor lock-in. 
This led us to a tech stack that included:</p><ul><li><p><a href="https://kubernetes.io/">Kubernetes</a> and <a href="https://istio.io/">Istio</a> for the compute and service mesh</p></li><li><p><a href="https://trino.io/">Trino</a> and <a href="https://hudi.apache.org/">Apache Hudi</a> as the data warehouse layer</p></li><li><p><a href="https://argoproj.github.io/">Argo</a> and a modified Kubeflow Pipelines SDK for workload orchestration</p></li><li><p><a href="https://mlflow.org/">MLflow</a>, <a href="https://jupyter.org/">Jupyter Notebooks</a>, and <a href="https://www.seldon.io/">Seldon</a> for rapid model iteration</p></li><li><p><a href="https://bazel.build/">Bazel</a> as our build system</p></li></ul><p>While this setup offered flexibility and extensibility, it was overly complicated. Data scientists faced steep learning curves with containers, Helm, and YAML. Simple ML tasks took over 10 minutes to start, and setup times for more complex tasks could exceed 45 minutes. Bazel, though powerful for languages like Java and Go, was cumbersome for Python due to compatibility issues.</p><p>Through this experience, we learned several key lessons:</p><ul><li><p><strong>Complexity Hinders Productivity</strong>: Even powerful tools can become a liability if they're too complicated.</p></li><li><p><strong>User Experience is Crucial</strong>: Usability has to be a first-class design goal, because a stack is only as productive as it is easy to use.</p></li><li><p><strong>Flexibility vs. Functionality</strong>: It&#8217;s important to balance flexibility with the need for a simpler, more maintainable system.</p></li></ul><p>Realizing these complexities were holding us back, we knew it was time for a change. In the next section, we&#8217;ll explore our new, user-focused ML infrastructure. 
This redesigned stack eliminates the pain points of the past while maintaining the scalability and ensuring we are prepared for future needs.</p><h1>The New ML Infrastructure</h1><p>Internally, we call our stack the &#10024;<strong>DREAM</strong>&#10024; stack!</p><p><strong>D</strong>aft<br><strong>R</strong>ay<br>po<strong>E</strong>try<br><strong>A</strong>rgo<br><strong>M</strong>etaflow</p><p>Our new ML Infrastructure can be viewed in roughly 3 layers: <strong>Infrastructure</strong>, <strong>Services</strong>, and <strong>Libraries</strong>. Here is a diagram of each layer and corresponding tech stack:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IGgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IGgE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 424w, https://substackcdn.com/image/fetch/$s_!IGgE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 848w, https://substackcdn.com/image/fetch/$s_!IGgE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 1272w, https://substackcdn.com/image/fetch/$s_!IGgE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IGgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png" width="604" height="254.7087912087912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1456,&quot;resizeWidth&quot;:604,&quot;bytes&quot;:181651,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.citystoragesystems.com/i/158400481?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IGgE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 424w, https://substackcdn.com/image/fetch/$s_!IGgE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 848w, https://substackcdn.com/image/fetch/$s_!IGgE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IGgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3398eb01-e068-49a2-968d-235be3a977a7_1526x644.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong><a href="https://kubernetes.io/">Kubernetes</a>: </strong>As we mentioned, we are a big Kubernetes kitchen (look at our cool Kubernetes related blog posts <a href="https://techblog.citystoragesystems.com/p/managing-100s-of-kubernetes-clusters">here</a>, <a href="https://techblog.citystoragesystems.com/p/swapping-disks-in-kubernetes">here</a>, and <a 
href="https://techblog.citystoragesystems.com/p/kubernetes-self-healing">here</a>) and we run all our applications and services on it. Both our old and new ML infrastructure run on Kubernetes, so there are no changes there.</p><p><strong><a href="https://argoproj.github.io/">Argo</a>:</strong> Argo has been our trusted workflow orchestrator for the past five years. However, it does come with some challenges. First, it&#8217;s quite low-level, often diving into infrastructure details that users would rather not deal with. The primary interface is YAML, which can become complex quickly, even with language wrappers. Another challenge is that developing and testing a workflow locally is significantly different from running it in a cluster. Local development requires custom CRDs and ongoing maintenance, making the process a bit complicated. To address these issues, we introduced Metaflow, which tackles these specific challenges and simplifies the workflow by keeping Argo behind the scenes.</p><p><strong><a href="https://metaflow.org/">Metaflow</a>: </strong>This is a user-centric framework that helps data scientists and engineers build, manage, and deploy data workflows at scale. Metaflow lets you write and debug workflows locally, then scale them to production with Argo (or another orchestrator) when you're ready to go big. It delivers the 'infinite laptop' experience by integrating smoothly with Argo and Kubernetes, giving you the power of the cloud without the headache of YAML. And since it&#8217;s all in Python, our users can finally leave the YAML struggles behind and focus on what they do best: creating impactful ML projects.</p><p><strong><a href="https://www.ray.io/">Ray</a>: </strong>Ray is a distributed compute engine for data and AI workloads. Ray serves up a high-performance cluster with a side of Python libraries, making it a breeze to interact with. 
It&#8217;s got a menu of ready-to-use libraries for ML tasks, such as Ray Data for preprocessing, Ray Train for training, and Ray Serve for serving.</p><p>Ray is positioned primarily as a compute engine for AI/ML tasks. Given its architecture, we believe it can also excel in data processing, potentially disrupting well-established frameworks like Spark and Flink. That's why we are doubling down on Ray as our main data compute framework. However, Ray's library for data processing, Ray Data, doesn't cover the full range of DataFrame/ETL functions, and its performance could be improved. That's where Daft comes in to save the day.</p><p><strong><a href="https://www.getdaft.io/">Daft</a>: </strong>Our latest and greatest kitchen gadget for ETL, analytics, and AI/ML at scale. It fills the gaps in Ray Data by providing DataFrame APIs that cover our needs. In our tests, it&#8217;s faster than Spark and uses fewer resources. Plus, its seamless integration with Ray has made it our go-to for whipping up scalable, high-performance data workloads. Our users are loving its APIs and experience. It makes working with big datasets a breeze. And since it&#8217;s written in Rust, it&#8217;s practically a tech Twitter darling by default &#128539;</p><p><strong><a href="https://python-poetry.org/">Poetry</a>:</strong> Given the challenges of mixing Bazel and Python (an oil-and-water scenario), we sought a more suitable alternative and landed on Poetry. This well-established tool proved to be both pluggable and extensible, perfectly aligning with our needs. Its ability to support multiple library versions has been a game-changer, accommodating Python&#8217;s diverse and ever-evolving ecosystem. The adoption of Poetry has significantly improved the efficiency of our Python developers, reducing dependency conflicts and simplifying library updates. 
While Bazel remains our tool of choice for other languages, Poetry has streamlined our Python workflows, making development smoother and more productive.</p><p>In the next section, we&#8217;ll walk through a case study that shows the new ML stack in action.</p><h1>Case Study: Calculating high-demand zones</h1><p>At CloudKitchens, we&#8217;re all about turning chaos into culinary gold, and that starts with data. Let us tackle a deceptively simple task: calculating <strong>high-demand zones</strong> for food orders. Why? Because it is very useful to know where people are hungry before building out a new kitchen facility.</p><h2>The Challenge: Find Hungry Zones Before Dropping $$$ on Real Estate</h2><p>This would have taken us a week (or more) to build on our previous stack. It would be amazing if we could do this in a single day. Below, we&#8217;ll show how we did it with our new stack. As with every problem, data is the first step&#8230;</p><h2>Step 0: Scaffolding the Dream (aka Hello, World! &#128075;)</h2><p>Every epic begins with a humble start. Ours starts with a folder.</p><pre><code>mkdir my_project1
cd my_project1
poetry init</code></pre><p>&#127881; Voil&#224;! A Python project is born. If you&#8217;re new to Poetry, the <code>pyproject.toml</code> file is like the ingredient list for our project. Here&#8217;s a quick look:</p><pre><code>[tool.poetry]
name = "my-project1"
version = "0.1.0"
description = ""
authors = ["my.name &lt;my.name@cloudkitchens.com&gt;"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.12"
css-metaflow = "*"
h3 = "*"
css-dw = "*"
...</code></pre><p>It&#8217;s basic for now, but we&#8217;re about to turn this into a Michelin-starred script.</p><h2>Step 1: Defining the Flow (the recipe &#128104;&#8205;&#127859;)</h2><p>We need to break down the task into manageable bites. Enter Metaflow&#8212;our trusty sous-chef for data workflows. Here&#8217;s what our recipe looks like:</p><ol><li><p>Start: Logs that something&#8217;s happening.</p></li><li><p>Read: Get the order data.</p></li><li><p>Calculate: Identify high-demand zones.</p></li><li><p>Write: Save results for future feasting.</p></li><li><p>End: Logs that we&#8217;re done.</p></li></ol><p>Here&#8217;s the skeleton of our flow:</p><pre><code>from metaflow import FlowSpec, step, poetry

@poetry
class MyFlow(FlowSpec):
    @step
    def start(self):
        print("Starting the flow!")
        self.next(self.read)

    @step
    def read(self):
        # Fetch data
        self.next(self.calculate)

    @step
    def calculate(self):
        # Calculate high-demand zones
        self.next(self.write)

    @step
    def write(self):
        # Save results
        self.next(self.end)

    @step
    def end(self):
        print("Flow complete!")

if __name__ == "__main__":
    MyFlow()</code></pre><p>This file is a straightforward Python script with a few well-defined functions, but there are three key elements worth noting:</p><ol><li><p><strong>Inheriting from </strong><code>FlowSpec</code><strong>:</strong> Subclassing <code>FlowSpec</code> turns the class into a Metaflow flow, so Metaflow can run it as a workflow.</p></li><li><p><strong>Decorating functions with </strong><code>@step</code><strong>:</strong> Each function marked with <code>@step</code> becomes a distinct step in the workflow, orchestrated by Metaflow.</p></li><li><p><strong>Decorating the class with </strong><code>@poetry</code><strong>:</strong> This is our in-house Metaflow plugin decorator. It reads the <code>pyproject.toml</code> file and makes all specified dependencies available within the Metaflow environment.</p></li></ol><h2>Step 2: Implementing the Flow (the cooking &#127859;)</h2><p>Here&#8217;s where the magic happens:</p><ul><li><p><strong>Read:</strong> Pulls data on fulfilled orders.</p></li><li><p><strong>Calculate:</strong> Groups orders by geographical <strong>H3 hexes</strong> and sums the order subtotals in each hex.</p></li><li><p><strong>Write:</strong> Saves the results to our data warehouse.</p></li></ul><p>Here&#8217;s the same flow as in the previous step, now with the implementation filled in:</p><pre><code>import logging

from metaflow import FlowSpec, poetry, step, project
import css_dw as dw
import h3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@poetry
class MyFlow(FlowSpec):
   """Calculates high-demand zones."""

   @step
   def start(self) -&gt; None:
       """Start the flow."""
       logger.info("MyFlow run started!")
       self.next(self.read)

   @step
   def read(self) -&gt; None:
       """Reads order data from the data warehouse."""
       self.read_table_result = dw.read_sql_as_table(
           sql="""
               SELECT order_id, latitude, longitude,
                      subtotal, order_time
               FROM fulfilled_orders
               WHERE city='San Francisco, CA' AND
                     order_time &gt;= '2024-01-01'
           """,
       )
       self.next(self.calculate_high_demand_zones)

   @step
   def calculate_high_demand_zones(self) -&gt; None:
       """Calculates high-demand zones."""
       df = dw.read_sql_as_df(
           schema=self.read_table_result.schema_name,
           table=self.read_table_result.table_name,
           df_type="daft",
       )
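       # Resolution 8 hexagons average roughly 0.7 km^2 each.
       h3_resolution = 8
       # geo_to_h3 is the h3-py 3.x name; h3-py 4.x calls it latlng_to_cell.
       df = df.with_column(
           "h3_area",
           df["latitude"].zip(df["longitude"]).map(
               lambda ll: h3.geo_to_h3(ll[0], ll[1], h3_resolution)
           )
       )
       # Sum order subtotals per hex; the parentheses let the chain span lines.
       self.summary_df = (
           df.group_by("h3_area")
           .agg({"subtotal": "sum"})
           .rename({"subtotal_sum": "Total Orders"})
       )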
       self.next(self.write)

   @step
   def write(self) -&gt; None:
       """Writes high-demand zones to the data warehouse."""
       self.result = dw.write_df_as_table(
           df=self.summary_df,
           mode="overwrite",
       )
       logger.info("Output written to DW.")
       self.next(self.end)

   @step
   def end(self) -&gt; None:  # pylint: disable=no-self-use
       """End the flow."""
       logger.info("MyFlow is ending")

if __name__ == "__main__":
   MyFlow()</code></pre><p><strong>The result?</strong> A table of hexagonal zones with the summed order subtotals for each zone. Think of it as a heatmap for hangry customers.</p><h2>Step 3: Scaling Up (Because 200 Million Rows Is a Lot)</h2><p>If you&#8217;re working with <em>millions</em> of orders, your laptop might throw in the towel. Enter <strong>Ray</strong>, our scalability hero. With just one decorator, we turn the &#8220;calculate&#8221; step into a parallelized powerhouse:</p><pre><code>@step
@raystep
def calculate_high_demand_zones(self) -&gt; None:
 ...</code></pre><p>That&#8217;s it. Ray handles the heavy lifting, so you can focus on the fun stuff.</p><p>The <code>@raystep</code> decorator is an in-house Metaflow extension, much like <code>@poetry</code>. It seamlessly integrates Ray into your workflow by submitting the code package to a Ray cluster for execution. This allows your tasks to scale effortlessly across multiple nodes without requiring any changes to your underlying code.</p><h2>Step 4: Automating the Flow (Because Daily Data Is Best Data)</h2><p>Want this flow to run daily? No problem. Add a schedule decorator:</p><pre><code>@schedule(daily=True)
@poetry
class MyFlow(FlowSpec):
...</code></pre><p>Then deploy it with Argo Workflows:</p><pre><code>poetry run python my_project/flow1.py argo-workflows create</code></pre><h2>Result: The DREAM Stack in Action</h2><p>By combining the best tools, we created a scalable, maintainable, and automated system for identifying high-demand zones. This makes business decisions, such as buying real estate for future cloud kitchens, a breeze! It also keeps us on the cutting edge of data processing tools. The tools we have picked are less than 5 years old and are being actively developed.</p><h1>Wrapping it all up</h1><p>In our journey to revamp our ML infrastructure, we've learned a lot about balancing flexibility, usability, and performance. The result? A highly efficient and user-friendly environment for our data scientists and engineers.</p><p>Our new setup, the DREAM stack, includes:</p><ul><li><p>Daft: For data manipulation and calculations</p></li><li><p>Ray: For scalability</p></li><li><p>Poetry: For dependency management</p></li><li><p>Argo: For workflow automation</p></li><li><p>Metaflow: For a superior developer experience</p></li></ul><p>Together, they make data engineering simple, effective, and easy to use.</p><p>The impact has been immediate: our new data scientists can now spin up projects in a matter of hours, a process that previously took weeks. This increased productivity is a testament to the effectiveness of the DREAM stack.</p><p>Thank you for joining us on this journey. We hope our insights and experiences help you build an effective ML infrastructure that doesn't suck. We&#8217;re going to keep pushing and making this even better. If you&#8217;re passionate about working on this, <a href="https://www.linkedin.com/in/kmanamcheri/">message me on LinkedIn</a> or <a href="https://cloudkitchens.com/careers/">look at our positions</a>. 
We are always looking for people to help us push the boundaries.</p>]]></content:encoded></item><item><title><![CDATA[Easy as Pie: Stateful Services at Atoms]]></title><description><![CDATA[How Atoms uses Splitter to build real-time, stateful services that horizontally scale.]]></description><link>https://techblog.atoms.co/p/easy-as-pie-stateful-services-at</link><guid isPermaLink="false">https://techblog.atoms.co/p/easy-as-pie-stateful-services-at</guid><pubDate>Fri, 25 Oct 2024 16:24:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lvY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lvY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lvY3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 424w, https://substackcdn.com/image/fetch/$s_!lvY3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 848w, https://substackcdn.com/image/fetch/$s_!lvY3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lvY3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lvY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png" width="1456" height="970" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lvY3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 424w, https://substackcdn.com/image/fetch/$s_!lvY3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 848w, https://substackcdn.com/image/fetch/$s_!lvY3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lvY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3166d5-133a-4fc6-a7ea-65a10266f52b_1600x1066.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Written by Jordan Hurwitz and Henning Rohde, members of the engineering teams that work on infrastructure.</em></p><p>At Atoms, our software systems interact with humans in kitchens around the world in real time. 
From <a href="https://techblog.citystoragesystems.com/p/robotic-order-conveyance">robotic conveyance</a> to reliable <a href="https://techblog.citystoragesystems.com/p/reliable-order-processing">messaging infrastructure</a>, these systems need to &#8220;just work&#8221; to not impede food order fulfillment &#8211; yet, at scale, it is traditionally difficult to design and build them well enough to satisfy our latency, correctness and reliability requirements.</p><p>There are no simple cookie cutter options. On one hand, stateless designs that rely on databases tend to be too slow. On the other hand, stateful designs that coordinate using Etcd, Redis or internal consensus protocols do achieve low latency, but tend to be too complex and ultimately incorrect or unreliable. Older versions of our systems ran into these conundrums.</p><p>To overcome them, we built an <em>opinionated</em> sharding service, Splitter, aimed at exclusive in-memory ownership with dynamic explicit assignments, load balancing, and client-controlled routing logic. Today, our most critical, high-performance kitchen systems rely on Splitter under the hood.</p><h2>The Stateless Model</h2><p>The majority of distributed services are designed to be <em>request-driven</em> &#8211; i.e., actions are taken only in response to a request &#8211; and <em>stateless</em> &#8211; i.e., service instances hold no state beyond the request handling. Persistent state is stored externally, typically in a database, and written and fetched on demand. 
More requests to the service are directly reflected in more requests to the database.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A0ou!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A0ou!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 424w, https://substackcdn.com/image/fetch/$s_!A0ou!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 848w, https://substackcdn.com/image/fetch/$s_!A0ou!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 1272w, https://substackcdn.com/image/fetch/$s_!A0ou!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A0ou!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png" width="1456" height="659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A0ou!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 424w, https://substackcdn.com/image/fetch/$s_!A0ou!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 848w, https://substackcdn.com/image/fetch/$s_!A0ou!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 1272w, https://substackcdn.com/image/fetch/$s_!A0ou!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed128e5-c8ab-4efe-8030-80be75ddd723_1472x666.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Fig 1. A request-driven stateless service with consistency handled by the database. Local caching is not possible if we need strong read-after-write consistency.</figcaption></figure></div><p>There are a number of perceived advantages to this model:</p><ul><li><p><strong>Scalability</strong>. Horizontal scalability can be nominally achieved simply by deploying more instances of the service. However, external dependencies must have their own scalability story and databases often become a bottleneck without careful capacity planning and expertly-crafted database queries and indices.</p></li></ul><ul><li><p><strong>Simplicity</strong>. Service implementation complexity starts lower, since the service limits itself to request handling. However, as requirements on the service increase, they become harder and more expensive to implement, paradoxically ending in an overall more complex system. Low latency and consistency become hard to achieve. How do you rate-limit specific resources? How do you invalidate caches correctly? 
How do you design performant push-based apis?&nbsp;</p></li></ul><p>Stateless services often end up depending on a battery of external services to overcome these challenges. Caches are used to reduce database round-trips. Timer services provide request triggers. Message queues are used to chain together multiple stateless services. But each dependency adds failure modes, latency, and inevitably introduces its own problems.</p><h2>The Sharded Model</h2><p>As an alternative, services can be designed with internal coordination and statefulness instead of necessarily relying on external services. It opens up both powerful opportunities and an abundance of pitfalls. At the heart of its difficulty is horizontal scalability.</p><p>The <em>sharded</em> model scales by breaking up ownership of the work domain (e.g. users, orders, stores) into <em>shards</em> and dividing it among service instances. Each shard owner holds exclusive control over certain actions, depending on the use case. For example, a service can use shard ownership to maintain an authoritative in-memory cache backed by the usual database. 
By making all read and write requests go through the shard owner, it can return cached values from memory with a global read-after-write consistency guarantee &#8211; in contrast to the stateless design above.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qGpY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qGpY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 424w, https://substackcdn.com/image/fetch/$s_!qGpY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 848w, https://substackcdn.com/image/fetch/$s_!qGpY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 1272w, https://substackcdn.com/image/fetch/$s_!qGpY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qGpY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png" width="1456" height="659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qGpY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 424w, https://substackcdn.com/image/fetch/$s_!qGpY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 848w, https://substackcdn.com/image/fetch/$s_!qGpY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 1272w, https://substackcdn.com/image/fetch/$s_!qGpY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5405e8-8472-4a0a-9c22-4446319b1288_1472x666.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 2. A stateful service with authoritative write-through cache. Reads can be served from memory with strong consistency.</figcaption></figure></div><p>The sharded model provides low-latency and correctness at scale when done well.</p><p>But what about reliability and complexity?</p><p>This is indeed where one easily runs into trouble. In addition to a partitionable work domain, sharded services implicitly also need:</p><ol><li><p><strong>Ownership</strong>. A mechanism for assigning and distributing shard ownership; and</p></li><li><p><strong>Routing</strong>. A mechanism for routing requests to current shard owners.</p></li></ol><p>Two approaches are common: either roll your own internal work distribution mechanism, using a gossip or consensus algorithm or a persistence-based sharding library; or rely on a generalized data store for coordination that implements consensus under the hood like Etcd or Zookeeper. Both approaches require significant commitment and tend to get complex fast. 
Advanced concerns such as re-sharding, multi-region domains, or graceful handover are often neglected.</p><p>Stateless services are widespread for good reasons. At Atoms, we took a different approach.</p><h2>The Splitter Model</h2><p>Splitter is a highly-resilient multi-region control-plane service for distributing work to connected clients. Splitter makes a number of opinionated tradeoffs aimed at reliability and ergonomics for sharding our real-time kitchen services.&nbsp;</p><p>First of all, it targets reality in a modern cloud environment:</p><ul><li><p><strong>Dynamic</strong>: service instances are ephemeral. They generally do not use local physical disks and may come and go as the service scales up or down. There should be no special logic needed for auto-scaling nor regional failure. Shards should be load-balanced across whatever instances are connected and shard re-assignments might be mildly disruptive at worst, but fundamentally benign.</p></li></ul><ul><li><p><strong>Multi-region</strong>: services are deployed across multiple regions for geo-resiliency. Shard assignments should align well with underlying databases, where we know or control the data placement for region-local latency. Shard placement decisions should not be the concern of the service and operational controls around data location should be available.</p></li></ul><p>The key needs of the sharded model are handled as follows:</p><ol><li><p><strong>Ownership</strong>. Domain and shard management is configured and handled centrally in Splitter, which internally uses Raft for distributed storage and coordination. At runtime, service instances establish a connection to Splitter to start receiving leased shard assignments. If an instance crashes or loses connectivity for too long, its lease lapses and its shards are assigned to other connected instances.</p></li></ol><ol start="2"><li><p><strong>Routing</strong>. 
Connected instances receive up-to-date shard routing information, i.e., which instance owns what, that can be used for internal forwarding and fanout. This &#8220;routing logic as a library&#8221; approach is essential: it works for both batch and streaming applications, supports non-1:1 routing and leaves the service in control of the data path.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!InxZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!InxZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 424w, https://substackcdn.com/image/fetch/$s_!InxZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 848w, https://substackcdn.com/image/fetch/$s_!InxZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!InxZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!InxZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png" width="1456" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://techblog.atoms.co/i/149818978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!InxZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 424w, https://substackcdn.com/image/fetch/$s_!InxZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 848w, https://substackcdn.com/image/fetch/$s_!InxZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!InxZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f4ed12-643f-4ae7-b73c-b5e3edf1d3f1_2216x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 3. A sharded service using Splitter. Each shard holds exclusive ownership and can readily be used as an authoritative write-through cache.</figcaption></figure></div><p>Splitter deliberately supports only three kinds of work domains: UNIT (singleton), GLOBAL (UUIDs) and REGIONAL (UUIDs with region). A singleton domain is used for leader election in certain low-scale and advanced use cases. A non-singleton shard is a half-open UUID range for a domain. This seemingly peculiar choice provides a number of advantages over discrete or custom shard spaces (e.g. integers or strings). UUIDs force a uniform distribution that naturally avoids hotspots, and shards can be evenly split or merged dynamically. 
Since services in practice never know their eventual scale upfront as the business evolves, no-downtime re-sharding is a crucial feature. Most entity identifiers are UUIDs as well.</p><p>Finally, Splitter is also notable for what it does not do. It does not offer storage, instead assuming services use a database for such needs. It does not offer distributed locks, favoring instead &#8220;ownership is locking&#8221;. It does not offer dynamic shard weighting, due to its complex service interaction. Choice is great when it comes to food, but not always for coordination primitives.</p><h2>Sharding in Practice</h2><p>In this post, we&#8217;ve alluded to <a href="https://techblog.citystoragesystems.com/p/robotic-order-conveyance">Robotic Conveyance Routing (RCR)</a> and <a href="https://techblog.citystoragesystems.com/p/reliable-order-processing">Keyed Event Queue (KEQ)</a> as examples of real-time applications that rely on Splitter in different ways for internal distributed state and coordination.</p><p>For RCR, one of the main challenges is that controlling robots involves high-frequency location sensor telemetry that far exceeds what is practical to persist in a database. RCR uses the facility ID as the key in a GLOBAL work domain, with each facility handler controlling all robots in its facility. The facility handler is responsible for both subscribing to telemetry feeds for its robots and communicating tasks to the connected robots. It is critical that exactly one connection is made per robot to the vendor broker. 
Robot telemetry largely resides in memory for real-time control, unlike persisted task queues and robot metadata.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6BEo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6BEo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 424w, https://substackcdn.com/image/fetch/$s_!6BEo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 848w, https://substackcdn.com/image/fetch/$s_!6BEo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 1272w, https://substackcdn.com/image/fetch/$s_!6BEo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6BEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png" width="1362" height="864" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1362,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6BEo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 424w, https://substackcdn.com/image/fetch/$s_!6BEo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 848w, https://substackcdn.com/image/fetch/$s_!6BEo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 1272w, https://substackcdn.com/image/fetch/$s_!6BEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274c6670-bd1f-4169-9d5a-06be17e34b4c_1362x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 4. Facility shard handlers coordinating communication between internal services and the external robot vendor MQTT broker.</figcaption></figure></div><p>KEQ is a more advanced example. KEQ is a multi-region message broker that stores messages across millions of queues in a multi-region distributed database for resiliency. A key challenge is that such a database incurs high transaction latencies, which is not conducive to real-time message delivery.</p><p>For each topic, KEQ uses a Splitter domain to maintain consumer progress in memory and a small authoritative write-through cache for enqueued messages. New messages are briefly cached and then evicted once all consumers have processed them, usually within seconds. In steady state, KEQ does not read messages from the database at all, even when consumers are slightly out of sync with each other. 
This setup is eminently effective for low-latency message delivery while respecting per-queue ordering requirements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sE8T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sE8T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 424w, https://substackcdn.com/image/fetch/$s_!sE8T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 848w, https://substackcdn.com/image/fetch/$s_!sE8T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 1272w, https://substackcdn.com/image/fetch/$s_!sE8T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sE8T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png" width="1372" height="904" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:1372,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sE8T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 424w, https://substackcdn.com/image/fetch/$s_!sE8T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 848w, https://substackcdn.com/image/fetch/$s_!sE8T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 1272w, https://substackcdn.com/image/fetch/$s_!sE8T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef4204e8-09cb-4986-8609-deeaedf058ec_1372x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 5. Broadcasting post-commit messages from the authoritative in-memory cache.</figcaption></figure></div><p>KEQ also uses a UNIT domain to elect a leader for assigning and load-balancing subscribers to specific ranges. The leader uses the Splitter routing information to match subscribers with appropriate topic ranges and is an example of non-trivial routing made possible by design.</p><h2>Conclusion</h2><p>The Work Distribution Service is an opinionated control-plane service for shard management. Services integrated with Splitter can take advantage of powerful built-in primitives for shard ownership and routing to satisfy strict latency, correctness and reliability requirements.</p><p>Splitter was built to enable real-time scalable kitchen systems that go beyond what the stateless model can achieve. It succeeded. 
Our most critical, high-performance services rely on Splitter as a secret ingredient and are simpler and more reliable for it.</p>]]></content:encoded></item><item><title><![CDATA[Nuance: Preventing Schema Migrations From Causing Outages]]></title><description><![CDATA[How Atoms automates the analysis of our database schema migrations.]]></description><link>https://techblog.atoms.co/p/nuance-preventing-schema-migrations</link><guid isPermaLink="false">https://techblog.atoms.co/p/nuance-preventing-schema-migrations</guid><pubDate>Wed, 09 Oct 2024 14:45:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UFhr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UFhr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UFhr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 424w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 848w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UFhr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UFhr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png" width="1192" height="572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1192,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UFhr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 424w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 848w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UFhr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by Alexey Pavlenko and Pasha Yakimovich, members of the engineering teams that work on storage infrastructure.</em></p><p>At Atoms, we reviewed all database-related outages over the last few years and found that roughly 80% were caused by schema management issues. 
Our analysis revealed several contributing factors:</p><ol><li><p><strong>Wrong assumptions about the current state of the target database.</strong> For example, a proposed database schema may not match the one subsequently generated by migration software (e.g. Flyway) via incremental updates. This can happen if the production schema was at some point modified manually, without the migration software knowing about it.</p></li><li><p><strong>Wrong assumptions about how the schema migration is executed.</strong> Various database technologies have their own gotcha moments (<a href="https://www.cockroachlabs.com/docs/stable/online-schema-changes">CockroachDB, for example</a>).</p></li><li><p><strong>Not knowing nuances of how a particular database implements a schema.</strong> As in the previous point, there are <a href="https://www.cockroachlabs.com/docs/stable/postgresql-compatibility#:~:text=CockroachDB%20is%20compatible%20with%20version,most%20PostgreSQL%20drivers%20and%20ORMs.">plenty of opportunities to make mistakes</a>.</p></li><li><p><strong>Forgetting the consuming application&#8217;s usage patterns.</strong> A classic example is deleting a column or index that is still in use.</p></li></ol><p>Let&#8217;s walk through a few case studies. All of them were taken from real production outages (simplified for readability). They mostly apply to Postgres and CockroachDB, but are still relevant to other relational database technologies.</p><h3>Example 1: Broken Uniqueness Constraint</h3><p>Say we have a table bad_1 defined as follows. 
What could go wrong?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Afr0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Afr0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 424w, https://substackcdn.com/image/fetch/$s_!Afr0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 848w, https://substackcdn.com/image/fetch/$s_!Afr0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 1272w, https://substackcdn.com/image/fetch/$s_!Afr0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Afr0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png" width="1456" height="389" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:389,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Afr0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 424w, https://substackcdn.com/image/fetch/$s_!Afr0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 848w, https://substackcdn.com/image/fetch/$s_!Afr0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 1272w, https://substackcdn.com/image/fetch/$s_!Afr0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56413953-551c-42f1-b7cc-57303bc3e23f_1600x427.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The problem is that the c1 and c2 columns can both be NULL, as they lack a NOT NULL declaration. Consequently, the UNIQUE constraint won&#8217;t be enforced on the (c1, c2) pair if either is NULL. It&#8217;s worth remembering that NULL in SQL is not a value, but rather the absence of a value. Therefore, a comparison with NULL never evaluates to true (it yields NULL, i.e. unknown), so two rows are not treated as duplicates when either column is NULL, and the table can hold identical rows despite the constraint (multiple rows could contain c1=123, c2=NULL). Applications that rely on the constraint must also ensure that no NULL values are inserted into the table.</p><p>This behavior can be altered via <a href="https://www.postgresql.org/docs/current/indexes-unique.html">NULLS NOT DISTINCT</a> in Postgres 15 and later, but one has to know about this limitation in advance.</p><h3>Example 2: Silently Dropping An Index</h3><p>Let&#8217;s examine a schema migration. 
Say we have a production database defined as follows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mYoX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mYoX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 424w, https://substackcdn.com/image/fetch/$s_!mYoX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 848w, https://substackcdn.com/image/fetch/$s_!mYoX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 1272w, https://substackcdn.com/image/fetch/$s_!mYoX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mYoX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png" width="1456" height="424" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:424,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mYoX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 424w, https://substackcdn.com/image/fetch/$s_!mYoX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 848w, https://substackcdn.com/image/fetch/$s_!mYoX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 1272w, https://substackcdn.com/image/fetch/$s_!mYoX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff37ea313-468f-40fa-98ba-b126cf3777e0_1600x466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Seeing that column c1 is no longer being used by the application, a developer submits a schema update.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AHyg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AHyg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 424w, https://substackcdn.com/image/fetch/$s_!AHyg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 848w, 
https://substackcdn.com/image/fetch/$s_!AHyg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 1272w, https://substackcdn.com/image/fetch/$s_!AHyg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AHyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png" width="323" height="34" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:34,&quot;width&quot;:323,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AHyg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 424w, https://substackcdn.com/image/fetch/$s_!AHyg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 848w, 
https://substackcdn.com/image/fetch/$s_!AHyg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 1272w, https://substackcdn.com/image/fetch/$s_!AHyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed014fbe-a821-4363-bdd5-c21e4afc8981_323x34.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The unexpected consequence of the ALTER statement above is that c2_c1_index will be dropped as well. After all, both columns are required to populate an index entry &#8211; without c1, that is no longer possible. Both CockroachDB and Postgres drop c2_c1_index silently, which may negatively impact the queries that rely on this index (to quickly filter by c2).</p><p>The situation wouldn&#8217;t be as bad if c2 were dropped instead of c1, because an index is normally associated with the first column in its definition. Therefore, we shouldn&#8217;t prevent an index from being dropped if its first column (c2 in this case) is dropped. 
Well, unless the latter column is in use by the application.</p><h3>Example 3: Keeping Two Tables In Sync</h3><p>Here&#8217;s an example where the schema definition could result in suboptimal behavior that may only reveal itself under sizable load.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EXtO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EXtO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 424w, https://substackcdn.com/image/fetch/$s_!EXtO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 848w, https://substackcdn.com/image/fetch/$s_!EXtO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 1272w, https://substackcdn.com/image/fetch/$s_!EXtO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EXtO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png" width="1456" height="317" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EXtO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 424w, https://substackcdn.com/image/fetch/$s_!EXtO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 848w, https://substackcdn.com/image/fetch/$s_!EXtO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 1272w, https://substackcdn.com/image/fetch/$s_!EXtO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22653c15-e5b2-4ff2-8384-f8181a838f8c_1600x348.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>ON UPDATE CASCADE instructs the DBMS (CockroachDB, Postgres) to keep the value of fk in sync with the pk value of the specified table. An update of other_table.pk will automatically change bad_3.fk. 
We can imagine that the following statement is executed in the same transaction that changes other_table.pk.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x5tQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x5tQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 424w, https://substackcdn.com/image/fetch/$s_!x5tQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 848w, https://substackcdn.com/image/fetch/$s_!x5tQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 1272w, https://substackcdn.com/image/fetch/$s_!x5tQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x5tQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png" width="484" height="37" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:37,&quot;width&quot;:484,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x5tQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 424w, https://substackcdn.com/image/fetch/$s_!x5tQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 848w, https://substackcdn.com/image/fetch/$s_!x5tQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 1272w, https://substackcdn.com/image/fetch/$s_!x5tQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0485a935-5c0f-4517-ba2e-2ec10aa144e6_484x37.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This may work fine when our data volume is low or when such requests are infrequent. However, upon a spike of changes to other_table.pk, the DBMS will have to update the matching bad_3.fk values accordingly. And, because there is no index on fk, it will have to run a <strong>full scan</strong> of bad_3 for every one of them. 
Which will lead to rapid performance degradation &#8211; possibly an outage if the table contains a large number of rows.</p><h3>Example 4: Potential Runtime Failure</h3><p>And finally, a simple one.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3OzJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3OzJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 424w, https://substackcdn.com/image/fetch/$s_!3OzJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 848w, https://substackcdn.com/image/fetch/$s_!3OzJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 1272w, https://substackcdn.com/image/fetch/$s_!3OzJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3OzJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png" width="1456" height="208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b84e238-d421-4083-b270-d3d34296e854_1600x229.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3OzJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 424w, https://substackcdn.com/image/fetch/$s_!3OzJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 848w, https://substackcdn.com/image/fetch/$s_!3OzJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 1272w, https://substackcdn.com/image/fetch/$s_!3OzJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b84e238-d421-4083-b270-d3d34296e854_1600x229.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Is it safe to execute a drop of a table, column, or index? It&#8217;s not possible to answer without knowing the context. There can still be references in the application code, or even some 3rd-party system that generates monthly reports. 
We should use the history of queries to some_table to know with a certain degree of confidence.</p><h2><strong>Nuance:</strong> A Better Schema Analysis System</h2><p>We should detect the issues discussed above, and more, as part of our everyday development flow. This would maintain developer velocity, ensure that past outages are not repeated, and eliminate the need for manual intervention by infrastructure maintainers. We built <strong>Nuance:</strong> a schema analysis system to accomplish exactly this!</p><p>The idea to analyze and lint schemas is not new. There are a few open-source products that attack this problem. However, they focus primarily on syntax, while we aim to address issues related to the runtime usage of the database. Usability depends on the signal-to-noise ratio &#8211; ideally, we want to confidently highlight actual problems rather than focus on indentation and naming conventions. This requires a broader perspective, as the database schema alone is not sufficient to identify problems.</p><p>A high-level diagram of the analysis module is presented below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UFhr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UFhr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 424w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 848w, 
https://substackcdn.com/image/fetch/$s_!UFhr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1272w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UFhr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png" width="1192" height="572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1192,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UFhr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 424w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 848w, 
https://substackcdn.com/image/fetch/$s_!UFhr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1272w, https://substackcdn.com/image/fetch/$s_!UFhr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc720edf4-4894-47f4-a4b2-f7bd04f88360_1192x572.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>At this point, it&#8217;s worth noting multiple data sources that are used to produce facts about both the database schema and the 
execution environment.</p><h3>Production Database Schema &amp; Migration Library Metadata</h3><p>Since any schema upgrade will eventually be applied to a running database, it doesn&#8217;t make sense to consider any schema other than the one that is actually deployed to production. It is always possible that the schema committed to the company&#8217;s version control system diverges from the actual one, whether due to early experimentation or ad-hoc mitigations. This should be taken into account when assessing whether the upgrade is safe.</p><p>For example, <a href="https://flywaydb.org/">Flyway</a>, a commonly used migration tool, keeps the previously applied versions in a special table (flyway_schema_history) that is assumed to be in sync with the updates committed to a version control system. Nuance can validate this invariant and warn when it discovers drift. CockroachDB and Postgres, like most database technologies, make it easy to dump the schema; the dump is then parsed, transformed into an AST, and used to derive the necessary facts.</p><h3>Schema Upgrade Statement</h3><p>These are the actual SQL commands executed against a production database to migrate it to the target schema. Since we store schemas as code, these can be extracted from the developer&#8217;s pull request.</p><p>Depending on the chosen upgrade path, this can be either a single file with statements or a list of newly introduced versioned files with incremental updates. Our implementation distinguishes between these approaches so that it can catch issues where the behavior of the migration tooling depends on whether a change is delivered as a single file or as several.</p><h3>Database Runtime Information &amp; Query Log</h3><p>Modern storage technologies expose a lot of information via metrics and system tables. 
A few examples of the latter are <a href="https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS">cumulative statistics</a> in Postgres and <a href="https://www.cockroachlabs.com/docs/stable/crdb-internal">crdb_internal</a> in CockroachDB.</p><p>A full log of queries executed against the target database is also extremely powerful. It enables us to assess the ongoing usage of individual entities, such as tables or columns. The same log can also serve other use cases, such as access audits and performance monitoring.</p><p>We use a system built on top of <a href="https://clickhouse.com/">ClickHouse</a> to store, process, and aggregate both the runtime information and the query log of the entire storage fleet. This provides us with a summary of database usage for a given time period. (We also use the same system to monitor and debug performance issues, detect suboptimal transactions, and submit recommendations on resource tuning.)</p><h3>Other Data Sources</h3><p>We plan to eventually add more data sources. For example, consider the impact of a schema upgrade on <strong>derived systems</strong> such as database changefeeds, Kafka topics (plus other <a href="https://techblog.citystoragesystems.com/p/reliable-order-processing">message queue technologies</a>), various OLAP consumers, and more. Theoretically, we could fetch this information and use it in the analysis to determine whether there could be a failure further down the pipeline.</p><p>It&#8217;s clear at this point that a well-designed system should allow new data sources to be integrated easily, with little to no impact on existing functionality. Such a design requires a certain level of abstraction over the data sources in use. 
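</p><p>As a rough illustration of that abstraction (a sketch only, not Nuance&#8217;s actual code; every name, predicate, and value in it is hypothetical), each data source can be reduced to a callable that yields plain tuples, so that adding a source never touches the code that consumes them:</p>

```python
from typing import Callable, Iterable, Tuple

# Hypothetical shape of a fact: (predicate, arg1, arg2, ...).
Fact = Tuple[str, ...]
# Hypothetical shape of a data source: any callable that yields facts.
FactSource = Callable[[], Iterable[Fact]]


def schema_source() -> Iterable[Fact]:
    """Stand-in for facts derived from a production schema dump."""
    yield ("table_column", "some_table", "c1")
    yield ("table_column", "some_table", "c2")


def query_log_source() -> Iterable[Fact]:
    """Stand-in for facts derived from an aggregated query log."""
    yield ("column_queried", "some_table", "c2")


def collect_facts(sources):
    """Merge facts from every source into one de-duplicated set."""
    store = set()
    for source in sources:
        store.update(source())
    return store


def unqueried_columns(store):
    """Return (table, column) pairs present in the schema but absent from the query log."""
    queried = {fact[1:] for fact in store if fact[0] == "column_queried"}
    return sorted(
        fact[1:]
        for fact in store
        if fact[0] == "table_column" and fact[1:] not in queried
    )


facts = collect_facts([schema_source, query_log_source])
print(unqueried_columns(facts))
```

<p>In this sketch, a new data source is just one more entry in the list passed to collect_facts; the check itself only ever sees tuples and does not know where they came from.</p><p>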
More on that below.</p><h3>Datalog and Fact Store</h3><p>Instead of delivering data in an aggregated state specific to the source (e.g., a full database schema snapshot as SQL or AST), we can split it into individual facts (e.g., table X has column Y, column Y is of type INT, column Y is nullable, index IDX was last used N days ago). This becomes a "database" of facts that can be queried uniformly.</p><p>With a database of facts, the next step is to implement rules that extract interesting properties from it (problems or recommendations) &#8211; optionally unwinding multi-level relationships between individual facts.</p><p>This approach is a natural fit for logic programming languages, particularly <a href="https://en.wikipedia.org/wiki/Prolog">Prolog</a> and <a href="https://en.wikipedia.org/wiki/Datalog">Datalog</a>. We use the latter and rely on the Datalog implementation provided by the <a href="https://github.com/google/mangle">Google Mangle</a> project.</p><p>During validation, information extracted from the data sources above is converted into facts (atoms) that are stored in an in-memory fact store. Individual rules query this store to check for the existence of interesting properties. A successful rule evaluation, depending on the purpose of the rule, indicates either a potential issue or an optimization opportunity. All atoms in the store are unique; therefore, by using atom terms (arguments), it&#8217;s possible to point to the specific change and improve developer awareness.</p><h3>Putting It All Together</h3><p>Let&#8217;s see how it&#8217;s done in practice. Below is the Datalog source code of a rule that checks that no nullable column is part of a UNIQUE constraint (Example 1). 
In our codebase it&#8217;s associated with a Code that uniquely identifies the rule itself, warn-level Severity (not good, but not necessarily outage-inducing), Unit to define the top-level rule and accompanying sub-rules (presented below), and the Predicate to indicate the name of the top-level rule (uniq0001).</p><p>The analysis engine will evaluate the top-level rule, and if it produces results, then there&#8217;s indeed an issue. We can generate its human-readable description by using the accompanying template that receives terms (arguments) of result atoms that uniq0001 produces. This description will point to the exact entity that has a problem (Table and Column).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9edc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9edc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 424w, https://substackcdn.com/image/fetch/$s_!9edc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 848w, https://substackcdn.com/image/fetch/$s_!9edc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9edc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9edc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png" width="1456" height="927" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:927,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9edc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 424w, https://substackcdn.com/image/fetch/$s_!9edc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 848w, https://substackcdn.com/image/fetch/$s_!9edc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9edc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba82da6b-6b37-4037-8997-eab8250d0b16_1600x1019.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s try to solve Example 2. The rule below checks for situations when an ALTER TABLE DROP COLUMN statement tries to drop a column that is secondary to some index. &#8220;Secondary&#8221; here means that the column is either not the first one in the declaration list or is a column that&#8217;s only stored in an index for faster retrieval (e.g. 
a covering index for index-only scans).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0HCh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0HCh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 424w, https://substackcdn.com/image/fetch/$s_!0HCh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 848w, https://substackcdn.com/image/fetch/$s_!0HCh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!0HCh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0HCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png" width="1456" height="964" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:964,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0HCh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 424w, https://substackcdn.com/image/fetch/$s_!0HCh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 848w, https://substackcdn.com/image/fetch/$s_!0HCh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!0HCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8522b1c-8116-4e23-8dfc-155ac84e164e_1600x1059.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To make things more complicated, imagine a chain of tables that point to one another via foreign keys (<a href="https://www.postgresql.org/docs/current/tutorial-fk.html">official reference</a>). A CASCADE drop of the first column in a chain may trigger respective drops in all linked tables, which at some point may lead to exactly the same issue (unwanted index drop). A more sophisticated rule (only a part of it is presented here for brevity) may check for that as well. 
(This is also applicable to Example 3.)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c0wk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c0wk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 424w, https://substackcdn.com/image/fetch/$s_!c0wk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 848w, https://substackcdn.com/image/fetch/$s_!c0wk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 1272w, https://substackcdn.com/image/fetch/$s_!c0wk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c0wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png" width="1456" height="640" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c0wk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 424w, https://substackcdn.com/image/fetch/$s_!c0wk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 848w, https://substackcdn.com/image/fetch/$s_!c0wk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 1272w, https://substackcdn.com/image/fetch/$s_!c0wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010d39bf-a321-4ab6-a626-74a95ff814a0_1600x703.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Finally, having table, column, and index usage information from recent queries (transformed into corresponding facts) lets us solve Example 4 trivially. All we need to do is check whether the table has been accessed within the last N days (where N is sufficiently large).</p><p>As we can see, Datalog is quite expressive. It became evident during implementation that any generic approach based on a more common programming language (e.g., Go) would need to reimplement some portion of a Datalog engine anyway &#8211; so we&#8217;re happy with our choice!</p><h2>Developer Experience</h2><p>Nuance appears to developers as just another lint check triggered by a new or updated pull request. If there are no issues, the pull request can be merged as usual. 
Otherwise, we block it and present resolution opportunities.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NEP8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NEP8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 424w, https://substackcdn.com/image/fetch/$s_!NEP8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 848w, https://substackcdn.com/image/fetch/$s_!NEP8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 1272w, https://substackcdn.com/image/fetch/$s_!NEP8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NEP8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png" width="1456" height="863" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:863,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NEP8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 424w, https://substackcdn.com/image/fetch/$s_!NEP8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 848w, https://substackcdn.com/image/fetch/$s_!NEP8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 1272w, https://substackcdn.com/image/fetch/$s_!NEP8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97ba2b6-6091-4dda-b9e2-11bcf3476e11_1600x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Not all detected issues require immediate action. Sometimes, users are confident in their changes, or the detected issues are merely informative. To accommodate these scenarios, we left an escape hatch enabling users to acknowledge and skip the detected issues.&nbsp;</p><h3>Command Line Tooling</h3><p>We&#8217;ve augmented Nuance with two important CLI commands.</p><p>The validate command enables developers to submit their SQL scripts for analysis and issue detection. Behind the scenes, this command spins up the same validation process that runs for pull requests. 
This includes fetching the production or staging schema and retrieving real-time query data, ensuring a comprehensive analysis.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pNMC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pNMC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 424w, https://substackcdn.com/image/fetch/$s_!pNMC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 848w, https://substackcdn.com/image/fetch/$s_!pNMC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 1272w, https://substackcdn.com/image/fetch/$s_!pNMC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pNMC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png" width="1380" height="335" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pNMC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 424w, https://substackcdn.com/image/fetch/$s_!pNMC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 848w, https://substackcdn.com/image/fetch/$s_!pNMC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 1272w, https://substackcdn.com/image/fetch/$s_!pNMC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7602890a-0305-4d1f-ba9c-d283c2586490_1380x335.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The inspect command inspects the live database schema for any existing issues.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!4Hkd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Hkd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 424w, https://substackcdn.com/image/fetch/$s_!4Hkd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 848w, https://substackcdn.com/image/fetch/$s_!4Hkd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 1272w, https://substackcdn.com/image/fetch/$s_!4Hkd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Hkd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png" width="1378" height="232" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:232,&quot;width&quot;:1378,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Hkd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 424w, https://substackcdn.com/image/fetch/$s_!4Hkd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 848w, https://substackcdn.com/image/fetch/$s_!4Hkd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 1272w, https://substackcdn.com/image/fetch/$s_!4Hkd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f6bbeb-11ba-4b70-805e-3bd10d3eab27_1378x232.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Both CLI commands make it easy to experiment, analyze, and detect issues before developers even submit pull requests.</p><h3>Documentation</h3><p>When presenting an issue, along with a brief description, we always include a link to more detailed documentation. 
The issue page typically contains a longer description, examples, extracts from official documentation, code snippets, and links to past outages and their associated postmortems. This documentation is critical for the smooth adoption of Nuance. By educating developers and improving their awareness, we teach them to trust the tool, preventing it from being viewed as a noisy distraction.</p><h2>Results</h2><p>At CloudKitchens, Nuance has been running in production for almost a year. In its current state, its ruleset encodes lessons from our past outages and checks for common anti-patterns and flaws described in the Postgres (e.g., <a href="https://wiki.postgresql.org/wiki/Don't_Do_This">here</a>) and <a href="https://www.cockroachlabs.com/docs/stable/">CockroachDB documentation</a>.</p><p>Current statistics show that 56% of our pull requests with schema changes initially contain issues, 23% of which are critical (guaranteed to cause outages, either immediately or later on). Now that developers learn about these issues through tooling, we no longer see the same outages repeat. 
As our infrastructure evolves, we will continuously improve and augment our ruleset.</p>]]></content:encoded></item><item><title><![CDATA[Multi-Channel Multi-Location Menu Management]]></title><description><![CDATA[How Atoms enables seamless management of menus across multiple channels and locations]]></description><link>https://techblog.atoms.co/p/multi-channel-multi-location-menu</link><guid isPermaLink="false">https://techblog.atoms.co/p/multi-channel-multi-location-menu</guid><pubDate>Mon, 16 Sep 2024 16:43:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/37f6eb20-b5db-4dca-ad68-6ae96b3c7898_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9sWX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9sWX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!9sWX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!9sWX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9sWX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9sWX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141226,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9sWX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!9sWX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!9sWX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9sWX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452626f7-e59a-47f6-a580-37a9cde00e88_3840x2160.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Introduction</strong></h1><p>In the modern age, orders are placed from multiple channels &#8211; delivery apps, online ordering storefronts, in-store, and more &#8211; so having a best-in-class digital menu is equally as critical as a restaurant&#8217;s physical storefront. 
We build technology that allows restaurants to easily create and manage their menus in both the physical and digital space, enabling them to succeed across all channels.</p><h1><strong>Menus are easy</strong></h1><p>The concept of a menu is simple. It is a catalog of food the restaurant can make and its associated prices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!90yO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!90yO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 424w, https://substackcdn.com/image/fetch/$s_!90yO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 848w, https://substackcdn.com/image/fetch/$s_!90yO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!90yO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!90yO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png" width="1311" height="1600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1600,&quot;width&quot;:1311,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!90yO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 424w, https://substackcdn.com/image/fetch/$s_!90yO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 848w, https://substackcdn.com/image/fetch/$s_!90yO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!90yO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4557d305-733e-4a4c-b30c-65338bcb0bed_1311x1600.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>&#8230;But menus aren&#8217;t so easy</em></p><p>There&#8217;s more to a menu than meets the eye. Let&#8217;s look at a basic digital menu designed to mirror a restaurant&#8217;s physical menu.</p><p>Much like their physical counterparts, digital menus are expected to have some structure.</p><p>Generally, this means a menu groups the items it sells into categories, such as &#8220;Entrees&#8221;, &#8220;Sides&#8221;, and &#8220;Drinks&#8221;. Some restaurants may offer add-ons on specific items, like adding a side of fries to a burger. Some of these add-ons may even have their own add-ons, such as adding melted cheese to the side of fries. 
Some restaurants even sell different menus depending on the time of day, such as a lunch menu and a dinner menu.</p><h1><strong>How CloudKitchens helps restaurants create user-friendly menus across channels</strong></h1><p>When we learned that menus <em>aren&#8217;t easy</em>, we began building solutions to make setting up and managing menus across multiple channels a better experience for the restaurateur. Since a good menu requires categories, add-ons, add-ons to the add-ons, and perhaps even a different menu depending on the time of day, we developed a core set of four entity types, stored in a tree-like structure, to model these menus: menus, categories, items, and modifier groups.<br><br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MtpQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MtpQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!MtpQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!MtpQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MtpQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MtpQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MtpQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!MtpQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!MtpQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MtpQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9f4086-147b-4543-bc0b-ae12ed92c8c4_1600x900.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Separating these entities lets us store information specific to each type. For example, items often have photos and prices, menus have the hours during which they are sold, and modifier groups may restrict how many of their modifier items can be selected.</p><p>In our model, we allow entities to be re-used without creating duplicates. 
This reduces the workload for restaurateurs managing a menu: information about an entity needs to be updated only once, and the change appears everywhere the entity is referenced.</p><h1><strong>The CloudKitchens digital menu solution in action</strong></h1><p>Let&#8217;s look at an example. Say we&#8217;re making a menu for a fast-food burger chain: Burger Prince. Many items can take sauces (burgers, fries, onion rings, etc.), and the sauce options are the same for all of them. Rather than specifying every possible sauce for every item, we can create a single modifier group titled &#8220;Add a sauce&#8221; with all the sauces underneath and reference the same modifier group several times. If we would like to introduce a new sauce, we can just add it once to the &#8220;Add a sauce&#8221; modifier group, and it will be available on every item with the modifier group attached.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y3Pq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y3Pq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 424w, https://substackcdn.com/image/fetch/$s_!y3Pq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 848w, https://substackcdn.com/image/fetch/$s_!y3Pq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 1272w, 
https://substackcdn.com/image/fetch/$s_!y3Pq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y3Pq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png" width="1327" height="809" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:1327,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y3Pq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 424w, https://substackcdn.com/image/fetch/$s_!y3Pq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 848w, https://substackcdn.com/image/fetch/$s_!y3Pq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 1272w, 
https://substackcdn.com/image/fetch/$s_!y3Pq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa577c62e-a38f-4cc0-8bbf-b393cd90f97f_1327x809.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>While this reuse helps in most cases, it adds complexity when an entity should have different properties depending on where it appears on a menu. This comes up most often with pricing.</p><p>Let&#8217;s look at another example from our existing burger restaurant, Burger Prince. 
&#8220;French Fries&#8221; can be ordered as a stand-alone item on the menu or as an add-on for a burger combo meal. When ordered as a stand-alone item, French Fries should cost $2, but when ordered as part of a combo, the price drops to $1.</p><p>To support this, we built menu-path-based pricing: the price of an item such as &#8220;French Fries&#8221; can depend on the path by which it is reached, so we can set different prices depending on whether it&#8217;s ordered through the modifier group of a burger combo or directly from a category of the menu.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ybf3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ybf3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 424w, https://substackcdn.com/image/fetch/$s_!Ybf3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 848w, https://substackcdn.com/image/fetch/$s_!Ybf3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!Ybf3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ybf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png" width="1456" height="1023" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb867401-b41b-44af-bbe5-22173153573c_1600x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1023,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ybf3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 424w, https://substackcdn.com/image/fetch/$s_!Ybf3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 848w, https://substackcdn.com/image/fetch/$s_!Ybf3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!Ybf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb867401-b41b-44af-bbe5-22173153573c_1600x1124.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1><strong>Menus are hard</strong></h1><p>Now that we&#8217;ve examined how we model menus internally, let&#8217;s add another layer of complexity.</p><p><strong>Adapting menus across different channels</strong></p><p>While we offer an online, <a href="https://tryotter.com/products/online-ordering">direct-to-consumer storefront</a>, the menus created with our technology will likely also appear on external delivery channels (e.g., Uber Eats, DoorDash, Grubhub) so the restaurant can take orders through those channels.</p><p>Because every delivery channel models its menus differently and stores different information per entity, we must be able to adapt our internal menu to theirs and vice versa. 
This requires our menu model to be a superset of the information available on these channels without overcomplicating things for restaurateurs.</p><p>Each delivery channel may support different features compared to our internal menu. For example, some channels may not support nested modifiers (add-ons to add-ons), meaning we&#8217;ll have to copy and automatically unnest modifiers when we convert our menu data to the external service's format. Some channels may not support the same item appearing both in a category and as an add-on to another item. For those cases, we have to split the item into two copies on the external menu, both of which link back to the same item on our internal menu.</p><p>We have a complex integration suite for adapting internal menus to external menus, which we&#8217;ll discuss in detail in a future post.</p><p><strong>Managing menus across different channels</strong></p><p>Since delivery channels often charge restaurants a commission on every order, a restaurateur may decide to alter their menu pricing depending on the ordering channel. This means we must now model price differences across delivery channels in addition to price differences based on where an item appears on a menu (e.g., a standalone order of fries or fries added onto a combo).</p><p>One of the core features we offer customers is inventory management: the ability to mark items as out of stock when they are sold out and as back in stock once inventory is replenished. This allows restaurants to quickly toggle which items are being sold during day-to-day operations without logging in to each channel and editing every menu.</p><p>For this feature to be effective, once a user marks an item as available or unavailable on our internal menu, we must quickly and accurately propagate this information to all connected delivery channels so that the equivalent items are marked there as well. 
We do this by storing links between each internal entity and its corresponding external entity on every channel where the menu is published, so our system can propagate the availability update to the correct item.</p><h1><strong>Menus at scale are really hard</strong></h1><p>But let&#8217;s take it a step further. Many of our customers operate not just on multiple channels but also at multiple locations. Menus are usually the same across these locations, so ideally we should not have to manage a separate copy of the menu for each one. <br><br>Let&#8217;s say Burger Prince has 50 locations. With one menu per location, if we wanted to start selling a new seasonal item everywhere, we&#8217;d have to go to every single location&#8217;s menu and make the same change 50 times. That would be a lot of work!</p><p><strong>Meet template menus</strong></p><p>To alleviate the issue above, we created <strong>template menus</strong>. 
Template menus are meant to model a digital menu that can be used at multiple locations, so restaurants managing a menu sold at multiple locations can make all their changes once on a single template menu.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LIV0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LIV0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 424w, https://substackcdn.com/image/fetch/$s_!LIV0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 848w, https://substackcdn.com/image/fetch/$s_!LIV0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 1272w, https://substackcdn.com/image/fetch/$s_!LIV0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LIV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png" width="1405" height="1391" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1391,&quot;width&quot;:1405,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LIV0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 424w, https://substackcdn.com/image/fetch/$s_!LIV0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 848w, https://substackcdn.com/image/fetch/$s_!LIV0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 1272w, https://substackcdn.com/image/fetch/$s_!LIV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05da7c6b-9f63-4b34-962a-17a8f4544eb9_1405x1391.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>While locations sharing a template menu usually share most of their information, many aspects of the menu can be customized per location. This means that sources of complexity, like per-channel pricing, are potentially multiplied by the number of locations using a template menu.</p><p><strong>Menu differences across multiple locations</strong></p><p>By default, a single price may be set for an item on a template menu, and that same price will propagate to all locations. However, menu items are often priced differently at different locations, so we have to manage price per location, price per delivery channel, and price based on where the item appears on the menu.</p><p>A restaurateur managing a template menu may want to mark an item as unavailable across all locations. 
Also, it&#8217;s common for a restaurant to run out of stock of an item for the day at just a few locations, so the item should only be marked unavailable at a subset of locations.</p><p>This means we must store a link from each item on the template menu to every item across all delivery channels for each location so we can manage its availability at any number of locations with a single button click.&nbsp;</p><p>Menus can also be generally different per location. Some items or categories may be sold at some locations and not others. For example, global fast-food franchises usually sell a few specialty items based on the region in which the restaurant is operating.</p><p><strong>Menus are still just one part of the larger picture</strong></p><p>We&#8217;ve explored some of the complexities menus may present, but we must remember that menus are only one part of the restaurant ecosystem.</p><p>In addition to creating an easy-to-use menu management experience, our menu system must integrate with many other systems, such as online order management, POS (point of sale), KDS (kitchen display systems), printers, business intelligence reporting, analytics, promotions, and inventory/supply chain.</p><p>Here&#8217;s an example of a simplified order interaction involving all these pieces:<br><br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mIwM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mIwM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 424w, 
https://substackcdn.com/image/fetch/$s_!mIwM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 848w, https://substackcdn.com/image/fetch/$s_!mIwM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 1272w, https://substackcdn.com/image/fetch/$s_!mIwM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mIwM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png" width="1456" height="543" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:543,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mIwM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 424w, 
https://substackcdn.com/image/fetch/$s_!mIwM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 848w, https://substackcdn.com/image/fetch/$s_!mIwM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 1272w, https://substackcdn.com/image/fetch/$s_!mIwM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ad900c-e6f2-484b-b712-fd321aa94a3f_1600x597.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>At a high level, when an order comes in from a third-party channel like DoorDash, Uber Eats, or Grubhub, our system must be able to properly match the items in the order to items on the internal menu. These orders must then be converted to a kitchen-readable format and sent to the KDS (kitchen display systems) or printers for chefs to prepare. At the same time, if the storefront has a physical point-of-sale device for taking orders, we must also inject any online orders into their point-of-sale device so they can manage their entire restaurant from one channel. On top of that, other integrated apps like business intelligence or inventory management must also be able to understand what items are being ordered and consumed so we can properly decrease inventory count for our users.</p><p>Each of these interactions is very complex, so stay tuned for the next few posts, where we&#8217;ll explore the details of some of these challenges.</p>]]></content:encoded></item><item><title><![CDATA[From Fragile to Faultless: Kubernetes Self-Healing In Practice]]></title><description><![CDATA[Overcoming imperfections of managed Kubernetes with early self-healing.]]></description><link>https://techblog.atoms.co/p/kubernetes-self-healing</link><guid isPermaLink="false">https://techblog.atoms.co/p/kubernetes-self-healing</guid><pubDate>Sat, 15 Jun 2024 14:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pIbX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pIbX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pIbX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 424w, https://substackcdn.com/image/fetch/$s_!pIbX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 848w, 
https://substackcdn.com/image/fetch/$s_!pIbX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 1272w, https://substackcdn.com/image/fetch/$s_!pIbX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pIbX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png" width="1456" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274179,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pIbX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 424w, https://substackcdn.com/image/fetch/$s_!pIbX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 848w, 
https://substackcdn.com/image/fetch/$s_!pIbX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 1272w, https://substackcdn.com/image/fetch/$s_!pIbX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11e401f2-cdd8-440b-89eb-7f6eaeeb3aae_1689x990.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Written by Grzegorz G&#322;&#261;b and Nibir Bora, members of the engineering teams that work on core 
infrastructure.</em></p><p>Many organizations opt for managed Kubernetes distributions like Azure Kubernetes Service (AKS) to get up and running quickly without needing a large engineering team to operate Kubernetes clusters. This is a core design principle of the Core Infrastructure team at City Storage Systems. However, over the years, we&#8217;ve learned that the true operating cost of managed Kubernetes distributions is not, in fact, zero.</p><p>Even public clouds experience occasional failures. Hardware faults, kernel misconfigurations, network bottlenecks, problematic rollouts, resource scarcity, and security vulnerabilities can lead to complications lasting minutes or, in some cases, weeks. In this post, we share our experience of how minor glitches, if left unattended, can quickly escalate and impact business continuity.</p><p>Rather than engaging in constant firefighting, we designed a self-healing framework, often implementing new automations with a turnaround time of as little as one day. Some of these automations were temporary fixes until the cloud provider resolved the underlying issue; others became permanent enhancements to our platform&#8217;s reliability. While our journey began with a focus on AKS, the framework is a general-purpose pattern for improving the resilience of any Kubernetes platform.</p><h1>The Self-Healing Framework</h1><p>The first self-healing use case was implemented as a monolithic program. As we added new use cases, however, we identified several reusable libraries, which nudged us to organize the code into a framework. Today, the framework consists of <em>Automations</em>, each of which addresses a specific failure mode. 
Each automation is implemented as an independent <em>Detector</em> and <em>Fixer</em>, each of which is either a <a href="https://kubernetes.io/docs/concepts/architecture/controller/">controller</a> or a <a href="https://go.dev/">Go</a> program.</p><p><em>Detectors</em> are responsible for collecting signals and flagging failure conditions. There are two types of detectors: cluster-level (deployed as a Deployment) and node-level (deployed as a DaemonSet). Cluster-level detectors monitor for cluster-wide failure events and have permissions to watch or create API server resources. Node-level detectors monitor for node-level failures (e.g. misconfigured OS flags, image pull issues, or missing systemd services) and have privileged host access.</p><p><em>Fixers</em> complement detectors by executing remediation steps to rectify or clean up failure states. As with detectors, there are two types of fixers: cluster-level (Deployment) and node-level (DaemonSet). Cluster-level fixers execute remediation actions that operate on cluster-level resources and have permissions to watch API server resources. 
Node-level fixers execute remediation actions at the node level, or operations that require privileged host access.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VRCh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VRCh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 424w, https://substackcdn.com/image/fetch/$s_!VRCh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 848w, https://substackcdn.com/image/fetch/$s_!VRCh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 1272w, https://substackcdn.com/image/fetch/$s_!VRCh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VRCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png" width="1456" height="755" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VRCh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 424w, https://substackcdn.com/image/fetch/$s_!VRCh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 848w, https://substackcdn.com/image/fetch/$s_!VRCh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 1272w, https://substackcdn.com/image/fetch/$s_!VRCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed218a4-bc4e-4a4d-be5e-458df638bde0_1600x830.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Diagram showing architecture of the self-healing framework operating in a Kubernetes cluster.</figcaption></figure></div><p>Generalizing as such allows us to keep the framework simple and appropriately isolate permissions. This was key to swiftly adding new automations when needed. Whenever we identify a new degradation, we implement and deploy the corresponding detector and fixer across all our clusters. 
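</p><p>As a rough illustration of the node-level half of this split, a detector could be deployed with a manifest along the following lines. This is a minimal sketch under stated assumptions, not our actual configuration: the names, namespace, labels, ServiceAccount, and image are hypothetical placeholders, and the privileged settings simply mirror the host access that node-level components are described as needing.</p><pre><code># Hypothetical node-level detector; every name and the image are placeholders.
apiVersion: apps/v1
kind: DaemonSet                # one detector pod per node
metadata:
  name: example-node-detector
  namespace: self-healing
spec:
  selector:
    matchLabels:
      app: example-node-detector
  template:
    metadata:
      labels:
        app: example-node-detector
    spec:
      serviceAccountName: example-node-detector   # assumed: RBAC scoped to this automation only
      hostPID: true            # lets the detector see host processes, e.g. systemd services
      tolerations:
        - operator: Exists     # tolerate all taints so the detector also runs on tainted nodes
      containers:
        - name: detector
          image: registry.example.com/self-healing/node-detector:latest
          securityContext:
            privileged: true   # privileged host access for node-level checks</code></pre><p>A cluster-level detector or fixer would, under the same assumptions, be a Deployment with no host privileges and a ServiceAccount bound only to the API server resources it needs, which is what keeps permissions isolated per automation.</p><p>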
The following automations are a few examples that shielded our internal developers and applications from potential impact, and also significantly reduced our team&#8217;s support toil - cutting it in half from 30% of engineering time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cy4F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cy4F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 424w, https://substackcdn.com/image/fetch/$s_!Cy4F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 848w, https://substackcdn.com/image/fetch/$s_!Cy4F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 1272w, https://substackcdn.com/image/fetch/$s_!Cy4F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cy4F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png" width="1456" height="743" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:743,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cy4F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 424w, https://substackcdn.com/image/fetch/$s_!Cy4F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 848w, https://substackcdn.com/image/fetch/$s_!Cy4F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 1272w, https://substackcdn.com/image/fetch/$s_!Cy4F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f1cacbd-c468-46ca-a8fd-928b02eb2362_2290x1169.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: List of self-healing automations that are currently active on our Kubernetes platform.</figcaption></figure></div><p>Building these automations over the last year and a half has taught us some key lessons:</p><ol><li><p><strong>Kubernetes is not the end product</strong>. It is a framework for building platforms. Managed Kubernetes still greatly benefits from business specific customizations that create leverage for developers.</p></li><li><p><strong>Abstractions don&#8217;t erase underlying layers</strong>. Kubernetes, while powerful, still requires us to dive into the host VM or kernel layer for deep debugging - a skill set we've honed through experience.</p></li><li><p><strong>Cost optimization has a reliability tax</strong>. Optimizing cost needs to be counterbalanced with increased vigilance on reliability. 
For example, running stateful workloads on Spot nodes required us to invest further in automation.</p></li><li><p><strong>Cloud bugs cannot be predicted</strong>. Instead of anticipating imaginary failure scenarios, it&#8217;s better to optimize for the speed of diagnosing unforeseen issues and implementing automation for them. For example, we consolidated all node malfunction signals on a single &#8220;node inspector&#8221; dashboard, empowering our developers to respond swiftly when paged.</p></li></ol><p>In the following sections, we describe a few of the automations in detail, covering how each failure mode was identified and how we automated its self-healing.</p><h1>Handling Abrupt Spot Node Preemptions</h1><p>We use Spot nodes extensively on our Kubernetes platform to optimize resource costs, running both stateless and less critical stateful workloads. However, Spot nodes on AKS <a href="https://learn.microsoft.com/en-us/azure/aks/spot-node-pool">lack any SLA</a>, which means they can be preempted abruptly. We experienced an incident in which a large number of Spot node preemptions caused multiple stateful workloads to fail, triggering cascading application failures and downtime.</p><p>When Spot nodes on AKS are preempted, a <em>scheduled preemption</em> event is emitted 30 seconds before the underlying VM is abruptly removed. The node isn&#8217;t cordoned, workloads aren&#8217;t gracefully shut down, and the Node isn&#8217;t deregistered from the Kubernetes API server. The Node object remains without a physical VM (see issue <a href="https://github.com/Azure/AKS/issues/3528">#3528</a>) until it is cleaned up after 5 minutes due to failed heartbeats. When this happens, stateless workload pods (controlled by Deployments and ReplicaSets) are automatically rescheduled, but StatefulSet pods are not. 
StatefulSet pods leave behind &#8220;phantom&#8221; pod objects (with <code>.status.phase: Unknown</code>) in the API server, which is not acceptable behavior for our stateful workloads.</p><p>To address this, we implemented a self-healing automation that intercepts Spot node preemption signals and gracefully evicts all pods on the affected node. A detector watches for <code>VMEventScheduled</code> node conditions (example below) and creates a <code>SpotNodePreemption</code> <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">Custom Resource</a> (CR) carrying the details the fixer needs. The fixer then evicts the pods with a 10-second grace period.</p><pre><code>.status.conditions: [
    {
        status: "True",
        type: "VMEventScheduled",
        reason: "VMEventScheduled",
        message: "Preempt Scheduled : Tue, 14 May 2024 12:57:00 GMT",
        lastHeartbeatTime: "2024-05-14T12:56:43Z",
        lastTransitionTime: "2024-05-14T12:56:42Z"
    }
]</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S0zO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S0zO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 424w, https://substackcdn.com/image/fetch/$s_!S0zO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 848w, https://substackcdn.com/image/fetch/$s_!S0zO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 1272w, https://substackcdn.com/image/fetch/$s_!S0zO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S0zO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png" width="1456" height="220" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S0zO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 424w, https://substackcdn.com/image/fetch/$s_!S0zO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 848w, https://substackcdn.com/image/fetch/$s_!S0zO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 1272w, https://substackcdn.com/image/fetch/$s_!S0zO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3194da73-7ca8-46be-9e44-cfc97b9dc784_1496x226.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 3: Example timeline of Kubernetes events for a pod that was scheduled on a preempted Spot node.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!9uBu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9uBu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 424w, https://substackcdn.com/image/fetch/$s_!9uBu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 848w, https://substackcdn.com/image/fetch/$s_!9uBu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 1272w, https://substackcdn.com/image/fetch/$s_!9uBu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9uBu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png" width="938" height="292" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:292,&quot;width&quot;:938,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9uBu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 424w, https://substackcdn.com/image/fetch/$s_!9uBu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 848w, https://substackcdn.com/image/fetch/$s_!9uBu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 1272w, https://substackcdn.com/image/fetch/$s_!9uBu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b40c32-cf98-4ee8-99e1-1f8d2ec37929_938x292.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: High volume of Spot node preemptions around November 2023.</figcaption></figure></div><p>Once this automation was operationalized, we noticed that some Spot nodes were still terminated without a <em>scheduled preemption</em> event. This happened because the request that Node Problem Detector (NPD) makes to the Azure Metadata Service for the <code>VMEventScheduled</code> event occasionally fails, resulting in a <code>NoVMEventScheduled</code> node condition (example below). To handle this, we added another self-healing automation to clean up after terminated Spot nodes when the preemption event wasn&#8217;t intercepted. The detector creates a <code>SpotNodeDeletion</code> CR when a Spot Node object is deleted from the API server, and the fixer force deletes all pod objects on that node, assuming they are no longer reachable.</p><pre><code>.status.conditions: [
    {
        ...
        type: "Unknown",
        reason: "NoVMEventScheduled",
        message: "Timeout when running plugin \"/etc/node-problem-detector.d/plugin/check_scheduledevent_consolidated.sh\": state - signal: killed. output - \"\""
    }
]</code></pre><h1>Handling StatefulSet pods on Unreachable Nodes</h1><p>AKS node pools are built on Azure Virtual Machine Scale Sets (VMSS) infrastructure. We observed that VM failures in the VMSS layer often make AKS Nodes unreachable. When this happens, the node controller adds a <code>NoExecute</code> taint, and all pods on the node are evicted after 5 minutes. While stateless pods are rescheduled automatically, StatefulSet pods are not (see issue <a href="https://github.com/kubernetes/kubernetes/issues/54368">#54368</a>, and <a href="https://github.com/kubernetes/design-proposals-archive/blob/main/storage/pod-safety.md#avoid-multiple-instances-of-pods">design proposal</a>). This can lead to data loss caused by under-replication in stateful workloads like CockroachDB or OpenSearch.</p><p>To address this, we implemented a self-healing automation that watches the Kubernetes API server for Node objects with a <code>node.kubernetes.io/unreachable</code> taint. The detector filters nodes tainted for more than 5 minutes, and the fixer force deletes all pods (assuming they are unrecoverable) on these nodes, allowing new pods to be scheduled.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eKwh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eKwh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 424w, 
https://substackcdn.com/image/fetch/$s_!eKwh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 848w, https://substackcdn.com/image/fetch/$s_!eKwh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 1272w, https://substackcdn.com/image/fetch/$s_!eKwh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eKwh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png" width="936" height="292" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:292,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eKwh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 424w, 
https://substackcdn.com/image/fetch/$s_!eKwh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 848w, https://substackcdn.com/image/fetch/$s_!eKwh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 1272w, https://substackcdn.com/image/fetch/$s_!eKwh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f1641e-8000-4905-b1d4-58f265cd61f7_936x292.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: Daily unreachable nodes detected (last 3 months).</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ogX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ogX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 424w, https://substackcdn.com/image/fetch/$s_!8ogX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 848w, https://substackcdn.com/image/fetch/$s_!8ogX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 1272w, https://substackcdn.com/image/fetch/$s_!8ogX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ogX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png" width="937" height="294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32872e9d-09f0-46bc-9726-069f922664c2_937x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:937,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8ogX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 424w, https://substackcdn.com/image/fetch/$s_!8ogX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 848w, https://substackcdn.com/image/fetch/$s_!8ogX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 1272w, https://substackcdn.com/image/fetch/$s_!8ogX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32872e9d-09f0-46bc-9726-069f922664c2_937x294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 6: Daily pods deleted by fixer for unreachable nodes (last 3 months).</figcaption></figure></div><h1>Cleaning Up Succeeded and Evicted Pods</h1><p>While investigating a cluster health degradation due to increased etcd disk size, we identified the accumulation of <code>Succeeded</code> and evicted pods as a significant factor. The <code>Succeeded</code> pods were left behind by short-lived cron jobs and by pods without a controller (e.g. Flink jobs). By default, kube-controller-manager garbage-collects terminated pods only once their cluster-wide count exceeds a high threshold, which is a problem on our large multi-tenant clusters. This threshold can be changed with the <code>--terminated-pod-gc-threshold</code> flag. 
However, because we use managed Kubernetes, the control plane is operated by the cloud provider and is not user-configurable.</p><p>To address this, we implemented a self-healing automation that monitors the Kubernetes API server for pods that have either <code>status.phase = Succeeded</code>, or <code>status.phase = Failed</code> together with <code>status.reason = Evicted</code>. The detector flags pods that have remained in these phases for at least 15 minutes. This threshold is configurable per namespace. The corresponding fixer deletes these flagged pods from the API server.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOxB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOxB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 424w, https://substackcdn.com/image/fetch/$s_!xOxB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 848w, https://substackcdn.com/image/fetch/$s_!xOxB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 1272w, https://substackcdn.com/image/fetch/$s_!xOxB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xOxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png" width="937" height="293" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:293,&quot;width&quot;:937,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xOxB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 424w, https://substackcdn.com/image/fetch/$s_!xOxB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 848w, https://substackcdn.com/image/fetch/$s_!xOxB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 1272w, https://substackcdn.com/image/fetch/$s_!xOxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea333b69-e7d8-4cec-a760-b8a5f68a4e47_937x293.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7: Daily pods cleaned up by fixer for succeeded &amp; evicted pods (last 3 months).</figcaption></figure></div><h1>Handling Network Packet Drops Due to Unbalanced IRQ</h1><p>We noticed increased packet drop rates in network IO-intensive workloads, which we initially attributed to application errors. However, we saw that the nodes running the affected workloads had <code>VMFreezeEvents</code> (see the Azure <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events#event-properties">docs</a>). Investigation showed that hardware interrupts from the node&#8217;s network interface were being handled by only 2 of 8 CPU cores, driving those cores to 100% utilization (see the detailed investigation in this <a href="https://zmalik.dev/posts/packet-drop">blog post</a>). 
Restarting the <code>irqbalance</code> service, which should distribute interrupts evenly, resolved the issue.</p><p>To address this, we implemented a self-healing automation that flags nodes where fewer than half of the CPU cores are configured to handle interrupts from the network interface. This is done by checking <code>/proc/irq/IRQ#/smp_affinity</code>, a bitmask of the CPU cores allowed to handle a given interrupt request (IRQ). The corresponding fixer restarts the <code>irqbalance</code> systemd service on the host VM. We also expose the number of cores used for IRQ handling per node as a metric for continued observability. The upstream issue was subsequently fixed in newer versions of Ubuntu (see bug <a href="https://bugs.launchpad.net/ubuntu/+source/irqbalance/+bug/2038573">#2038573</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!47vD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!47vD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 424w, https://substackcdn.com/image/fetch/$s_!47vD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 848w, https://substackcdn.com/image/fetch/$s_!47vD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 1272w, 
https://substackcdn.com/image/fetch/$s_!47vD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!47vD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png" width="936" height="292" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:292,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!47vD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 424w, https://substackcdn.com/image/fetch/$s_!47vD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 848w, https://substackcdn.com/image/fetch/$s_!47vD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 1272w, 
https://substackcdn.com/image/fetch/$s_!47vD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b76bcb6-3d32-444c-b8de-cd57d93fa7cf_936x292.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8: A recent spike in nodes with Unbalanced IRQ (post upstream fix).</figcaption></figure></div><p>Despite this fix, some packet drops persisted. We traced them to a backlog in the network interface's receive queue: packets were dropped whenever the maximum queue size was set below 10000. 
To address this, we implemented another automation that flags nodes where <code>net.core.netdev_max_backlog</code> is less than 10000. The corresponding fixer raises it to 10000 on the host VM.</p><h1>Addressing Failing <code>nftables</code> During OS Image Migration</h1><p>While migrating our nodes from Ubuntu to <a href="https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux">Azure Linux</a>, we noticed that <a href="https://nftables.org/">nftables</a> wasn&#8217;t running on the migrated nodes. Kubernetes relies on <code>nftables</code> on the host VM for inter-pod routing rules on the node and for egress traffic. With <code>nftables</code> down, Network Policies were not applied correctly, leading to intermittent network failures on nodes. After investigation, we identified that this was due to a missing newline in the <code>nftables.conf</code> file (see issues <a href="https://github.com/Azure/AKS/issues/4144">#4144</a> and <a href="https://github.com/microsoft/azurelinux/issues/7301">#7301</a>, and pull request <a href="https://github.com/microsoft/azurelinux/pull/8310">#8310</a>).</p><p>To address this, we implemented a self-healing automation that flags nodes where the host VM isn&#8217;t running <code>nftables</code>. 
The corresponding fixer corrects the <code>nftables.conf</code> file by appending a newline to the end, then restarts the <code>nftables</code> systemd service.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rr1T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rr1T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 424w, https://substackcdn.com/image/fetch/$s_!rr1T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 848w, https://substackcdn.com/image/fetch/$s_!rr1T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 1272w, https://substackcdn.com/image/fetch/$s_!rr1T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rr1T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png" width="936" height="294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rr1T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 424w, https://substackcdn.com/image/fetch/$s_!rr1T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 848w, https://substackcdn.com/image/fetch/$s_!rr1T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 1272w, https://substackcdn.com/image/fetch/$s_!rr1T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3268575d-af8e-40ec-960f-f3e8f7c617a2_936x294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9: Number of nodes with failing <code>nftables</code> until the fixer was deployed.</figcaption></figure></div><h1>Addressing <code>node-problem-detector</code> Missing on Nodes</h1><p>AKS <a href="https://learn.microsoft.com/en-us/azure/aks/faq#what-is-the-purpose-of-the-aks-linux-extension-i-see-installed-on-my-linux-virtual-machine-scale-sets-instances">runs</a> <a href="https://github.com/kubernetes/node-problem-detector">node-problem-detector</a> (NPD) to monitor <a href="https://learn.microsoft.com/en-us/azure/aks/node-problem-detector">node health</a> and flag malfunctioning nodes for removal. It runs 10 checks every 30 seconds and reports the results as node conditions. We integrated these conditions into our observability stack. During a workload failure investigation, we noticed that a node had only 4 status conditions instead of the usual 14 (10 from NPD and 4 from the kubelet). This led us to discover that NPD wasn&#8217;t running on the node. 
The workload failed because the Container Runtime Interface (CRI) malfunctioned on the node, preventing the kubelet from verifying workload status.</p><p>We implemented a self-healing detector that flags nodes where NPD isn&#8217;t running. Further analysis revealed that 25% of our nodes had this issue. Automatically terminating these nodes was deemed too risky. Instead, we rolled back the node OS on all of our nodes to a previously working version and escalated the issue to the cloud provider (see issue <a href="https://github.com/Azure/AKS/issues/3988">#3988</a>). The problem was later attributed to an upstream <a href="https://security.snyk.io/vuln/SNYK-WOLFILATEST-NODEPROBLEMDETECTOR-5862811">CVE</a>, which has since been fixed. We also set up automated alerts for nodes without NPD to catch any recurrence early.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bZLy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bZLy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 424w, https://substackcdn.com/image/fetch/$s_!bZLy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 848w, https://substackcdn.com/image/fetch/$s_!bZLy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bZLy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bZLy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png" width="1456" height="523" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bZLy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 424w, https://substackcdn.com/image/fetch/$s_!bZLy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 848w, https://substackcdn.com/image/fetch/$s_!bZLy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bZLy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52189f1e-c380-4a08-acec-21cbf643a276_1600x575.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 10: Nodes without <code>node-problem-detector</code> detected.</figcaption></figure></div><h1>Mitigating <code>ImagePullBackOff</code> Errors for Large Container Images</h1><p>We faced a surge in <code>ImagePullBackOff</code> errors for workloads with large container images (7-10GB). 
The kubelet error messages (example below) were unhelpful, and workloads failed to start for hours. Manual eviction sometimes helped after multiple retries. An unrelated experiment benchmarking write speeds on <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types">Azure Managed OS disk</a> and <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/ephemeral-os-disks">Ephemeral OS disk</a> led us to discover that the failures occurred exclusively on nodes with Managed OS disks.</p><pre><code>Oct 31 11:57:43 aks-nodepool0392-17898922-vmss0000LX kubelet[2874]: E1031 11:57:43.120279    2874 remote_image.go:242] "PullImage from image service failed" err="rpc error: code = Canceled desc = failed to pull and unpack image \"cssacrprod.azurecr.io/chronorepo-companion-cron:efdc4a316aebcc878c38483b09bb939524dbd94a\": failed to commit snapshot extract-332345855-2JgL sha256:d2bd2b7dd52900b17c2e8d2f50d94273892a45d96a760f078aeb58bc54fbc160: context canceled" image="cssacrprod.azurecr.io/chronorepo-companion-cron:efdc4a316aebcc878c38483b09bb939524dbd94a"</code></pre><p>We implemented a self-healing detector that flags nodes with <code>ImagePullBackOff</code> errors by parsing kubelet logs. We do not yet have an automatic fixer; instead, we emit a custom warning event for each affected pod. Affected workloads can either retry or, if the issue persists, set a node affinity for the label <code>ephemeral-storage = true</code>. All nodes in our platform with Ephemeral OS disks have this label.</p><h1>Conclusion</h1><p>Building out a self-healing solution has allowed us to improve the reliability of our Kubernetes platform without burdening ourselves with operational and support toil. Automation proved to be the right principle for us to scale to <a href="https://techblog.citystoragesystems.com/p/managing-100s-of-kubernetes-clusters">100s of clusters</a>.</p><p>So, what&#8217;s next? 
We are constantly adding new detectors and fixers to our self-healing framework. Low-level networking, noisy-neighbor problems, and CPU core usage optimization are a few examples of areas where we are actively investigating how to detect and rectify problems automatically. Furthermore, we plan to extend the framework beyond platform deficiencies to application deficiencies. We are confident the same mechanics of self-healing are widely applicable. We see self-healing as the only way to make a platform&#8217;s maintenance costs scale sublinearly with business growth, so we&#8217;re serious about investing in it further.</p>]]></content:encoded></item><item><title><![CDATA[Reliable Order Processing]]></title><description><![CDATA[Processing orders in real-time with KEQ while isolating delays and failures]]></description><link>https://techblog.atoms.co/p/reliable-order-processing</link><guid isPermaLink="false">https://techblog.atoms.co/p/reliable-order-processing</guid><pubDate>Fri, 17 May 2024 13:01:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CEif!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CEif!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CEif!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 424w, 
https://substackcdn.com/image/fetch/$s_!CEif!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 848w, https://substackcdn.com/image/fetch/$s_!CEif!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 1272w, https://substackcdn.com/image/fetch/$s_!CEif!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CEif!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png" width="900" height="599" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CEif!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 424w, 
https://substackcdn.com/image/fetch/$s_!CEif!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 848w, https://substackcdn.com/image/fetch/$s_!CEif!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 1272w, https://substackcdn.com/image/fetch/$s_!CEif!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56c3cbf7-ac4a-4def-9492-7126dda97c78_900x599.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Written by Henning Rohde and Jordan Hurwitz, members of the engineering teams that work on infrastructure.</em></p><p>At Atoms, real-time food order fulfillment is at the heart of our business. When a customer places an order, an elaborate orchestration workflow runs behind the scenes to ensure the meal is cooked, assembled, and delivered efficiently and seamlessly.</p><p>What needs to happen &#8211; and when &#8211; depends on the specifics of the order and is time-sensitive, typically involving third-party integrations with varying characteristics. If an order contains multiple items, each has to be <a href="https://techblog.citystoragesystems.com/p/food-prep-time-prediction">prepared at the right time</a>. If an order is prepared in a facility with <a href="https://techblog.citystoragesystems.com/p/robotic-order-conveyance">conveyance robots</a>, it requires additional coordination. Our internal systems listen to a central order event message bus to understand and react to the state of orders in real time. Lost updates or long delays result in a poor experience for both kitchens and customers.</p><p>The nature of order processing presents some unique challenges. To address them, we designed the Keyed Event Queue (KEQ) service as a central message bus for order processing. Today, all orders and order-related events flow through the system alongside other internal traffic at Atoms.</p><h1>Event-driven order processing</h1><p>Order processing is inherently reactive. Our software systems observe and interact with the physical world, where progress is driven by real-world events such as &#8220;customer places an order&#8221;, &#8220;order ready to pick up&#8221;, &#8220;driver arrives&#8221;, and &#8220;item removed from locker&#8221;. 
Event-driven systems are commonly structured around an asynchronous, durable message bus, managed by a <em>message broker</em>.&nbsp;</p><p>The Atoms order processing system follows this approach:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K3Ay!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K3Ay!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 424w, https://substackcdn.com/image/fetch/$s_!K3Ay!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 848w, https://substackcdn.com/image/fetch/$s_!K3Ay!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 1272w, https://substackcdn.com/image/fetch/$s_!K3Ay!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K3Ay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png" width="1456" height="218" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K3Ay!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 424w, https://substackcdn.com/image/fetch/$s_!K3Ay!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 848w, https://substackcdn.com/image/fetch/$s_!K3Ay!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 1272w, https://substackcdn.com/image/fetch/$s_!K3Ay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde3f2b8d-cf59-451f-a2c1-5dc1ba0fc5f2_2730x408.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">The event-driven &#8220;message bus&#8221; architecture</figcaption></figure></div><p>Orders transition through a dozen or more events as part of fulfillment, some of which are generated internally by our systems or devices. 
Each consumer service is responsible for a distinct aspect of fulfillment, such as ticket printing or robot conveyance, and reacts to order updates for that purpose. Order progress is driven by events and actions in concert with each other and the physical world. Many thousands of orders are typically in progress simultaneously.</p><p>In our kitchens, timeliness and event ordering are important. We originally used <a href="https://kafka.apache.org/">Apache Kafka</a>, a widely used, open-source message broker that offers ordered event delivery with high performance. It is a standard choice for event-driven systems.</p><p>However, as order volume grew, our systems started to run afoul of how Kafka works and what it is designed to do well. First, order processing is distinctly heterogeneous and frequently involves third-party integrations that suffer transient failures or delays. Second, our software stack spans multiple regions so that it can survive regional cloud provider failures. Neither characteristic is a good fit for Kafka or similar message brokers.</p><h1>Head-of-Line (HOL) blocking and failure handling</h1><p>With Kafka, the order processing message bus is represented as a <em>topic</em>. A topic acts as a queue where emitted events are delivered to each consumer in the same order. Kafka scales its parallel processing by dividing topics into a fixed number of distributed queues called <em>partitions</em>. Each message provides a key that determines its partition. Event ordering is preserved within partitions. With N partitions, up to N instances of a consumer can process events concurrently.</p><p>Although events are processed concurrently across partitions, each partition still contains the events of thousands of different interleaved orders. And because a partition is processed sequentially (with batching), failure to process any event blocks further progress for the whole partition. 
This situation is known as Head-of-Line (HOL) blocking.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kxt-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kxt-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 424w, https://substackcdn.com/image/fetch/$s_!Kxt-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 848w, https://substackcdn.com/image/fetch/$s_!Kxt-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 1272w, https://substackcdn.com/image/fetch/$s_!Kxt-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kxt-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png" width="1456" height="336" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25407,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kxt-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 424w, https://substackcdn.com/image/fetch/$s_!Kxt-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 848w, https://substackcdn.com/image/fetch/$s_!Kxt-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 1272w, https://substackcdn.com/image/fetch/$s_!Kxt-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fd3783-7e83-493c-9355-e72eca25f532_1476x341.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Interleaved order updates for two orders, illustrating HOL blocking on a shared Kafka partition</figcaption></figure></div><p>The big question is: if a consumer fails or is slow to process an event, what do we do?</p><p>For <em>time-sensitive</em> processing, there are no good simple options. 
We can wait for event processing to succeed, but that may take a while, during which all later events in the partition are stalled. Or we can skip the event and move on, which avoids the delay but may mean we never act on something important. Either option leads to poor experiences and order cancellations.</p><p>To mitigate HOL blocking, consumers must defer failing or slow events one way or another. A common option is to move problematic events to another topic, a so-called Dead-Letter Queue (DLQ), for later processing. A DLQ ensures progress without losing events.</p><p>However, a DLQ introduces other problems: for example, it breaks event ordering when postponed events are processed later than intended. And for transient failures or delays, when exactly should we give up? If we give up too quickly, then even small disruptions create outsized ordering inversions. If we give up too slowly, then processing is still delayed. Such logic is hard to get right when seconds matter. And how do we handle slow or failing DLQ events?</p><p>A DLQ works best when it is immediately apparent that an event is &#8220;dead&#8221; and has no hope of being processed, as with corrupt or invalid data records. This is indeed the case for many Kafka use cases. But for order processing it is not quite so simple.</p><p>The reality with HOL blocking is that time-sensitive processing involves complicated failure handling. While using a DLQ is a standard solution, it solves a different problem. However, since HOL blocking is a consequence of partition mechanics, what if we designed a system to avoid it?</p><h1>KEQ: A different message broker</h1><p>Keyed Event Queue (KEQ) is a new message broker based on a deceptively simple idea to avoid HOL blocking: instead of using N permanent partitions, use a separate, temporary partition for every message key. 
It is the exact ordering we ultimately want for orders: the events of each order are in sequence, but separate orders can progress independently. HOL blocking is then a non-issue, because the scope of each partition is a single order.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3t2p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3t2p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 424w, https://substackcdn.com/image/fetch/$s_!3t2p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 848w, https://substackcdn.com/image/fetch/$s_!3t2p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 1272w, https://substackcdn.com/image/fetch/$s_!3t2p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3t2p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png" width="1456" height="837" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:837,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47578,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3t2p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 424w, https://substackcdn.com/image/fetch/$s_!3t2p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 848w, https://substackcdn.com/image/fetch/$s_!3t2p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 1272w, https://substackcdn.com/image/fetch/$s_!3t2p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c48f8e-a074-4881-9fa6-855c3d17fc3f_1476x848.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Three independent orders in KEQ without HOL blocking despite event processing failures</figcaption></figure></div><p>There are additional benefits to this design. Standard techniques for handling transient errors apply, such as retries with exponential backoff. Potentially slow actions can be performed synchronously. Consumer logic &#8211; especially failure handling &#8211; becomes far simpler. The problems brought by time-sensitive processing and fixed partitions are largely gone.</p><p>In addition, there is no N to pick and adjust for the number of partitions as order volume grows. The number of partitions is dynamically determined by the data.</p><p>The primary tradeoff with this design is that bookkeeping becomes expensive. Where progress is tracked by a small, fixed number of partition cursors in Kafka for each consumer, KEQ requires millions of such cursors to keep track of the individual progress of each order. 
Consumers are thereby comparatively heavyweight.</p><p>For order processing, neither cost is an issue. Consumers are almost always caught up, and it is desirable that a new event for an order can be sent immediately and processed individually.</p><h1>Multi-region, low-latency message delivery</h1><p>KEQ is a multi-region, high-performance, and scalable message broker for managing a large number of independent, strictly ordered message queues. It provides an at-least-once ordered delivery guarantee as well as a processing exclusivity guarantee with explicit leases. For consumers, this means that events are delivered in order, with no other instance of the same consumer trying to process the same event. This is important when consumer actions involve slow, non-idempotent side effects.</p><p>KEQ uses an active-active, multi-region distributed SQL database to store messages, cursors, and metadata. That choice is a tradeoff: it ensures KEQ can honor its guarantees even during a regional cloud provider outage, but I/O latency is necessarily higher than with a single-region database.&nbsp;</p><p>For performance, KEQ maintains authoritative, sharded in-memory state for each topic using <a href="https://techblog.citystoragesystems.com/p/easy-as-pie-stateful-services-at">Splitter</a>, our coordinator for stateful services. This state allows KEQ to track progress and deliver new messages without reading from the database; order and progress updates can be blind writes. 
New messages are briefly cached and then evicted once all consumers have processed them, usually within seconds.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C5ff!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5ff!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 424w, https://substackcdn.com/image/fetch/$s_!C5ff!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 848w, https://substackcdn.com/image/fetch/$s_!C5ff!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 1272w, https://substackcdn.com/image/fetch/$s_!C5ff!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5ff!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png" width="1120" height="704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5ff!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 424w, https://substackcdn.com/image/fetch/$s_!C5ff!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 848w, https://substackcdn.com/image/fetch/$s_!C5ff!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 1272w, https://substackcdn.com/image/fetch/$s_!C5ff!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48eb28e-ca9a-4995-b2a8-5dde2f24cfa4_1120x704.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Serving post-commit messages from the authoritative in-memory cache</figcaption></figure></div><p>KEQ also optimizes its internal communication, a necessity for low-latency multi-region services. Internally, each topic is divided into key ranges with an explicit region. These key ranges are dynamically assigned by Splitter to KEQ instances in their region, so that database operations may benefit from region locality.</p><p>When a consumer instance connects to KEQ, it is internally assigned to specific ranges and a streaming connection is made to each range owner. New and pending messages are thereafter streamed to the consumer, typically from memory, or from the database if the consumer has fallen too far behind.</p><p>In addition to topic ranges, KEQ runs an exclusive global leader responsible for allocating and load-balancing ranges between connected consumers. 
For instance, if two consumer instances are connected, each is assigned half of the ranges while taking into account which ranges are closest to each consumer. This work allocation is dynamic as instances come and go and factors in both region-affinity and node-affinity to minimize delivery latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DhJX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DhJX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 424w, https://substackcdn.com/image/fetch/$s_!DhJX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 848w, https://substackcdn.com/image/fetch/$s_!DhJX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 1272w, https://substackcdn.com/image/fetch/$s_!DhJX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DhJX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png" width="1456" height="514" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:514,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55322,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DhJX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 424w, https://substackcdn.com/image/fetch/$s_!DhJX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 848w, https://substackcdn.com/image/fetch/$s_!DhJX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 1272w, https://substackcdn.com/image/fetch/$s_!DhJX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0249f7-cee2-41d6-8fe7-4199beaa6c56_1792x633.png 1456w" sizes="100vw"
loading="lazy"></picture></div></a><figcaption class="image-caption">Dynamic allocation of Ranges respecting region and node affinity</figcaption></figure></div><p>The centralized dynamic allocation simplifies the operational model. Consumer instances are ephemeral and need no individual configuration or coordination; they simply connect to make themselves available. Work is assigned to whatever consumers are present. Widespread consumer failures and rapid scale-up are handled equally naturally.</p><p>As a result, KEQ works well with auto-scaled multi-region consumers.</p><h1>Failure-handling revisited</h1><p>KEQ is built for reliable real-time order processing in adverse conditions. Its active-active multi-region design offers elastic scalability without manual regional failover, explicit re-partitioning or temporary ordering violations. 
Order event consumers can freely scale up or down to match the rhythm of the restaurant business.</p><p>KEQ&#8217;s main value is in simplifying how consumers handle failure, transforming failures into delays. With no HOL blocking, consumers can retry transient failures indefinitely without impacting other orders. A DLQ is not needed. Processing failures &#8211; and how they are overcome &#8211; are handled inline. Even consumer code bugs can be fixed in a reasonable time, which lets consumers sidestep complex failure-handling and detection in favor of a trivial retry.</p><p>For widespread system failures, HOL blocking does not matter because no processing will succeed. But consumer code, in practice, never knows which kind of failure is happening, yet it must make a real-time decision. KEQ simplifies that decision. And as Dijkstra put it, "Simplicity is prerequisite for reliability".</p>]]></content:encoded></item><item><title><![CDATA[Data-Driven Automated Marketing for Restaurants]]></title><description><![CDATA[How Otter Marketing helps restaurants to increase revenue]]></description><link>https://techblog.atoms.co/p/automated-marketing-for-restaurants</link><guid isPermaLink="false">https://techblog.atoms.co/p/automated-marketing-for-restaurants</guid><pubDate>Tue, 07 May 2024 13:00:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qfDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qfDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qfDq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 424w, https://substackcdn.com/image/fetch/$s_!qfDq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 848w, https://substackcdn.com/image/fetch/$s_!qfDq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 1272w, https://substackcdn.com/image/fetch/$s_!qfDq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qfDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png" width="310" height="225.29577464788733" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:710,&quot;resizeWidth&quot;:310,&quot;bytes&quot;:88298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!qfDq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 424w, https://substackcdn.com/image/fetch/$s_!qfDq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 848w, https://substackcdn.com/image/fetch/$s_!qfDq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 1272w, https://substackcdn.com/image/fetch/$s_!qfDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2b3a9b-aeb4-438e-b0f7-28b6b14e91ad_710x516.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><em>Written by Xiangyu Sun and Ye Tian, members of the engineering teams that work on Otter Marketing.</em></p><p>Running promotions and ads on Online Ordering Platforms has become an effective marketing tool to increase exposure and drive sales for restaurants. However, making informed decisions about which promotions to deploy, customizing them for different times and platforms, and assessing their performance remains challenging. Relying solely on manual processes for campaign management can be not only labor-intensive but also suboptimal.</p><p>We've developed an <a href="https://tryotter.com/products/marketing">automated marketing system</a> inside Otter, the all-in-one restaurant operating system. It has helped Otter customers boost their online marketing with an average revenue uplift of 12%. 
In this blog post, we'll explore the key factors and design decisions that contributed to this success.</p><h2>Prediction and optimization</h2><p>At first glance, manually creating a static timetable for marketing campaigns may seem straightforward, but this approach becomes impractical due to escalating complexity. With numerous configurations targeting different audiences, discount levels, and so on, the combinations can easily number in the hundreds. When factoring in variables such as days, variations across different Online Ordering Platforms, and multiple store locations under management, the number of choices multiplies to tens of thousands.</p><p>We address this complexity by forecasting the potential outcomes for each option, allowing us to compare alternatives and identify which performs better. Then, we employ linear programming to search for the optimal combination of options, adhering to global constraints like the marketing budget.</p><h3>Prediction</h3><p>Historical data offers insights into how business outcomes fluctuate with the timing of marketing campaigns, as well as the impact of geolocation and restaurant type. Using this data, we constructed a suite of time-series regression models to predict key business indicators, such as revenue, order volume, profit, and marketing spend. We experimented with various modeling techniques, ranging from linear models to gradient-boosted trees and neural networks. Ultimately, we chose gradient-boosted trees implemented through <a href="https://github.com/microsoft/LightGBM">LightGBM</a> for its superior performance (see Table 1) and simplicity.</p><p>More important than the type of model, we discovered, was feature engineering, which warranted greater effort to improve training metrics. As we aimed to capture the nuances between different stores and promotion types, one challenge was encoding this information within the feature set. 
Our initial method was to use one-hot encoding, assigning unique IDs to each restaurant and promotion type. However, this approach created high-dimensional sparse feature vectors that not only inflated the model size but also failed to capture semantic meanings. For instance, it did not readily reflect when two stores were in close proximity and served the same cuisine, or that promotions like &#8220;Spend $20, save $5 for all&#8221; only differed from &#8220;Spend $20, save $5 for new customers&#8221; in their target audience.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fZ1p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fZ1p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 424w, https://substackcdn.com/image/fetch/$s_!fZ1p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 848w, https://substackcdn.com/image/fetch/$s_!fZ1p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 1272w, https://substackcdn.com/image/fetch/$s_!fZ1p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fZ1p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png" width="1456" height="846" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91796,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fZ1p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 424w, https://substackcdn.com/image/fetch/$s_!fZ1p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 848w, https://substackcdn.com/image/fetch/$s_!fZ1p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 1272w, https://substackcdn.com/image/fetch/$s_!fZ1p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a52d67e-f3b0-496f-85eb-4fd238825ce4_1466x852.png 1456w" sizes="100vw"
loading="lazy"></picture></div></a></figure></div><p>A common way to handle such a high-dimensional feature space is to project it into a low-dimensional dense embedding space, allowing the system to reflect relative semantic closeness. However, this technique is generally more compatible with neural networks and demands considerably more data and fine-tuning. Therefore, we opted to derive the low-dimensional space directly from the semantic meanings. Rather than employing restaurant IDs, we represented each restaurant by the demographic information of its location and operational characteristics, such as ratings and cuisine type. 
Likewise, instead of using promotion type IDs, we used marketing metadata including the intended audience, minimum spending threshold, maximum discount amount, and ad bidding values. This method yielded a much shorter and denser feature vector with significantly improved training metrics, as shown in Table 1.</p><h3>Optimization</h3><p>In the absence of constraints, making marketing decisions can be straightforward: simply choose the option that maximizes daily revenue, for example. However, restaurateurs often face more granular requirements, such as running promotions only three days a week or keeping marketing spend below a certain threshold within a given period. When multiple Online Ordering Platforms or stores under management are subject to a shared pool of constraints, the combinatorial possibilities skyrocket, rendering this greedy approach ineffective.</p><p>Under these constraints, the challenge transforms into a global optimization problem: determining where, when, and how much of the marketing budget to allocate in order to maximize the aggregate goal across platforms and stores while adhering to the constraints. We formulated this as a mixed-integer programming problem and employed <a href="https://github.com/google/or-tools">OR-Tools</a> as the solver to identify the optimal combinations.</p><h3>Putting it together</h3><p>We deployed cron workflows orchestrated by <a href="https://argoproj.github.io/">Argo</a> to execute training jobs concurrently, followed by inference and the linear programming solver. 
All pipelines are updated daily to accommodate new observations that come in every day and to reflect updated constraints.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L_KG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L_KG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 424w, https://substackcdn.com/image/fetch/$s_!L_KG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 848w, https://substackcdn.com/image/fetch/$s_!L_KG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 1272w, https://substackcdn.com/image/fetch/$s_!L_KG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L_KG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png" width="1456" height="277" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:277,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L_KG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 424w, https://substackcdn.com/image/fetch/$s_!L_KG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 848w, https://substackcdn.com/image/fetch/$s_!L_KG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 1272w, https://substackcdn.com/image/fetch/$s_!L_KG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f2cb882-d259-4332-b8b9-343751224d76_3396x646.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 1. Workflows of prediction, optimization and experimentation.</figcaption></figure></div><p>To enable experimentation, we implemented a shadow system, allowing us to modify models, post-process results, update linear programming solvers, and log and compare outcomes with the production system. 
Any changes to the models are subjected to offline evaluation to ensure they do not compromise training metrics. Model updates and alterations to the rest of the pipeline must then pass online A/B testing before release, to confirm that the changes yield beneficial and statistically significant results. In the following section, we detail the setups for online testing.</p><h2>Measurement and experimentation</h2><p>The shadow system acts as a testing ground where we can explore new features, refine models, and evolve our optimization algorithms. Once changes are ready for trial, we allocate a subset of stores to operate with the shadow system and compare their performance with a similar group of stores running on the production system.</p><h3>Timeline A/B</h3><p>The Timeline A/B experiment is conducted using two comparably profiled groups of stores over two consecutive, equal-length time frames: the control period and the treatment period. During the control period, both the control and treatment groups operate under identical configurations. 
Then, at the beginning of the treatment period, we transition the treatment group to the shadow system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_tqj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_tqj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 424w, https://substackcdn.com/image/fetch/$s_!_tqj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 848w, https://substackcdn.com/image/fetch/$s_!_tqj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 1272w, https://substackcdn.com/image/fetch/$s_!_tqj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_tqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png" width="1456" height="551" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:551,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_tqj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 424w, https://substackcdn.com/image/fetch/$s_!_tqj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 848w, https://substackcdn.com/image/fetch/$s_!_tqj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 1272w, https://substackcdn.com/image/fetch/$s_!_tqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00320d5-ccec-4a84-8fca-3b319c91883c_1600x605.png 1456w" sizes="100vw"
loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2. Timeline A/B experiment subjects.</figcaption></figure></div><p>To guard against pre-existing bias, the control and treatment groups must not differ to a statistically significant degree on any test dimension, such as daily average revenue. Our initial step uses high-level filtering criteria to select candidates for the experiment. Then, we take data from the period immediately preceding the start of the experiment, bucketize it based on a primary dimension like revenue, attempt a random split within each bucket, and combine the subdivided buckets into control and treatment groups. Next, we conduct t-tests across all testing dimensions. A split is deemed successful if no dimension shows a statistically significant difference between the groups. If such a split proves hard to achieve, we revisit and adjust the selection criteria for candidates.</p><p>The Timeline A/B experiment design enables store-level comparisons and difference-in-differences analyses. 
Using these methods, we can determine whether changes implemented in the shadow system result in statistically significant improvements while controlling for temporal variation.</p><p>A/B testing results should also be congruent with human judgment for interpretability. For instance, we conducted a test to compare system responses under different objectives: maximizing order volumes versus maximizing profit, in scenarios without budget constraints. The outcomes were insightful. When the objective was to increase order volumes, the system suggested highly aggressive promotions, which entailed steep discounts with minimal qualifications. In contrast, when the goal was to maximize profits, the system recommended more conservative strategies, like offering free delivery exclusively to new customers or modest discounts with higher spending thresholds.</p><h3>Switchback</h3><p>In situations where the population is too small for a Timeline A/B split, or when the test is better conducted within a single cohort (for example, budget optimization, which applies only within one organization), we use switchback testing. This method alternates between control and treatment settings at fixed intervals on the same cohort. Companies like <a href="https://medium.com/@DoorDash/switchback-tests-and-randomized-experimentation-under-network-effects-at-doordash-f1d938ab7c2a">DoorDash</a> have employed this technique to run experiments in scenarios where network effects might influence the outcome, that is, where control and treatment groups are not independent and can affect each other. Because the experiment involves a single cohort, we can pair the control and treatment intervals and employ paired testing methods to enhance statistical power.</p><p>Switchback testing avoids the issue of uneven cohort splitting, but it does not completely eliminate the potential impact of time differences between control and treatment periods. 
To mitigate the risk of results being affected by coincidental timing, one strategy is to extend the duration of the experiments. Another approach is to run concurrent experiments on two cohorts with reversed control/treatment schedules and then pool the data for analysis.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bktb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bktb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 424w, https://substackcdn.com/image/fetch/$s_!bktb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 848w, https://substackcdn.com/image/fetch/$s_!bktb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 1272w, https://substackcdn.com/image/fetch/$s_!bktb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bktb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png" width="1456" height="435" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bktb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 424w, https://substackcdn.com/image/fetch/$s_!bktb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 848w, https://substackcdn.com/image/fetch/$s_!bktb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 1272w, https://substackcdn.com/image/fetch/$s_!bktb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5951ffc2-c510-4d5e-a15c-07c225ae9723_1466x438.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 3. Switchback experiment and control/treatment group pairing.</figcaption></figure></div><p>We carried out an experiment to evaluate the effectiveness of global optimization across a large group of stores with a shared budget constraint. The control group utilized a greedy approach, optimizing decisions for each individual store, whereas the treatment group applied the global optimization method discussed earlier. Under this holistic optimization strategy, not every store receives the decision that would maximize its individual outcome, but it prevents marketing pauses due to overspending. This approach is particularly effective when budget constraints are tight. 
In our experiment, where the total discount could not exceed 30% of revenue, we observed a statistically significant increase of 7% in total order volume for the treatment group compared to the control group.</p><h3>Continuous monitoring</h3><p>Beyond online testing, which is designed for controlled experiments to validate system improvements, we also establish continuous monitoring of the production system to gauge the efficacy of marketing decisions. A key metric we monitor is the "incremental gain attributable to marketing," which we calculate by comparing scenarios with and without marketing at an aggregate level over a rolling window. Within a rolling window, we calculate the ratio of revenue between days with promotions and days without promotions. We expect that revenue on promotional days will be higher than on non-promotional days, hence the ratio should remain above 1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5DjN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5DjN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 424w, https://substackcdn.com/image/fetch/$s_!5DjN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 848w, https://substackcdn.com/image/fetch/$s_!5DjN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 
1272w, https://substackcdn.com/image/fetch/$s_!5DjN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5DjN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png" width="1456" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5DjN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 424w, https://substackcdn.com/image/fetch/$s_!5DjN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 848w, https://substackcdn.com/image/fetch/$s_!5DjN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5DjN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5237c4c-4095-4ad9-b677-496223f07ab7_2096x896.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 4. Continuous monitoring of revenue lift by promotions.</figcaption></figure></div><p>If the benefits of marketing-led scenarios begin to diminish relative to non-marketed ones, we initiate a human-led investigation to identify potential system bugs or sudden shifts in market conditions. 
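</p><p>A minimal sketch of this rolling-window check follows. It is illustrative only: the function name, the window length, the use of mean daily revenue, and the sample data are assumptions rather than a description of our production monitoring.</p><pre><code>from statistics import mean


def rolling_promo_ratio(days, window):
    """Ratio of mean promo-day revenue to mean non-promo-day revenue.

    `days` is a list of (revenue, had_promotion) tuples in date order.
    Returns one ratio per full window, or None when a window lacks
    either kind of day.
    """
    ratios = []
    for end in range(window, len(days) + 1):
        chunk = days[end - window:end]
        promo = [revenue for revenue, had_promo in chunk if had_promo]
        plain = [revenue for revenue, had_promo in chunk if not had_promo]
        if promo and plain:
            ratios.append(mean(promo) / mean(plain))
        else:
            ratios.append(None)
    return ratios


# Made-up daily data: (revenue, had_promotion).
days = [(120.0, True), (100.0, False), (130.0, True), (100.0, False),
        (90.0, True), (100.0, False), (80.0, True)]
ratios = rolling_promo_ratio(days, window=4)
# A window whose ratio drops below 1 would be flagged for investigation.
needs_review = [r is not None and 1.0 > r for r in ratios]
print(ratios, needs_review)</code></pre><p>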
Fortunately, our experience thus far has demonstrated a consistent uplift from marketing efforts of between 10% and 20%.</p><h2>Conclusion</h2><p>In this post, we discussed the development of Otter Marketing, a tool designed to help restaurant owners navigate online marketing across various platforms to maximize their budget's efficacy. Using predictive models and optimization solvers, the system determines the most effective action from tens of thousands of possibilities each day. With our online testing and monitoring frameworks, we verify that the benefits of marketing are tangible and significant, so that our customers' promotional dollars are put to effective use.</p>]]></content:encoded></item><item><title><![CDATA[Swapping Disks in Kubernetes for Fun and Profit]]></title><description><![CDATA[Introducing the PvcAutoscaler at City Storage Systems]]></description><link>https://techblog.atoms.co/p/swapping-disks-in-kubernetes</link><guid isPermaLink="false">https://techblog.atoms.co/p/swapping-disks-in-kubernetes</guid><pubDate>Tue, 23 Apr 2024 12:02:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ULU2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ULU2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ULU2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 424w, https://substackcdn.com/image/fetch/$s_!ULU2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 848w, https://substackcdn.com/image/fetch/$s_!ULU2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 1272w, https://substackcdn.com/image/fetch/$s_!ULU2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ULU2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png" width="1456" height="832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ULU2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 424w, https://substackcdn.com/image/fetch/$s_!ULU2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 848w, https://substackcdn.com/image/fetch/$s_!ULU2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 1272w, https://substackcdn.com/image/fetch/$s_!ULU2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dfb4203-e734-4afc-ad8d-00aad3826174_1600x914.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Written by Jakob Schultz-Falk, a member of the storage team at City Storage Systems who led the development of the PvcAutoscaler.</em></p><p>Since the introduction of the StatefulSet, more and more stateful workloads have been added to the Kubernetes ecosystem. Unfortunately, there are still plenty of caveats to running stateful workloads in Kubernetes, some of which are caused by fundamental limitations of the StatefulSet controller and the PersistentVolumeClaims (PVCs) it generates.</p><p>While stateless workloads enjoy almost boundless elasticity in Kubernetes, stateful workloads, once deployed, are bound by the volumes they mount. These volumes are static and practically immutable, making it difficult to scale them to match the ever-changing needs of the workloads using them.</p><p>This post presents the solution we&#8217;ve developed at City Storage Systems to reclaim stateful elasticity in order to improve cost efficiency and reduce toil.</p><h1>The Problem</h1><p>While there is some support in Kubernetes for expanding storage volumes, there is no such mechanism for reducing storage capacity, nor for changing the underlying storage type. 
In fact, the original Kubernetes Enhancement Proposal (KEP) for volume expansion explicitly lists volume reduction as a <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/284-enable-volume-expansion#non-goals">non-goal</a> due to the complexities of shrinking volumes.</p><p>Furthermore, the volumeClaimTemplates section of the StatefulSet spec is currently immutable, preventing operators from expanding volumes via a high-level API, though an <a href="https://github.com/kubernetes/enhancements/issues/661">enhancement</a> has been proposed to change this.</p><p>As a result, the total provisioned capacity only ever grows, and any storage-related change requires a high-effort, risky migration, inflating the cost of running stateful workloads.</p><p>To summarize, the two primary culprits restraining stateful elasticity are:</p><ul><li><p>Only volume expansion is supported, and only by a subset of volume provisioners</p></li><li><p>Immutable volume claim templates embedded in the StatefulSet spec</p></li></ul><h1>Our Objectives</h1><p>Now that we have a clear understanding of the problem, let&#8217;s list the objectives we want to achieve with our new solution.</p><ul><li><p>Volume expansion for growing storage needs</p></li><li><p>Volume shrinking to reclaim the cost of unused storage</p></li><li><p>Volume modification, e.g. swapping an HDD for an SSD or vice versa</p></li></ul><p>All of these objectives should be met by an on-demand, declarative, and toil-free solution that any software engineer familiar with Kubernetes and StatefulSets can use.</p><h1>Kubernetes Storage Concepts</h1><p>Before diving into the inner workings of the PvcAutoscaler, let&#8217;s briefly introduce some of the core Kubernetes components we will rely on to achieve our objectives. 
These concepts are described in more detail in the official <a href="https://kubernetes.io/docs/concepts/storage/">Kubernetes storage documentation</a>.</p><h2>Persistent Volumes and Persistent Volume Claims</h2><p>Persistent Volumes (PVs) are Kubernetes resources that describe the underlying storage they represent. This storage is typically (but not necessarily) a disk managed by a cloud provider.</p><p>Persistent Volume Claims (PVCs) are resources that bind to PVs, thereby reserving the PV across the cluster. Pods in the same namespace as the PVC can then reference the PVC to use the underlying storage.</p><h2>The StorageClass</h2><p>To provision different types of storage, Kubernetes offers the concept of a StorageClass. For example, one StorageClass could be configured to provision PVs backed by SSDs while another provides access to HDDs.</p><p>In general, a new PV/PVC pair for a pod is provisioned by creating a PVC that references a StorageClass. Because the PVC is not bound to any PV at creation, the volume provisioner of the StorageClass attempts to provision a PV and bind it to the PVC.</p><p>The StorageClass contains a reference to a volume provisioner along with the parameters needed to create a new PV backed by some underlying storage. For example, a volume provisioner developed by a cloud provider would take the input parameters and provision a network-attached disk with the desired configuration.</p><h2>The StatefulSet</h2><p>This resource type was added to Kubernetes to provide stable pod names mapped to persistent storage across pod restarts. 
So while a Deployment creates pods with a new unique name each time, a StatefulSet creates pods with sequential ordinals, each mapping one-to-one to a PVC with the same ordinal.</p><p>The PVCs can either be provisioned manually up front or be created by the StatefulSet controller from the volume claim templates defined in its spec. As mentioned above, the StatefulSet offers no support for changing the PVCs or the underlying storage once they are provisioned.</p><h1>The PvcAutoscaler</h1><p>Now that we&#8217;ve introduced the core storage concepts, we can begin using them to build up our PvcAutoscaler solution. We start by addressing the shortcomings of the StatefulSet.</p><h2>Taking Control</h2><p>Since volume claim templates in StatefulSets are immutable, our first task is to detach them so we can modify them as needed. We accomplish this by creating a new <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">Custom Resource Definition</a> (CRD) called PvcAutoscaler. Each PvcAutoscaler resource contains a single volume claim template and a reference to the StatefulSet whose PVCs it manages.</p><pre><code>apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
  namespace: default
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.Kubernetes.io/nginx:latest
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
<s>  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: hdd-sc
      volumeMode: Filesystem</s></code></pre><p><em>Figure 1: Splitting out the volume claim templates from the StatefulSet</em></p><pre><code>apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
  namespace: default
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.Kubernetes.io/nginx:latest
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
<strong>---
apiVersion: storage.css.com/v1
kind: PvcAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  statefulSetName: nginx
  volumeClaimTemplate:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: hdd-sc
      volumeMode: Filesystem</strong></code></pre><p><em>Figure 2: Moving the volume claim templates into PvcAutoscaler resources</em></p><p>Now that the volume claim templates are isolated in PvcAutoscaler resources, we can create a Kubernetes operator for the CRD that generates a PVC for each pod in the referenced StatefulSet, based on the embedded volume claim template.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ajW5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ajW5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 424w, https://substackcdn.com/image/fetch/$s_!ajW5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 848w, https://substackcdn.com/image/fetch/$s_!ajW5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 1272w, https://substackcdn.com/image/fetch/$s_!ajW5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ajW5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png" width="550" 
height="595.2222222222222" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1461,&quot;width&quot;:1350,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ajW5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 424w, https://substackcdn.com/image/fetch/$s_!ajW5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 848w, https://substackcdn.com/image/fetch/$s_!ajW5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 1272w, https://substackcdn.com/image/fetch/$s_!ajW5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35700819-5d5e-481c-95d5-7db0c156171b_1350x1461.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 3: PvcAutoscaler becoming responsible for the creation of PVCs for a StatefulSet</em></p><p>We now have PVCs for the pods of a StatefulSet, generated from the volume claim template embedded in a PvcAutoscaler custom resource. Unfortunately, the StatefulSet controller is completely unaware of these PVCs and so cannot include them in the pod specs as mountable volumes. This brings us to our next task.</p><h2>Attaching Volumes</h2><p>Luckily, Kubernetes gives us a tool we can use to ensure PVCs are available as volumes in the pod spec. 
By adding a mutating webhook targeting the pods of StatefulSets referenced by a PvcAutoscaler, we can look up the PVCs and attach them to a pod&#8217;s volumes spec prior to validation and creation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TwUd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TwUd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 424w, https://substackcdn.com/image/fetch/$s_!TwUd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 848w, https://substackcdn.com/image/fetch/$s_!TwUd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 1272w, https://substackcdn.com/image/fetch/$s_!TwUd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TwUd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png" width="1456" height="1093" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1093,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TwUd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 424w, https://substackcdn.com/image/fetch/$s_!TwUd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 848w, https://substackcdn.com/image/fetch/$s_!TwUd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 1272w, https://substackcdn.com/image/fetch/$s_!TwUd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbfa732c-52dd-4405-8d81-c3d357179d46_1600x1201.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 4: Mutating pod webhook attaching PvcAutoscaler PVCs to StatefulSet pods</em></p><p>After the PVCs are attached, each pod has a valid spec in which every volumeMount references a correctly declared volume.</p><h2>Expansion Unlocked</h2><p>After all that work, we are back at a functional setup similar to a plain StatefulSet. We can define a StatefulSet and a number of PvcAutoscaler resources, which manage the PVCs and attach them to the pods at creation.</p><p>However, we have unlocked one great benefit from this exercise. We have gained complete control over the volume claim template embedded in the PvcAutoscaler and can decide which limitations to impose. In particular, we can make properties mutable.</p><p>The first target for mutability is the .spec.resources.requests.storage property, since our first objective is to allow seamless volume expansion. 
Volume expansion is trivial when the underlying volume provisioner natively supports it - we just propagate the change in requests to all PVCs and wait for the provisioner to finish the expansion.</p><p>When volume expansion is not natively supported, we need to follow the more cumbersome process detailed in the next sections.</p><h2>The Volume Populator</h2><p>While volume expansion is supported by some volume provisioners, neither shrinking a volume nor changing the underlying storage device is supported at all. To enable these features, we introduce a new custom component, the volume-populator.</p><p>PVCs have for some time supported the DataSource property, originally intended to bootstrap a PVC from a volume snapshot. However, with the AnyVolumeDataSource feature gate enabled (the default since Kubernetes 1.24), the closely related dataSourceRef property can reference any custom resource (CR).</p><p>To leverage this, we create a new CRD called PvcSourcePopulator, which is owned by the volume-populator. Its only purpose is to reference the old PVC, and it can itself be referenced from a new PVC through the dataSourceRef property.</p><pre><code>apiVersion: storage.css.com/v1
kind: PvcSourcePopulator
metadata:
  name: populator-pvc-0-ba51f
  namespace: default
spec:
  sourcePvcRef:  # the old PVC whose data is transferred to the new PVC
    name: pvc-0-8770e
    namespace: default
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-0-ba51f
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
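  dataSourceRef:
    # Custom data source; relies on the AnyVolumeDataSource feature gate
    apiGroup: storage.css.com
    kind: PvcSourcePopulator
    name: populator-pvc-0-ba51f
  resources:
    requests:
      storage: 2Gi  # smaller than the capacity requested by the old PVC
  storageClassName: ssd-sc  # a different storage class than the old PVC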
  volumeMode: Filesystem</code></pre><p><em>Figure 5: A PvcSourcePopulator CR being referenced by a new PVC with a different storage class and smaller requested capacity</em></p><p>The volume-populator is responsible for transferring the content of the old PVC referenced in the PvcSourcePopulator to the new PVC. The process it follows can be summarized in the following steps:</p><ol><li><p>PVC<sub>new</sub> is created with a PvcSourcePopulator data source referencing PVC<sub>old</sub>. PVC<sub>new</sub> enters a Pending state since the volume provisioner does not recognize the data source type - this allows us to take over the task of eventually binding a PV to PVC<sub>new</sub></p></li></ol><ol start="2"><li><p>The volume-populator monitors pods in the namespace and waits for PVC<sub>new</sub> to be referenced, which ensures the data transfer is executed immediately prior to pod initialization</p></li></ol><ol start="3"><li><p>Once PVC<sub>new</sub> is referenced, the volume-populator creates an empty PVC<sub>tmp</sub> whose spec is identical to that of PVC<sub>new</sub>, but without a data source. Without the data source, the volume provisioner is able to create a PV and bind it to PVC<sub>tmp</sub></p></li></ol><ol start="4"><li><p>The volume-populator spins up a pod that mounts both PVC<sub>old</sub> and PVC<sub>tmp</sub> and transfers the content from PVC<sub>old</sub> to PVC<sub>tmp</sub></p></li></ol><ol start="5"><li><p>Once the transfer completes, the volume-populator unbinds the PV from PVC<sub>tmp</sub> and instead binds it to PVC<sub>new</sub></p></li></ol><ol start="6"><li><p>The volume-populator discards PVC<sub>tmp</sub>, which no longer has a PV bound to it</p></li></ol><ol start="7"><li><p>PVC<sub>new</sub> becomes ready as it now has a PV bound, and the pod can initialize normally</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!28fh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!28fh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 424w, https://substackcdn.com/image/fetch/$s_!28fh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 848w, https://substackcdn.com/image/fetch/$s_!28fh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 1272w, 
https://substackcdn.com/image/fetch/$s_!28fh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!28fh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png" width="1456" height="801" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:801,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!28fh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 424w, https://substackcdn.com/image/fetch/$s_!28fh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 848w, https://substackcdn.com/image/fetch/$s_!28fh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 1272w, 
https://substackcdn.com/image/fetch/$s_!28fh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af68041-0c0d-4fc2-8def-efaeb789b471_1600x880.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 6: The volume-populator bootstrapping a new PVC for pod-0</em></p><h2>Just Change</h2><p>By using the volume-populator in the PvcAutoscaler operator we can efficiently and safely swap out the PVC to match the desired state of the volume claim template. 
This allows us to support changes to almost all aspects of the volume claim template, including the storage class, which lets us change the underlying storage device.</p><p>When the PvcAutoscaler operator detects drift between the volume claim template and the current PVCs of a StatefulSet, it first determines whether the change can be applied via online expansion. If it cannot, it initiates the process of swapping out the PVCs using the volume-populator. This process can be summarized as follows:</p><ol><li><p>Create a new PVC for each pod, with a PvcSourcePopulator data source referencing the old PVC</p></li></ol><ol start="2"><li><p>Initiate a rolling restart of the StatefulSet</p></li></ol><ol start="3"><li><p>Upon pod creation the mutating webhook will inject the new PVC</p></li></ol><ol start="4"><li><p>The volume-populator will detect a pod being created with a reference to a PVC with a PvcSourcePopulator DataSource</p></li></ol><ol start="5"><li><p>The volume-populator will transfer the data and once it completes the pod will start up normally</p></li></ol><ol start="6"><li><p>This process is repeated for each pod in the StatefulSet</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qbAe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qbAe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 424w, 
https://substackcdn.com/image/fetch/$s_!qbAe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 848w, https://substackcdn.com/image/fetch/$s_!qbAe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 1272w, https://substackcdn.com/image/fetch/$s_!qbAe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qbAe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png" width="1456" height="1188" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qbAe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 424w, 
https://substackcdn.com/image/fetch/$s_!qbAe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 848w, https://substackcdn.com/image/fetch/$s_!qbAe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 1272w, https://substackcdn.com/image/fetch/$s_!qbAe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d017a6b-59e0-4d4f-89c3-8aa628650a06_1600x1305.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 7: Swapping out the PVCs by using the volume-populator to transfer all data from the previous PVC</em></p><p>Once all the pods have been restarted and have had their PVCs bootstrapped, the StatefulSet re-enters a steady state. After a pre-configured retention period, the PvcAutoscaler automatically deletes the old PVCs so they no longer incur cost.</p><p>The capabilities of the PvcAutoscaler allow us to ignore future requirements for a stateful workload's storage characteristics and focus on the current need. We can start with a small HDD, confident that we can move to an SSD if the need arises. We can add extra capacity, knowing that we can scale in again later to save on cost. All on-demand, declarative, and toil-free.</p><h1>Next Steps</h1><p>The current solution is quite robust and battle-tested, and has drastically improved the cost efficiency of our stateful workloads, but there is always room for improvement. In the future we&#8217;re likely to explore the following areas:</p><ul><li><p>Adding actual auto-scaling to the PvcAutoscaler, allowing automated execution of PVC modifications based on declared sizing strategies and disk metrics</p></li></ul><ul><li><p>Support for stateful workloads that do not use StatefulSets but instead run under a custom pod controller</p></li></ul><ul><li><p>Hot copying of data to reduce the time it takes to scale down disks and switch storage types</p></li></ul><h1>Conclusion</h1><p>The key lesson we&#8217;ve taken to heart from developing the PvcAutoscaler is that Kubernetes is immensely extensible. With enough insight into the ecosystem, it is possible to extend even core components with new features and remove otherwise hard limitations. 
This ability to extend Kubernetes is especially relevant for stateful workloads, which tend to have more specific requirements and limitations than stateless workloads.</p><p>Overall the PvcAutoscaler has provided great value since it was rolled out. While the original focus was on reducing cost by eliminating wasteful over-provisioning of storage capacity, the PvcAutoscaler has also enabled us to easily move workloads onto more performant disks when usage reached IOPS and throughput limits.</p><p>Without the PvcAutoscaler we would have been forced to expend a lot more time and resources to right-size the storage of our stateful workloads as consumer workload usage patterns change over time. In many instances the effort required to scale in storage would not be worth the cost reduction, leading to an ever-increasing overhead. But the PvcAutoscaler makes even small cuts in storage worthwhile.</p>]]></content:encoded></item><item><title><![CDATA[How OtterPOS Maintains Business Continuity During Outages]]></title><description><![CDATA[The evolution of "offline mode"]]></description><link>https://techblog.atoms.co/p/otterpos-outage-resilience</link><guid isPermaLink="false">https://techblog.atoms.co/p/otterpos-outage-resilience</guid><pubDate>Tue, 09 Apr 2024 13:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_llc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_llc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_llc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 424w, https://substackcdn.com/image/fetch/$s_!_llc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 848w, https://substackcdn.com/image/fetch/$s_!_llc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 1272w, https://substackcdn.com/image/fetch/$s_!_llc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_llc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png" width="1456" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:800896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!_llc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 424w, https://substackcdn.com/image/fetch/$s_!_llc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 848w, https://substackcdn.com/image/fetch/$s_!_llc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 1272w, https://substackcdn.com/image/fetch/$s_!_llc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eedb6ad-a524-419c-b372-e19c657fffac_2136x992.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>This post was written by Adam Share, an engineer who helped build OtterPOS's offline mode. Matt Park and Tim Pinkawa contributed notes on backend design.</em></p><p>A little over one year ago, the Otter team embarked on a mission to build a point-of-sale (POS) product that meets the needs of many of our restaurant partners. Otter's Order Manager product has already proven an invaluable tool for managing a restaurant's "online" orders from food delivery providers, but what about all those in-person "offline" orders that make up 80% of a restaurant's business?</p><p>This post describes how we built a single solution to manage both your online and offline orders using a restaurant's existing hardware, and how that product maintains business continuity in the face of unexpected outages and connectivity issues. Our goal is to deliver the most reliable POS experience in the industry: no orders lost, ever.</p><h1>OtterPOS: Going online</h1><p>OtterPOS combined all the things we're good at in the online world with new hardware and software to support the offline world. 
Payments go through a Stripe terminal, and a large screen makes it easy to customize new orders.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gykq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gykq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 424w, https://substackcdn.com/image/fetch/$s_!gykq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 848w, https://substackcdn.com/image/fetch/$s_!gykq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 1272w, https://substackcdn.com/image/fetch/$s_!gykq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gykq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png" width="1184" height="648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gykq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 424w, https://substackcdn.com/image/fetch/$s_!gykq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 848w, https://substackcdn.com/image/fetch/$s_!gykq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 1272w, https://substackcdn.com/image/fetch/$s_!gykq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f7e13a1-e5a6-44a5-9095-f123efd48fe4_1184x648.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption"><em>Figure 1: Simplified architecture diagram V1</em></figcaption></figure></div><p>This version worked well, and we delighted our initial customers! They loved our simple user interface with a single home for all their orders. But, as we discussed our product with larger restaurant groups, we learned there was an important feature they cared about that we didn't yet support:</p><p>What happens when the internet goes down? What if our servers go down? If you have a customer physically handing you money, it's unacceptable that your POS stops working for any reason.</p><p>Traditional POS systems solve this by running a large server onsite in the restaurant. The POS and other devices connect to the server and the server connects to the internet. When the internet goes down, the server insulates the devices by continuing to process orders. 
This has some drawbacks though - restaurants are required to buy and install the server, the server is a single point of failure, and onboarding to the system is a lot more complicated.</p><p>Can we do better?</p><h1>OtterPOS: Off the grid</h1><p>So we built "offline mode".</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o6Ne!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o6Ne!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 424w, https://substackcdn.com/image/fetch/$s_!o6Ne!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 848w, https://substackcdn.com/image/fetch/$s_!o6Ne!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!o6Ne!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o6Ne!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png" width="1456" height="925" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154880,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o6Ne!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 424w, https://substackcdn.com/image/fetch/$s_!o6Ne!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 848w, https://substackcdn.com/image/fetch/$s_!o6Ne!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!o6Ne!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821ea2bf-401e-442b-8e54-60b444788c75_2368x1504.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 2: Simplified architecture diagram V2</em></figcaption></figure></div><p>Two new components allow our devices to continue to operate in the event of connectivity loss:</p><ol><li><p>A <strong>write-ahead log</strong> in the OtterPOS device to capture all of the events that happened while the device was offline</p></li><li><p>An <strong>asynchronous workflow engine</strong> in our servers, built on <a href="https://temporal.io/">Temporal</a>, to process the log when connectivity is restored</p></li></ol><p>The write-ahead log captures every event that happens on the device. Every order creation, order update, payment, refund, and clock-in/clock-out is saved to persistent storage. This means that if the device is turned off for the night and turned back on for the morning shift, the events are still there and the device will attempt to flush them. 
The events are kept in a local database until the device is sure that the server has received and processed them, at which point they are cleared from the DB.</p><p>The device will attempt to flush the log after every new event, and periodically if it detects that connectivity has been restored. When the server receives the log, it immediately starts up a durable workflow to process the events. As soon as the server is sure the workflow is started, it returns the workflow ID so that the device knows the events have been received and will be processed.</p><p>Within a single order, events are processed serially so that they are applied in the correct sequence. However, we process events from different orders in parallel with separate Temporal activities to ensure a failure in one order does not block the processing of another order.</p><p>In the event of a Temporal activity failure, the workflow will retry indefinitely to ensure the event is not lost. These events already happened in the real world, so we cannot fail to process and record them in our system of record. The durable nature of the workflow and our retry logic provide us with this guarantee.</p><p>For transient failures, the activity will eventually succeed on its own. For failures due to backend bugs, an engineer will fix the bug and redeploy, at which point the activity will automatically retry and succeed, moving on to the next event to process. In the event of a client bug where we receive unexpected bad input, we even have the ability to manually rerun a workflow with the correct input.</p><p>Upon success, we send an event status to our push framework, which pushes the status back to the device. In the event of an unrecoverable error, we send a failure status back to the push framework. In either case, the device now knows the event was processed and that it is safe to clear from the DB.</p><p>Offline mode has already saved our customers from downtime. 
In March 2023, our cloud provider, Azure, experienced an outage, degrading some of our services. Our customers using OtterPOS did not notice because orders and payments were still processed locally. When the outage was over, our workflows processed all the pending events and everything worked as expected.</p><h1>OtterPOS: Offline evolution</h1><p>Offline mode works great when you have a single POS device, but what about restaurants with multiple POS devices? What about restaurants with KDS devices (kitchen display systems)? We still rely on our backend to sync state between the devices, so that an order started on one POS can be finished on the second POS or viewed in a KDS.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gc7a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gc7a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 424w, https://substackcdn.com/image/fetch/$s_!gc7a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 848w, https://substackcdn.com/image/fetch/$s_!gc7a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gc7a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gc7a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png" width="1456" height="847" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gc7a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 424w, https://substackcdn.com/image/fetch/$s_!gc7a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 848w, https://substackcdn.com/image/fetch/$s_!gc7a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gc7a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37cddddf-a56b-4d86-b2b5-3cecedc56499_1664x968.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 3: Complex restaurant setup. CSS: City Storage Systems backend servers, KDS: Kitchen Display System</em></figcaption></figure></div><p>The next version of our POS product will solve this by creating a mesh of interconnected devices all sharing state with each other and syncing events to our servers. 
This will allow all devices in the restaurant to share state even when the network goes down.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1na1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1na1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 424w, https://substackcdn.com/image/fetch/$s_!1na1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 848w, https://substackcdn.com/image/fetch/$s_!1na1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 1272w, https://substackcdn.com/image/fetch/$s_!1na1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1na1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png" width="1312" height="982" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1312,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73674,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1na1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 424w, https://substackcdn.com/image/fetch/$s_!1na1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 848w, https://substackcdn.com/image/fetch/$s_!1na1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 1272w, https://substackcdn.com/image/fetch/$s_!1na1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff04251a3-647b-4406-b5c5-bf39c1a8d947_1312x982.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Our local mesh network of devices will leverage an existing <a href="https://github.com/google/nearby">open-source</a> peer-to-peer networking API, <a href="https://developers.google.com/nearby/connections/overview">Nearby Connections</a>. Nearby Connections abstracts away Bluetooth, BLE, Wi-Fi, and LAN connections and will automatically connect using the best available method. 
Each device can connect to one or more devices over any of the available protocols to form the mesh.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GbA7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GbA7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 424w, https://substackcdn.com/image/fetch/$s_!GbA7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 848w, https://substackcdn.com/image/fetch/$s_!GbA7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 1272w, https://substackcdn.com/image/fetch/$s_!GbA7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GbA7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png" width="1456" height="479" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43887,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GbA7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 424w, https://substackcdn.com/image/fetch/$s_!GbA7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 848w, https://substackcdn.com/image/fetch/$s_!GbA7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 1272w, https://substackcdn.com/image/fetch/$s_!GbA7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd48f6c94-229c-4f6d-a223-2ee613e081db_1616x532.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 5: Multiple networking protocols supported</em></figcaption></figure></div><p>When two devices establish a connection, they move communication to the highest-bandwidth pathway available. So if two devices find each other over Bluetooth first but then join the same WLAN, they will move from the Bluetooth connection to a higher-bandwidth WLAN connection.</p><p>With devices able to discover and connect to each other to form a mesh network, all app data can now be distributed over the local network. We&#8217;ll follow a &#8220;publish&#8211;subscribe&#8221; pattern to generalize how data is sent and received, so that existing features and future development can leverage the offline-first approach. 
Published events will also be cached so that, if devices become disconnected or new devices connect, the relevant data can be exchanged to bring all devices back into sync.</p><p>In this model, the devices act as both clients and servers and are able to sync data between themselves without the backend orchestrating it. The backend is still essential for durable storage of events, but its responsibility is reduced from being an authoritative coordinator to being a peer node in a mesh of devices.</p><p>With our new local mesh functionality, all devices in the restaurant can seamlessly continue to operate with no downtime when connectivity is lost, fulfilling our promise of business continuity.</p><h1>Conclusion</h1><p>Customers have been delighted with our single solution for managing both online and offline orders. But a great solution won't get you very far unless it is reliable and allows you to run your business even when things go wrong. This is why offline mode has been a key focus area for our team and continues to be a competitive advantage as we scale offline mode to more scenarios. We've already seen offline mode save our customers hundreds of orders during various unforeseen connectivity issues.</p><p>We will continue working to make our entire fleet of devices work while offline, including front-of-house Order Manager, back-of-house Kitchen Display Systems, and customer-facing kiosks. They will all work seamlessly together even when the network drops.</p><p>Additionally, we plan to make more features available while offline. We focused our efforts first on ensuring business continuity by supporting order placement and payment. But we plan to bring offline mode to menu management, analytics, and more in the near future.</p>]]></content:encoded></item><item><title><![CDATA[Managing 100s of Kubernetes Clusters using Cluster API]]></title><description><![CDATA[Automating every step from cluster creation to workload-ready. 
Turtles all the way down.]]></description><link>https://techblog.atoms.co/p/managing-100s-of-kubernetes-clusters</link><guid isPermaLink="false">https://techblog.atoms.co/p/managing-100s-of-kubernetes-clusters</guid><pubDate>Tue, 26 Mar 2024 14:01:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8ec528f2-0ea8-4e52-87e4-f129052c741d_2000x1361.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jTN6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jTN6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 424w, https://substackcdn.com/image/fetch/$s_!jTN6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 848w, https://substackcdn.com/image/fetch/$s_!jTN6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!jTN6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jTN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png" width="1456" height="703" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:771464,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jTN6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 424w, https://substackcdn.com/image/fetch/$s_!jTN6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 848w, https://substackcdn.com/image/fetch/$s_!jTN6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!jTN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41950956-1c34-4bc1-942d-a174356c2ccf_2094x1011.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Written by Zain Malik and Nibir Bora, members of the engineering teams that work on core infrastructure.</em></p><p>At City Storage Systems, our Core Infrastructure team manages over 100 multi-tenant Kubernetes clusters, each hosting up to tens of thousands of daily active pods. Our entire software stack runs on Kubernetes, from mission-critical microservices to stateful databases and observability solutions.</p><p>This blog post describes how we achieved complete automation of cluster provisioning, lifecycle management, and upgrades. 
With the new toolset, we slashed the time required to provision and prepare a workload-ready cluster from 1.5 weeks to under 1 day, all while maintaining a lean team of engineers. This transformation was catalyzed by our strategic decision to migrate to Microsoft Azure within a deadline of a few months. During the transition the number of clusters we operate more than doubled.</p><p>We present a set of Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">custom resources</a> and <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operators</a> that models our infrastructure and associated operations. The flexibility of the Kubernetes operator pattern makes this approach extremely powerful and we are confident it can be used to manage clusters on any public cloud provider.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LQzQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LQzQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 424w, https://substackcdn.com/image/fetch/$s_!LQzQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 848w, https://substackcdn.com/image/fetch/$s_!LQzQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LQzQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LQzQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png" width="1456" height="1073" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1073,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LQzQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 424w, https://substackcdn.com/image/fetch/$s_!LQzQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 848w, https://substackcdn.com/image/fetch/$s_!LQzQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LQzQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F735b07a2-b10e-4030-b8ca-4372ab3d02c6_2263x1667.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Hierarchy of all custom resources used to manage a Kubernetes cluster and node pools.</figcaption></figure></div><h1>Adopting Cluster API</h1><p>We initially created clusters using Terraform and from then on managed node pools using a homegrown Kubernetes operator. Changes, including Kubernetes version upgrades, were handled via GitOps. 
However, the cognitive overhead and manual intervention required made this approach unsustainable, especially for migrating 80+ clusters from one cloud provider to another.</p><p>This led us to explore <a href="https://cluster-api.sigs.k8s.io/">Cluster API</a>, which offers declarative APIs that simplify provisioning, upgrading, and managing multiple Kubernetes clusters. Two key factors made Cluster API particularly attractive:</p><ol><li><p>Extensibility: Cluster API's custom resources are extended by <a href="https://cluster-api.sigs.k8s.io/reference/providers.html">provider</a> custom resources, which can then be extended by higher-level custom resources that capture organizational needs. Since our clusters are multi-tenant, this allowed us to abstract away any details about node pools from workload developers.</p></li><li><p>Operator Pattern: Every extension of Cluster API and its providers is a Kubernetes operator, which aligns with our Kubernetes-centric approach to infrastructure management. 
Given our team's experience building operators, the learning curve was minimal.</p></li></ol><p>To create a cluster using Cluster API and the Cluster API provider for Azure (CAPZ), we simply need to create objects of the following custom resources:</p><ol><li><p><code>Cluster</code> (from Cluster API)</p></li><li><p><code>AzureManagedCluster</code> and <code>AzureManagedControlPlane</code> (from CAPZ)</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T_z0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T_z0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 424w, https://substackcdn.com/image/fetch/$s_!T_z0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 848w, https://substackcdn.com/image/fetch/$s_!T_z0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 1272w, https://substackcdn.com/image/fetch/$s_!T_z0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!T_z0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png" width="1456" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T_z0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 424w, https://substackcdn.com/image/fetch/$s_!T_z0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 848w, https://substackcdn.com/image/fetch/$s_!T_z0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 1272w, https://substackcdn.com/image/fetch/$s_!T_z0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bdfa81-374f-41cb-ab43-8bcc0a54e3f0_1456x592.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4MTt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4MTt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 424w, 
https://substackcdn.com/image/fetch/$s_!4MTt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 848w, https://substackcdn.com/image/fetch/$s_!4MTt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 1272w, https://substackcdn.com/image/fetch/$s_!4MTt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4MTt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png" width="1456" height="852" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:852,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4MTt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 424w, 
https://substackcdn.com/image/fetch/$s_!4MTt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 848w, https://substackcdn.com/image/fetch/$s_!4MTt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 1272w, https://substackcdn.com/image/fetch/$s_!4MTt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbca26f04-8570-47f9-a972-4c0506a560dd_1456x852.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qYW4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qYW4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 424w, https://substackcdn.com/image/fetch/$s_!qYW4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 848w, https://substackcdn.com/image/fetch/$s_!qYW4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 1272w, https://substackcdn.com/image/fetch/$s_!qYW4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qYW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png" width="1456" height="294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57239,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qYW4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 424w, https://substackcdn.com/image/fetch/$s_!qYW4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 848w, https://substackcdn.com/image/fetch/$s_!qYW4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 1272w, https://substackcdn.com/image/fetch/$s_!qYW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f6ca26-159f-4da8-95ac-e8d750279794_1456x294.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Similarly, to create a node pool we need to create objects of the following custom resources:</p><ol><li><p><code>MachinePool</code> (from Cluster API)</p></li><li><p><code>AzureManagedMachinePool</code> (from CAPZ)</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Fpnv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fpnv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!Fpnv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!Fpnv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!Fpnv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fpnv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png" width="1456" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fpnv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!Fpnv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!Fpnv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!Fpnv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1993e4-d57e-43a3-ac72-b0961df74422_1456x816.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zDpZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zDpZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 424w, https://substackcdn.com/image/fetch/$s_!zDpZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!zDpZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 1272w, https://substackcdn.com/image/fetch/$s_!zDpZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zDpZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png" width="1456" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd0478f0-121d-468d-a068-7af386804540_1456x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zDpZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 424w, https://substackcdn.com/image/fetch/$s_!zDpZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!zDpZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 1272w, https://substackcdn.com/image/fetch/$s_!zDpZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0478f0-121d-468d-a068-7af386804540_1456x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We were able to seamlessly integrate these processes into our existing CI pipelines, relying fully on GitOps for cluster management. 
We do not use the <code>clusterctl</code> CLI provided by Cluster API.</p><p>However, there were three major roadblocks to adopting Cluster API immediately:</p><ol><li><p>Limited support for managed Kubernetes distributions in Cluster API. Although this was part of Cluster API&#8217;s roadmap, most of the work so far had focused on self-managed Kubernetes.</p></li><li><p>CAPZ&#8217;s support for Azure Kubernetes Service (AKS), Azure&#8217;s managed Kubernetes distribution, was still in an experimental phase and lacked essential features required for our use case.</p></li><li><p>No major engineering organization was using Cluster API for AKS (at least none that we knew of at the time).</p></li></ol><p>We leaned on our partnership with Microsoft Azure to find a path forward. They suggested we collaborate on the CAPZ project in open source to achieve feature completeness. Several engineers from the Microsoft AKS team, along with engineers from our Core Infrastructure team, contributed to the CAPZ project, prioritizing features aligned with our production use cases. This collaboration was a huge success. It enabled us to launch our first Kubernetes cluster using Cluster API and CAPZ within three months.</p><h1>Automating Workload-Ready Clusters</h1><p>While Cluster API and CAPZ simplified cluster creation, these clusters weren&#8217;t ready for workloads yet:</p><ol><li><p>A new cluster does not have permission to access container images from Azure Container Registry (ACR). It is a reasonable design choice to leave such dependencies out of Cluster API to keep the interface generic.</p></li><li><p>An AKS cluster comes configured with a default cluster autoscaler profile. Configuring anything other than the default can be done by manually running an Azure CLI command. 
We tune Cluster Autoscaler to achieve resource optimization and bin-packing on all of our production clusters.</p></li></ol><p>Using Terraform and running Azure CLI commands to configure these for every cluster wasn&#8217;t aligned with our principle of minimizing human intervention. So, we decided to write a companion Kubernetes operator instead. It introduces <code>AzureClusterAdditionalConfig</code>, an extensible custom resource intended for any additional Azure managed service configurations necessary for a cluster. For ACR permissions this resolves to an <code>AzureRoleAssignment</code> object, and for custom cluster autoscaler configuration to an <code>AzureClusterAutoscaler</code> object.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9E6U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9E6U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 424w, https://substackcdn.com/image/fetch/$s_!9E6U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 848w, https://substackcdn.com/image/fetch/$s_!9E6U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9E6U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9E6U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png" width="1456" height="852" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:852,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9E6U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 424w, https://substackcdn.com/image/fetch/$s_!9E6U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 848w, https://substackcdn.com/image/fetch/$s_!9E6U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9E6U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F985a5052-d642-4bd4-a649-512f03b449bf_1456x852.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s4xw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s4xw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 424w, https://substackcdn.com/image/fetch/$s_!s4xw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 848w, https://substackcdn.com/image/fetch/$s_!s4xw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 1272w, https://substackcdn.com/image/fetch/$s_!s4xw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s4xw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png" width="1456" height="518" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131708,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!s4xw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 424w, https://substackcdn.com/image/fetch/$s_!s4xw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 848w, https://substackcdn.com/image/fetch/$s_!s4xw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 1272w, https://substackcdn.com/image/fetch/$s_!s4xw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee4c476b-d8b1-40b4-8bb2-28cd780b7019_1456x518.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qLlX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qLlX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 424w, https://substackcdn.com/image/fetch/$s_!qLlX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 848w, https://substackcdn.com/image/fetch/$s_!qLlX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 1272w, https://substackcdn.com/image/fetch/$s_!qLlX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qLlX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png" width="1456" height="666" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:666,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qLlX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 424w, https://substackcdn.com/image/fetch/$s_!qLlX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 848w, https://substackcdn.com/image/fetch/$s_!qLlX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 1272w, https://substackcdn.com/image/fetch/$s_!qLlX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8fee42-a7db-4ff8-b08a-7de297c167c0_1456x666.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Introducing the companion operator enabled us to fully automate cluster creation and make each cluster workload-ready by installing a single Kubernetes resource through GitOps. This streamlined approach facilitated the migration of over 80 production clusters between cloud providers.</p><h1>Automating Node Pools</h1><p>After running production workloads on a new cloud provider for a few months, we identified two major operational pain points.</p><ol><li><p>We didn&#8217;t nail the node settings (instance family, disk type, etc.) on the first attempt. Several of these fields, such as machineType, diskSize, diskType, maxPod, and type (spot vs. regular), are immutable on AKS. This meant we had to replace node pools running production workloads a handful of times. Each replacement involved creating a new node pool, draining the old one, then deleting it. 
This process required human coordination and multiple steps in our GitOps workflow.</p></li><li><p>While updating the Kubernetes version, we learned that AKS&#8217;s in-place node pool upgrade tends to enter an endless retry loop when it encounters an application whose PodDisruptionBudget allows 0 disruptions. Since AKS <a href="https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/operationnotallowed">permits</a> only one concurrent node pool update operation per cluster, this blocks operations on other node pools, including manual scale-up. Consequently, we had to fall back to the multi-step node pool replacement process for upgrades as well.</p></li></ol><p>We implemented a node pool Kubernetes operator, which introduces a single <code>Nodepool</code> resource encapsulating the multi-step process for replacing a node pool. Under the hood, the operator creates a new node pool, drains the old one, then deletes it, in a process completely hidden from users. From the user&#8217;s perspective, all node pool manipulations are done in place with a single GitOps change. 
This end-to-end automation is especially powerful during Kubernetes version upgrades.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PjJV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PjJV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 424w, https://substackcdn.com/image/fetch/$s_!PjJV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 848w, https://substackcdn.com/image/fetch/$s_!PjJV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!PjJV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PjJV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png" width="1456" height="1188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:263282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PjJV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 424w, https://substackcdn.com/image/fetch/$s_!PjJV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 848w, https://substackcdn.com/image/fetch/$s_!PjJV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!PjJV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3bac98-3e65-43c6-867b-0e59195ecec0_1456x1188.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The only limitation was that we couldn&#8217;t link the new <code>Nodepool</code> resource to the existing hierarchy of Cluster API&#8217;s <code>MachinePool</code> or CAPZ&#8217;s <code>AzureManagedMachinePool</code> resource using <code>ownerReference</code>. Instead, these resources exist side by side and are linked using <code>objectReference</code>. 
This drawback becomes apparent when deleting a node pool entirely, as the resource hierarchy is removed using <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/">finalizers</a>.</p><h1>Conclusion</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gP5W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gP5W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 424w, https://substackcdn.com/image/fetch/$s_!gP5W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 848w, https://substackcdn.com/image/fetch/$s_!gP5W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 1272w, https://substackcdn.com/image/fetch/$s_!gP5W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gP5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png" width="1456" height="534" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:534,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gP5W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 424w, https://substackcdn.com/image/fetch/$s_!gP5W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 848w, https://substackcdn.com/image/fetch/$s_!gP5W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 1272w, https://substackcdn.com/image/fetch/$s_!gP5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbad4eef-1625-4a4a-9aae-5e7a2606a94d_1914x702.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: Complete cluster management stack.</figcaption></figure></div>
<p>Evolving our cluster management toolchain empowered us to manage twice as many clusters and applications with the same handful of engineers on our Core Infrastructure team. This marks a significant boost in operational efficiency, made possible by our steadfast commitment to GitOps principles and the elimination of human-in-the-loop processes. Along the same lines, self-healing, drift detection, and similar capabilities are first-class concepts on our platform. 
This mindset allows us to always <em>prioritize organizational needs</em> by leveraging the extensible nature of Kubernetes, while balancing efficiency, reliability and agility.</p><p>Reflecting on our journey, we've had some key learnings and missteps:</p><ol><li><p><strong>Big Bold Bets</strong>: When we explored the industry state of the art for cluster management, we couldn&#8217;t find anything that met the level of automation we were striving for. At such times, we leaned on our company&#8217;s core value: Big Bold Bets. The bet on Cluster API was fruitful in the long run.</p></li><li><p><strong>Hidden cost of Open Source</strong>: Strategic partnerships and appropriate community engagement were vital to making our open source collaboration a success. We learned that a long-term commitment is necessary to effectively steer an open source project so that it continues to meet our needs. We remain committed to contributing to Cluster API for the foreseeable future.</p></li><li><p><strong>First Adopter Risks</strong>: A couple of years ago we experienced a Sev1 incident lasting several hours in which 60% of the nodes on our production clusters were wiped out. We traced it to a bug in CAPZ where just the sequence number suffix was used to identify nodes instead of the full <code>spec.providerID</code>. This caused Cluster API to treat nodes from different node pools within a cluster as sharing the same ID and, in turn, to delete them.</p></li><li><p><strong>Automation Enhances Reliability</strong>: Despite not being the primary focus, our automation efforts significantly improved system reliability. We&#8217;ve successfully completed 4 incident-free Kubernetes version upgrades. Previously, there was at least one internal incident per version upgrade.</p></li><li><p><strong>Kubernetes Operator Pattern</strong>: We leverage Kubernetes operators extensively in our infrastructure organization. 
Adhering to standard design patterns like goal state reconciliation enables us to sidestep design debates, mitigate concerns about corner cases, and minimize the learning curve. This, along with using managed Kubernetes distributions exclusively, was the right tradeoff for us. We remain committed to this approach today.</p></li></ol><p>So, what lies ahead for us? We've already kicked off a strategic plan to get us to the next level of scale. This involves automatically partitioning workload clusters, considering factors such as API Server pressure and node size. We've also started provisioning more single-tenant clusters. Furthermore, efforts are underway to shorten the time it takes to prepare workload-ready clusters, including steps like IP address allocation and installing cluster addons. All these initiatives are guided by our commitment to minimizing human intervention and achieving full automation. With this roadmap, we aim to manage over 500 Kubernetes clusters efficiently.</p><p><em>We would like to acknowledge C&#233;cile Robert-Michon, David Tesar, and Jack Francis at Microsoft. 
Each contributed during various phases of the project.</em></p>]]></content:encoded></item><item><title><![CDATA[Why Our Food Prep Time Prediction Works Better]]></title><description><![CDATA[Our prediction model improves upon estimates from delivery companies by leveraging additional prep state transitions.]]></description><link>https://techblog.atoms.co/p/food-prep-time-prediction</link><guid isPermaLink="false">https://techblog.atoms.co/p/food-prep-time-prediction</guid><pubDate>Tue, 19 Mar 2024 13:06:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dLbN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dLbN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dLbN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 424w, https://substackcdn.com/image/fetch/$s_!dLbN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 848w, https://substackcdn.com/image/fetch/$s_!dLbN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dLbN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dLbN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png" width="1456" height="764" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:764,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dLbN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 424w, https://substackcdn.com/image/fetch/$s_!dLbN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 848w, https://substackcdn.com/image/fetch/$s_!dLbN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dLbN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0dd6686-73f8-49be-9958-3e4624239bd3_1600x840.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>This post was written by Zhuyi Xue and Suming J. Chen, who led the development of prep time prediction.</em></p><p>Accurate food prep time prediction has a variety of use cases for food delivery orders. 
For instance, delivery service providers (DSPs) like <a href="https://www.ubereats.com/">Uber Eats</a> and <a href="https://www.doordash.com/">DoorDash</a> need prep time prediction in order to provide estimates for when food will be delivered to their customers, and to better determine when to dispatch couriers to pick up the food. While DSPs have been routinely predicting food prep time, CloudKitchens has an advantage in such prediction because we have more comprehensive insight into kitchen operations. We have deployed our own prep time prediction model that has led to more streamlined logistics (i.e., customers get fresher food and couriers wait less) and fewer customer complaints.</p><p>In this post, we show how we predict prep time for kitchens at CloudKitchens facilities and describe the engineering challenges and product decisions made throughout the process.</p><h1>Food prep time prediction</h1><h2>Problem Description</h2><p>For each order, kitchens are required to send prep estimates to DSPs. These estimates are used not only by DSPs, but also by internal facility staff to coordinate logistics. Initially, this value was set manually and usually fixed (e.g. 0/5/10 minutes), regardless of how busy the kitchen was or what items the order contained, leading to logistics inefficiencies and customer complaints.</p><p>Intuitively, we know that how long an order takes to complete depends on a variety of factors (kitchen busyness, order size), but it&#8217;s not feasible for kitchen staff to accurately and consistently provide order-specific prep time estimates.</p><p>At CloudKitchens, we have valuable kitchen data: real-time monitoring of every kitchen&#8217;s current orders in progress, as well as when each order is complete. 
Leveraging this data, we developed a machine learning (ML) solution for predicting prep time that significantly outperforms heuristic approaches.</p><h2>Methodology</h2><p>Following the <a href="https://developers.google.com/machine-learning/guides/rules-of-ml">best practices for ML engineering</a>, we did not adopt an ML solution at the outset. Instead, we used each kitchen&#8217;s median historical prep time as the prediction, which let us confirm that the infrastructure (e.g. training pipeline, gRPC service, metrics, monitoring) was set up correctly (Figure 1).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YyEx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YyEx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 424w, https://substackcdn.com/image/fetch/$s_!YyEx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 848w, https://substackcdn.com/image/fetch/$s_!YyEx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!YyEx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YyEx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png" width="1408" height="1184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YyEx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 424w, https://substackcdn.com/image/fetch/$s_!YyEx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 848w, https://substackcdn.com/image/fetch/$s_!YyEx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!YyEx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c078f5-e1bd-4d71-9c0b-f692959ae02b_1408x1184.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: Architecture of the prep time model, for both model training and serving.</figcaption></figure></div><p>Once the end-to-end heuristic model was deployed, we switched to a linear model with a set of simple features that can be derived from the order itself, e.g. the number of items, the subtotal, and the time of day. Next, we moved on to a gradient boosting model using <a href="https://lightgbm.readthedocs.io/en/stable/">LightGBM</a>. We chose LightGBM for its native handling of <a href="https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support">categorical</a> features (e.g. 
kitchen ID), which tends to outperform <a href="https://en.wikipedia.org/wiki/One-hot">one-hot-encoding</a>-based approaches. We also tried a neural network model, but it did not outperform gradient boosting in our experiment, so we shifted our focus to engineering more advanced features.</p><p>Two such features that improve model performance significantly are:</p><ol><li><p>Average prep time of trailing orders (<code>avg_trailing_prep_time</code>) for a kitchen. The trailing orders are the most recently completed orders before the order we&#8217;re predicting prep time for.</p></li><li><p>Total number of orders <em>currently</em> being prepared in the kitchen across all DSPs (<code>num_queued_orders</code>), a proxy for kitchen busyness.</p></li></ol><p>Both features are available to CloudKitchens kitchens but hidden from DSPs. What distinguishes them from the simple features is that they&#8217;re highly dynamic, as they depend on the states of other orders from the same kitchen. The <code>avg_trailing_prep_time</code> feature needs to be updated whenever an order finishes cooking, and the <code>num_queued_orders</code> feature needs to be updated whenever a new order comes in or an existing order finishes cooking. 
This dynamic nature makes it challenging to keep features consistent between model training and serving, as discussed further in the Engineering Challenges section.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u51J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u51J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 424w, https://substackcdn.com/image/fetch/$s_!u51J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 848w, https://substackcdn.com/image/fetch/$s_!u51J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 1272w, https://substackcdn.com/image/fetch/$s_!u51J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u51J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png" width="414" height="457.24" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cba703b4-9864-4f07-ba00-68badfb059df_900x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:994,&quot;width&quot;:900,&quot;resizeWidth&quot;:414,&quot;bytes&quot;:54733,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u51J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 424w, https://substackcdn.com/image/fetch/$s_!u51J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 848w, https://substackcdn.com/image/fetch/$s_!u51J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 1272w, https://substackcdn.com/image/fetch/$s_!u51J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcba703b4-9864-4f07-ba00-68badfb059df_900x994.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2. Progression of model and feature complexity.</figcaption></figure></div><p>As we progressed towards more complex models and features (Figure 2), model performance improved consistently. Compared to the manual estimates, our ML-based solution reduces the mean absolute error (MAE) by <strong>42%</strong>. We chose MAE over mean squared error (MSE) as the main metric because it is more robust to outliers.</p><h2>Engineering challenges</h2><h3>Feature consistency</h3><p>Calculating features like <code>avg_trailing_prep_time</code> and <code>num_queued_orders</code> requires tracking all recent order states in a kitchen. Initially, we calculated such features in Python at training time, whereas the model client used at serving time was written in Java. 
As a result, the same logic was implemented twice in different languages, which almost always led to some feature skew between training and serving and did not scale as we added features.</p><p>To deal with the feature skew, we decided to use a feature store. We evaluated multiple open-source feature stores, but each had drawbacks, e.g. they could not integrate with the multi-regional technologies CloudKitchens uses, or they lacked SDK support for our stack.</p><p>In addition, we needed a system that could handle the complex interdependencies between features that arise from their highly dynamic nature. Specifically, we required a setup where an update to one feature could automatically trigger updates to related features across different entities. For example, when a kitchen completes an order faster than anticipated, this event could change our predicted prep times for pending orders in that kitchen, which in turn might affect the kitchen&#8217;s <a href="https://melrosefoodco.com/">direct-to-consumer ranking</a> relative to other kitchens, itself a feature. To address this, we built our own feature store using <a href="https://www.cockroachlabs.com/">CockroachDB</a> (Figure 3), which we chose for its resiliency and scalability. 
This feature store was designed to support the cascading updates of feature values that our data environment demanded.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mgxS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mgxS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 424w, https://substackcdn.com/image/fetch/$s_!mgxS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 848w, https://substackcdn.com/image/fetch/$s_!mgxS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 1272w, https://substackcdn.com/image/fetch/$s_!mgxS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mgxS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png" width="1456" height="780" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86205,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mgxS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 424w, https://substackcdn.com/image/fetch/$s_!mgxS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 848w, https://substackcdn.com/image/fetch/$s_!mgxS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 1272w, https://substackcdn.com/image/fetch/$s_!mgxS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec3c9ceb-6ce5-43a2-9f69-1c59319d55c8_1568x840.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 3. A typical event-driven feature store versus the feature store we built.</figcaption></figure></div><h3>Automatic model refresh</h3><p>As CloudKitchens expands rapidly, new kitchens continue to be added, so the prep time model needs to be retrained periodically on the most recent data to account for them. In addition, prep time patterns in existing kitchens evolve with events such as cooking appliance upgrades or additions, menu updates, and staff churn, which also calls for periodic model refreshes.</p><p>To make model refresh scalable to additional logistics models, we implemented an automatic model refresh strategy (Figure 4). Suppose the incumbent model (M<sub>i</sub>) is trained on data between t<sub>0</sub> and t<sub>3</sub>, and some new data is collected between t<sub>3</sub> and t<sub>5</sub>. 
To train and evaluate a new model, we follow these steps:</p><ol><li><p>Split the data between t<sub>1</sub> and t<sub>5</sub> into three parts: fit set (t<sub>1</sub>-t<sub>2</sub>), validation set (t<sub>2</sub>-t<sub>4</sub>), and test set (t<sub>4</sub>-t<sub>5</sub>). Note that we keep t<sub>5</sub> - t<sub>3</sub> = t<sub>1</sub> - t<sub>0</sub>, so the new window (t<sub>1</sub> to t<sub>5</sub>) has the same length as the incumbent&#8217;s (t<sub>0</sub> to t<sub>3</sub>). This avoids training on very old data: contrary to the common impression that more data is always better, our research shows that very old data hurts model performance when evaluated on the most recent orders, which indicates that prep time patterns evolve quickly.</p></li><li><p>Fit models on the fit set, evaluate them on the validation set during hyperparameter tuning, and pick the best hyperparameters (HP).</p></li><li><p>Train a candidate model for comparison (M<sub>c</sub>) on the fit + validation sets with HP.</p></li><li><p>Calculate the metrics for both M<sub>i</sub> and M<sub>c</sub> on the test set, and compare them.</p></li><li><p>If M<sub>c</sub> outperforms M<sub>i</sub>, train a new production-ready model M<sub>p</sub> on the fit + validation + test sets with HP.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!08Gq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!08Gq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 424w, 
https://substackcdn.com/image/fetch/$s_!08Gq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 848w, https://substackcdn.com/image/fetch/$s_!08Gq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 1272w, https://substackcdn.com/image/fetch/$s_!08Gq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!08Gq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png" width="1408" height="738" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!08Gq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 424w, 
https://substackcdn.com/image/fetch/$s_!08Gq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 848w, https://substackcdn.com/image/fetch/$s_!08Gq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 1272w, https://substackcdn.com/image/fetch/$s_!08Gq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F627ff3ba-cf56-48d6-a661-20baf0286c46_1408x738.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 4. Vertical arrows indicate timestamps used when designing the auto refresh strategy. Horizontal arrows label the time interval between two timestamps. Rectangles indicate the data used for training the incumbent or the new model.</figcaption></figure></div><p>Note that the new data collected between t<sub>3</sub> and t<sub>5</sub> must be split into two parts: 1) if all of it were used for training, we would have no data for testing; and 2) if all of it were used for testing, the new model could only be trained on the same data as the incumbent model, or a subset of it, and would be unlikely to show any improvement.</p><p>Independently of model training, the model service periodically attempts to load the latest production-ready model, in a cron-like fashion, so that it can make use of M<sub>p</sub> shortly after it becomes available. We also have monitoring in place that alerts us if the incumbent model gets too old for unexpected reasons.</p><h1>Product Decisions</h1><h3>Quantile prediction</h3><p>Some of our products require quantile predictions in addition to a point estimate of the mean/median prep time. For instance, our algorithm for deciding when to send a scheduled order to the kitchen for preparation is based on the p25 and p75 quantiles of the prep time estimates. 
To accommodate the additional quantiles, we have opted for <a href="https://en.wikipedia.org/wiki/Quantile_regression">quantile regression</a> for model training, which uses a so-called pinball loss:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zwJB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zwJB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 424w, https://substackcdn.com/image/fetch/$s_!zwJB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 848w, https://substackcdn.com/image/fetch/$s_!zwJB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 1272w, https://substackcdn.com/image/fetch/$s_!zwJB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zwJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png" width="546" height="67.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:180,&quot;width&quot;:1456,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:110681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zwJB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 424w, https://substackcdn.com/image/fetch/$s_!zwJB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 848w, https://substackcdn.com/image/fetch/$s_!zwJB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 1272w, https://substackcdn.com/image/fetch/$s_!zwJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1804b0f6-e2b2-4e16-83be-1a781780d876_3096x382.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where t is the ground truth label, <strong>x</strong> is the feature vector, y is the model, I is the indicator function, and &#964; &#8714; (0, 1) is the quantile level of interest. When &#964; = 0.5, pinball loss is equivalent to MAE loss. We evaluate our quantile regression models by checking their average calibration, e.g. 
a p25 model should produce predictions that are at or above the ground truth labels about 25% of the time. While exploring alternatives, we also considered <a href="https://github.com/stanfordmlgroup/ngboost">NGBoost</a>, which generates a probabilistic forecast based on a pre-assumed parametric distribution (e.g. lognormal). However, NGBoost took significantly longer to train a model on our dataset, rendering it less practical. Besides being more scalable, quantile regression offers the advantage of not imposing any assumptions about the form of the distribution. A drawback of this approach is the need to train a separate model for each quantile, so p25, p50, and p75 correspond to three distinct LightGBM models. Given our focus on only a few quantiles, quantile regression remains a suitable choice. We view quantile regression as a compromise between point estimation and fully probabilistic forecasting.</p><h3>Remaining prep time prediction</h3><p>When an order is first received, we always make an initial prep time prediction to return to the DSP. However, our belief in this prep time is fluid, meaning that events over time may change our initial estimate. For instance, we may have initially predicted that an order would take 15 minutes, but 10 minutes into food prep the kitchen may have become much busier, and we can take the updated state into account to make a more accurate prediction for the remaining prep time. 
Our motivation for this is to ensure that at any point in time, the most accurate estimate is available &#8211; important for use cases like planning robot routing for order conveyance as well as keeping couriers updated on the pick up time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fNBD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fNBD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 424w, https://substackcdn.com/image/fetch/$s_!fNBD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 848w, https://substackcdn.com/image/fetch/$s_!fNBD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 1272w, https://substackcdn.com/image/fetch/$s_!fNBD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fNBD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png" width="1456" height="430" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fNBD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 424w, https://substackcdn.com/image/fetch/$s_!fNBD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 848w, https://substackcdn.com/image/fetch/$s_!fNBD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 1272w, https://substackcdn.com/image/fetch/$s_!fNBD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cbb9e45-e61a-496c-a287-6a0be6c52874_1984x586.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 5. Horizontal bars show the prep time of each order. The blue bar is the order we are sampling training examples from; gray bars are other orders from the same kitchen. Vertical arrows show the timestamps we sample at: t1 and t4 are sampled when an existing order finishes cooking, and t2, t3, t5 and t6 are sampled when a new order is received.</figcaption></figure></div><p>We solve this problem by:</p><ol><li><p>Adding a feature that measures how much time has passed since the order was received (<code>time_since_order_received</code>). As illustrated in Figure 5, at t<sub>0</sub> the value of <code>time_since_order_received</code> is 0, at t<sub>1</sub> it is t<sub>1</sub> - t<sub>0</sub>, and so on.</p></li><li><p>Sampling additional training examples whenever there is a change in kitchen state.
An example is sampled whenever a new order arrives (t<sub>2</sub>, t<sub>3</sub>, t<sub>5</sub>, t<sub>6</sub>) or an existing order is completed (t<sub>1</sub>, t<sub>4</sub>), so in Figure 5 seven examples are obtained from the order instead of one. Using the aforementioned feature store, it is straightforward to sample the training data at different timestamps within an order&#8217;s prep time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!76cr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!76cr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 424w, https://substackcdn.com/image/fetch/$s_!76cr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 848w, https://substackcdn.com/image/fetch/$s_!76cr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 1272w, https://substackcdn.com/image/fetch/$s_!76cr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!76cr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png" 
width="1230" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1230,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!76cr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 424w, https://substackcdn.com/image/fetch/$s_!76cr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 848w, https://substackcdn.com/image/fetch/$s_!76cr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 1272w, https://substackcdn.com/image/fetch/$s_!76cr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb6cda3a-a416-45af-995c-39bd6b1467df_1230x825.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 6. Predicting the remaining prep time directly avoids zero or negative predictions.</figcaption></figure></div></li></ol><p>As shown in Figure 6, at any point in time we <em>could</em> simply estimate the remaining prep time by subtracting the elapsed time from the initial predicted prep time of ~20 minutes (orange line). However, once 20 minutes have passed, the only options are to keep showing 0 minutes remaining or to show negative numbers. If we instead use the remaining prep time model, we can continue to refine the prediction all the way until the order is complete (red line).
The remaining prep time model is 10% more accurate than the baseline approach.</p><h1>Conclusion</h1><p>We described how we leverage our unique data to predict food prep time 42% more accurately than the naive baseline. We also discussed our approach to engineering challenges such as train/serve feature consistency and automatic model refresh, and how we made this feature more powerful by adding quantile predictions and the ability to predict the remaining prep time. In the future, with the addition of new technologies and increased capacity to capture the state of the kitchen, we expect to continue to improve model quality.</p>]]></content:encoded></item></channel></rss>