How I built OctoStudio powered by OctoAI
We launched the OctoAI Image Gen Solution this week, the fastest and most customizable GenAI stack for production-grade image generation applications. One application area where we’re already seeing lots of interest is e-commerce – specifically generating high-quality product images — and I thought it would be fun to build a showcase of how OctoAI can power e-commerce use cases like product placement. In this blog, I’m going to walk you through how I built OctoStudio, a sample app that generates 100s or even 1000s of custom images of your product in minutes.
I took inspiration from AI-powered creative tooling for advertisers and retailers that came from Amazon and Google. These new tools help small businesses more easily connect with online shoppers by harnessing the power of Generative AI to customize product photography.
But I wanted to take the idea up a notch. What if, instead of just simple background replacement, we could fully leverage the flexibility of GenAI models (e.g. SDXL, Llama) to place the product in highly customized settings? What if I could display the product from different angles? What if I could have very fine control over the person who is wearing the item? Or over what’s displayed in the background of the photo shoot? And what if I could build this on open models and technologies, with full control to change the models over time and no restrictive lock-in? That would be pretty cool.
The key objective behind this exercise was to create a blueprint for a modern, multimodal GenAI e-commerce app. One that delivers a high degree of customization for a target demographic and for the time of year or cultural event the product is being marketed around. From a small collection of photographic assets, I can amplify and specialize imagery to a degree never possible before. And all of this with production-grade speed, scalability, and reliability out of the box.
We will use a number of generative AI capabilities under the covers, including the SDXL model, fine-tuning, and the Llama 2 large language model. But before we start — let me give you a sneak preview of a sample outcome:
These are images of “my product” (the branded virtual reality headset) placed on a person in multiple background settings and taken at multiple angles, all powered by generative AI. Let me now walk you through the steps in building this application.
Customization - create your fine-tuning assets and add them to the asset library
The first step on this journey is to create your fine-tuning assets — the building blocks you use to customize your images.
Let’s start by looking at Stable Diffusion XL (SDXL), the gold standard for text-to-image generation. Here I’m asking SDXL to generate an image of someone wearing a branded VR headset. As you can see, SDXL out of the box really struggles with this — that’s likely because this specific product wasn’t in its training set.
In order to teach the model how to generate images of our product, I can train a Low-Rank Adaptation (LoRA) of SDXL. LoRA was initially conceived for fine-tuning large language models, but has since been adopted in other areas like image generation. You can think of a LoRA as a small software patch that you apply to your large model. With the OctoAI Image Gen Solution you’ll be able to create and apply your own LoRA fine-tunes by providing a small set of images featuring your product from different angles.
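The “software patch” analogy can be made concrete in a few lines of code. A LoRA learns two small matrices, A and B, whose product forms a low-rank update that is added to a frozen weight matrix W at inference time. Here is a minimal, dependency-free sketch of that idea (the matrices and scaling factor are toy values for illustration, not anything from a real SDXL checkpoint):

```python
# Minimal sketch of the LoRA idea: the fine-tune learns a low-rank
# update B @ A that is added to a frozen weight matrix W at inference.
# Matrices are plain lists of lists to keep the example dependency-free.

def matmul(X, Y):
    """Multiply two matrices represented as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def apply_lora(W, A, B, alpha=1.0):
    """Return W + alpha * (B @ A): the 'patched' weight matrix."""
    delta = matmul(B, A)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# A 4x4 frozen weight matrix patched by a rank-1 update:
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
B = [[1], [0], [0], [0]]   # 4x1
A = [[0, 0, 0, 2]]         # 1x4
W_patched = apply_lora(W, A, B, alpha=0.5)
```

Because only A and B are trained (a tiny fraction of the full model’s parameters), the resulting fine-tune is small, cheap to produce, and easy to swap in and out per product.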
Once the LoRA is trained, I can apply it to SDXL via OctoAI’s asset orchestrator: this lets us now generate images featuring the branded VR headset. Let’s take a look at the results. The difference is like night and day!
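To give a feel for what applying a trained LoRA at generation time might look like, here is a sketch of building such a request. The field names, model id, and asset name below are illustrative assumptions, not the exact OctoAI API shape — check the OctoAI docs for the real request format:

```python
# Sketch of applying a product LoRA at generation time. The field names,
# checkpoint id, and asset name are illustrative assumptions, not the
# exact OctoAI API -- consult the official docs for the real shape.
import json

def build_generation_request(prompt, lora_name, lora_weight=0.8):
    """Build a hypothetical SDXL generation payload with a LoRA applied."""
    return {
        "prompt": prompt,
        "checkpoint": "sdxl-base-1.0",      # assumed model id
        "loras": {lora_name: lora_weight},  # asset name -> strength
        "width": 1024,
        "height": 1024,
        "steps": 30,
    }

payload = build_generation_request(
    "photo of a person wearing a branded VR headset, studio lighting",
    lora_name="vr-headset-lora",  # hypothetical asset name
)
print(json.dumps(payload, indent=2))
```

The LoRA weight acts as a dial: lower values let more of the base model’s style through, higher values push the output closer to your fine-tuning images.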
You can train a new LoRA for each product that you want to showcase. I’ve trained a second one for the OctoAI shirt we used at the launch of the OctoAI compute service in June.
The amazing thing is, even though this OctoML shirt LoRA was only trained on adults, I can now (via the flexibility of SDXL) generate the shirt in kids’ sizes too! This is so much more powerful than performing manual background replacement.
Power of the platform — mix and match leading GenAI models to build your experience
You may think that OctoAI only offers APIs to text-to-image models like SDXL, but that couldn’t be further from the truth! OctoAI offers an extensive array of other GenAI models that you can use, including the Llama 2 model family.
Specifically, you can build very powerful workflows when you combine a text-to-image model like SDXL with an LLM like Llama 2. For OctoStudio, I thought it would be really neat to rely on Llama 2 to generate background suggestions for product photography according to a given theme.
See how Llama 2 thinks on its feet when I ask it to generate 5 different backdrop suggestions for Holi, a “festival of colors” from the Indian subcontinent. Typically these are prompts I’d have to think up myself and then manually enter into SDXL. By relying on Llama 2, I can quickly and easily explore a range of generation ideas based on a single theme, Holi. These backdrop suggestions, as you can see, are quite on point. And relying on Llama 2 really accelerates the process of creating a gallery with minimal user intervention.
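A sketch of this step: prompt the LLM for a numbered list of backdrops, then parse its reply into individual prompts you can feed to SDXL. The prompt wording, the parser, and the sample reply below are all illustrative assumptions, not the exact prompt OctoStudio uses:

```python
# Sketch of prompting an LLM for themed backdrop ideas and parsing the
# numbered list it returns. Prompt wording, parser, and the example
# reply are assumptions for illustration.
import re

def backdrop_prompt(theme, n=5):
    return (f"Suggest {n} distinct photographic backdrop descriptions for "
            f"product photography themed around {theme}. "
            f"Reply as a numbered list, one backdrop per line.")

def parse_numbered_list(text):
    """Extract items like '1. A courtyard ...' from the model's reply."""
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+[\.\)]\s*(.+)$", text, re.MULTILINE)]

# Example reply a model might produce for the theme "Holi":
reply = """1. A courtyard filled with clouds of pink and yellow powder
2. A street celebration with dancers covered in bright colors
3. A marigold-decorated archway under a clear blue sky"""
backdrops = parse_numbered_list(reply)
```

Each parsed backdrop then becomes one SDXL prompt variation, so a single theme fans out into a whole gallery of candidate scenes.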
By composing with SDXL and Llama 2 I’ve created a unique “model cocktail” that can work across data of different modalities.
SDXL + LoRA lets us generate the initial set of pictures featuring the product worn by a specific model, which the user gets to specify (e.g. Indian male in his 30s).
Llama 2 then suggests different backgrounds to cycle through, based on a theme that the user specifies (e.g. “Holi”).
With that list of background suggestions and the series of initial photos in hand, we perform image amplification via background replacement, using a GenAI pipeline similar to the ones Amazon and Google use: image segmentation followed by in-painting with SDXL and a depth ControlNet.
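The three steps above can be sketched as plain Python, with each model call stubbed out. The function names are illustrative glue code, not OctoAI API calls; the point is the shape of the pipeline, where every base shot is crossed with every suggested backdrop:

```python
# Sketch of the "model cocktail" pipeline with each model call stubbed.
# Function names are illustrative, not OctoAI API calls.

def generate_base_images(product_lora, subject, n=2):
    # Would call SDXL + the product LoRA; stubbed as labels here.
    return [f"base[{product_lora}|{subject}|{i}]" for i in range(n)]

def suggest_backgrounds(theme, n=3):
    # Would call Llama 2 for themed backdrop ideas; stubbed.
    return [f"{theme}-backdrop-{i}" for i in range(n)]

def replace_background(image, background):
    # Would run segmentation, then SDXL in-painting guided by a depth
    # ControlNet; stubbed as a composed label.
    return f"{image}+{background}"

def amplify(product_lora, subject, theme):
    """Cross every base shot with every suggested backdrop."""
    bases = generate_base_images(product_lora, subject)
    backdrops = suggest_backgrounds(theme)
    return [replace_background(b, bg) for b in bases for bg in backdrops]

gallery = amplify("vr-headset-lora", "Indian male in his 30s", "Holi")
```

The multiplicative fan-out is what makes this amplification: 2 base shots and 3 backdrops already yield 6 images, and the same structure scales to dozens.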
As a developer, you too can think of and chain together your unique model cocktail here for a highly tailored GenAI multimodal pipeline that works for your specific needs and for your product placement outcomes.
Truly unlocking GenAI — from 10s to 100s to 1000s of products
I wanted to not just show a one-off demo, but to really inspire developers to build tooling that can scale to serve the most demanding e-commerce companies. When the cost of generating a highly custom photo drops to a fraction of a penny, it drastically changes the way we think about product photography.
But to take advantage of that low cost, and take product photography asset generation to another level (say an e-commerce giant wanting to offer this service to their 10s or 100s of thousands of retailers), you need a solid and scalable infrastructure to build on top of.
That’s what I wanted to showcase in action: OctoStudio generating a gallery of dozens of high-quality images in less than a minute. Mind you, there are a lot of models involved here that have to be invoked sequentially, so achieving this level of speed is no small feat. This is all built on OctoML’s deep expertise in AI systems, combined with the extensive engineering work the team has done to ensure speed and reliability in GenAI model execution.
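One ingredient of that speed, on the client side, is simply fanning out independent generation requests concurrently rather than waiting on them one by one. Here is a minimal sketch using Python’s standard thread pool; the `generate()` stub stands in for a real call to an image-generation endpoint:

```python
# Sketch of fanning out many independent generation jobs concurrently.
# generate() is a stub standing in for a real image-generation API call.
from concurrent.futures import ThreadPoolExecutor

def generate(job):
    # Real code would POST the job to the image-gen endpoint and
    # download the result; here we just echo the job description.
    return f"image-for:{job}"

jobs = [f"backdrop-{i}" for i in range(24)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # pool.map preserves input order, so results line up with jobs.
    gallery = list(pool.map(generate, jobs))
```

With network-bound calls, a pool like this keeps many requests in flight at once, which is how a gallery of dozens of images can land in under a minute rather than in a serial crawl.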
I’ll leave it to the developers out there and their unbounded imagination and creativity to build the next wave of e-commerce product photography tooling. I can’t wait to see what the future holds!
Get started today
You’re also welcome to join us on Discord to engage with the team and our community. We’ll use the Discord channel to share upcoming features, promotions, and competitions. Stay tuned to learn more, and I look forward to seeing the applications and imagery you build using GenAI.