Collaborative Control for Geometry-Conditioned PBR Image Generation
5 months ago·arXiv

Current 3D content generation approaches build on diffusion models that output RGB images. Modern graphics pipelines, however, require physically-based rendering (PBR) material properties. We propose to model the PBR image distribution directly, avoiding photometric inaccuracies in RGB generation and the inherent ambiguity in extracting PBR from RGB. Existing paradigms for cross-modal fine-tuning are not suited for PBR generation due to both a lack of data and the high dimensionality of the output modalities: we overcome both challenges by retaining a frozen RGB model and tightly linking a newly trained PBR model using a novel cross-network communication paradigm. As the base RGB model is fully frozen, the proposed method does not risk catastrophic forgetting during fine-tuning and remains compatible with techniques such as IPAdapter [76] pretrained for the base RGB model. We validate our design choices, robustness to data sparsity, and compare against existing paradigms with an extensive experimental section.

Keywords: Image Generation, Material Properties, Multi-Modal Generation, Physically-Based Rendering

The recent meteoric rise of diffusion models has made automated at-scale generation of high-quality RGB image content more accessible than ever. Continuing on this success, Text-to-Texture and Text-to-3D approaches successfully lifted the generation to 3D [35]. Yet to maximize the usefulness of the generated textures in downstream 3D workflows, they must be compatible with physically-based rendering (PBR) pipelines for proper shading and relighting. Current approaches to PBR texture generation rely on generated RGB images and subsequent PBR extraction through an inverse rendering process, facing physically inaccurate lighting in the generated RGB diffusion images as well as significant ambiguities in the inverse rendering. We propose a solution for geometry-conditioned generation of PBR images by modeling the joint distribution directly, avoiding the issues around photometric consistency and inverse rendering.

To model the distribution of non-RGB modalities, existing approaches typically fine-tune the weights of a base RGB model. Applied to PBR images, this implies either directly predicting the entire PBR image stack or sequentially predicting them conditioned on one another. Neither is sufficient for our use-case: jointly predicting the entire PBR image stack is problematic as the higher-dimensional modality does not compress well into the established latent spaces, and sequentially predicting the elements of the PBR image stack is significantly more expensive and risks compounding errors in the sequential generation. Furthermore, while state-of-the-art RGB diffusion models are trained on billions of images [54], there is unfortunately no dataset of such size at our disposal for PBR content generation. Instead, the largest available dataset of PBR content is Objaverse [9], containing around 800,000 objects with associated PBR textures. In light of the restricted training data available, fine-tuning the base model results in catastrophic forgetting, forfeiting generalizability, as we illustrate in Sec. 5.

Instead, we keep the pre-trained RGB image model frozen and train a parallel model to generate the PBR image stack, as shown in Fig. 2. Using our proposed cross-network control paradigm, we tightly link the PBR model to the frozen RGB model in order to leverage its expressivity and rich internal state. As a result, we are able to generate high-quality and diverse PBR content, even far out-of-distribution for the Objaverse dataset. Crucially, the frozen RGB model safeguards against catastrophic forgetting and remains compatible in a plug-and-play fashion with techniques such as IPAdapter [76]. In summary, we:

1. Propose the novel Collaborative Control paradigm to tightly link the PBR generator to a fully frozen pre-trained RGB model, modeling the joint distribution of RGB and PBR images directly (see Sec. 4.1),

2. Illustrate that the proposed control mechanism is data-efficient, and generates high-quality images even from a very restricted training set,

3. Ablate our design choices to show the improvement over existing paradigms in literature and the issues with existing approaches, and

4. Demonstrate the compatibility with IPAdapter [76] specifically.


Fig. 2: Collaborative Control. Two parallel models collaborate to generate pixel-aligned outputs of different modalities. We freeze the left pre-trained RGB model and train the right PBR model with its cross-network communication layers. Communication is achieved by concatenating the states of both models, processing them with a single linear layer for a residual update, and distributing the result back to the respective models. As discussed in Sec. 5, prompt cross-attention in the PBR model is counterproductive.

Generating natural images from text prompts. Natural image generation has a long history: from GANs [14, 25–28] and VAEs [30], to autoregressive models [47,66] and more recently diffusion models [18,59]. Although conditioning GAN networks on text prompts or other embeddings has proven difficult [50], recent work continues to improve GAN-based text-to-image generation [53]. The introduction of diffusion models [18,59], which model the generation process as the iterative inverse of a degradation process [58], was a breakthrough in the generative field: diffusion training is far more stable than typical GAN training, albeit slow and computationally expensive.

Unfortunately, these approaches require billions of images to train from scratch [54]. For PBR image generation, the largest commonly available dataset is Objaverse [9]: at 800,000+ objects it is still several orders of magnitude smaller than LAION-5B [54] and proves insufficient to train high-quality generative models from scratch (as illustrated by the fine-tuning baseline failing to generalize in Sec. 5). While pre-trained RGB models encode rich prior knowledge around structure, semantics, and materials [11,55,61], Sarkar et al. warn that the models are often still inaccurate when it comes to structure [52]. We argue that this extends to material properties: diffusion models prefer idealized and artistic appearances, rather than physically accurate scenes. Therefore, we now discuss how to extract relevant knowledge out of the powerful pre-trained RGB models.

Leveraging pre-trained RGB models to generate other modalities. The authors of Marigold [29] interpret depth maps as grayscale images to leverage a pre-trained LDM and its linked VAE, while the authors of DiffusionDepth [12] train a VAE and diffusion model from scratch. Alternatively, Zhao et al. [83] extract multiscale feature maps from a pre-trained diffusion model’s activations to decode for a variety of tasks, including depth prediction. LDM3D [60], in turn, shows that it is possible to compress the joint RGBD image into the original latent space’s dimension, roughly preserving the latent space so that fine-tuning the base diffusion model for RGBD diffusion is tractable. Finally, some works train LoRAs [19] to extract extra modalities one at a time from a pre-trained diffusion model [11,34].

None of these approaches are plausible for PBR image generation: compressing the PBR data together with RGB in the original latent space overloads the capacity of that low-dimensional latent space, and splitting the PBR image stack up into separate triplets that the diffusion model can directly predict is costly as they need to be generated sequentially to properly model the joint distribution. Wonder3D [42] and UniDream [41] perform joint RGB and normal diffusion using a cross-domain self-attention aligning the two parallel branches; yet the number of diffusion models scales linearly, and the cross-domain self-attention quadratically, with increasing number of output modalities. Our proposed approach differs in two key aspects from these multi-modal paradigms: (1) we do not fine-tune the base RGB model at all, reducing the risk of catastrophic forgetting as much as possible, and (2) we train a single parallel branch for all additional modalities jointly, reducing cost. For the latent encoding, we train a PBR-specific VAE to encode all PBR channels jointly.

Image-based conditioning of diffusion models is relevant to us for two reasons: the similarity between conditioning paradigms and our proposed Collaborative Control, and the fact that we tackle geometry-conditioned PBR image generation. Existing pixel-accurate control techniques come in two flavors: re-training of the base model with modified input and output spaces [12,29], and training of a parallel model that affects the base model’s state [10,20,82]; we have found that the former risks catastrophic forgetting of the base model’s expressiveness and quality. Similar to the latter techniques, we leave the base model’s weights fully frozen and residually consume and edit the internal states from a parallel model. In ControlNet-XS [10], the authors investigate the connectivity between the base model and the control model, concluding that a connection from the base model’s encoder to the controlling model’s encoder, and from there to the base decoder, provides sufficient information flow for optimal performance. Yet in both ControlNet and ControlNet-XS, the controlling model only influences the base RGB model’s output. In AnimateAnyone [20], the parallel model hooks into the state of the base RGB model to output its own new RGB image instead. For the purpose of our joint RGB-PBR diffusion model, our PBR branch both controls the base model (to keep it aligned during the iterative generation process), and generates its own PBR output based on the RGB model’s internal state; therefore, our proposed approach requires full bidirectional connections between both branches. Figure 5 contains a schematic overview of these methods. To condition on input geometry we adopt the choice by Ke et al. [29] of concatenating the conditioning to the PBR model’s inputs. As we are training the PBR model from scratch in any case, this does not incur additional cost.

Text-to-3D describes the task of generating full 3D models from text prompts, often with the aim to support downstream graphics pipelines such as game engines. Earlier methods leverage a pre-trained RGB diffusion model to extract direct appearance, typically leveraging Score Distillation Sampling [48] (SDS) to iteratively optimize a 3D representation by backpropagating the diffusion model’s noise predictions [15,22,36,38,43,44,57,63,64,69–71,74,84,85], or building on viewpoint-aware image models [21,39,40,42,49,56,79] to perform direct fusion. Other authors have investigated distilling a pre-trained image RGB model into a 3D generative technique [45,68]. These RGB methods ignore view-dependency of objects, often resulting in artifacts around highlights, and do not result in a representation that is useful in graphics pipelines. Work that does generate PBR properties does so by backpropagating the denoised predictions through a differentiable renderer [7,41,73,75,77]. For such methods, and inverse rendering in general, a major concern is lighting being baked into the material channels: HyperDreamer [73] introduces an ad-hoc segmentation [31]-based regularization loss on the albedo channel to reduce artifacts. By directly generating PBR content rather than going through an inverse rendering process, our proposed technique could resolve many of the issues related to the inverse rendering in the latter methods while retaining the simplicity of the former methods.

Text-to-Texture methods restrict the Text-to-3D problem to objects with known structure by conditioning the diffusion model on the object geometry [4,6,32,33,78,80,81]. Paint3D [80] also discusses the lighting artifacts typical with inverse rendering and introduces a custom post-processing diffusion model to alleviate these. Here, too, our proposed approach directly models and generates the full PBR image distribution, and could help these techniques side-step the issues with inverse rendering completely.

Evaluation metrics for generative methods compare the output distributions with known ground-truth distributions. This is typically done with the Inception Score [51] and the Fréchet Inception Distance [17], which compare hidden state distributions of Inceptionv3 [62] between both image sets because directly modeling the distributions of the high-dimensional images is not tractable. The authors of CMMD [23] argue that neither of these metrics is well suited to modern generative models, and propose a new metric that compares the distributions of CLIP embeddings of the generated images, showing that it aligns better with human observers, especially in respect to low-level image degradations.

Aside from comparing the modelled distributions with the ground truth distributions, we also wish to evaluate the general quality of the generated images for text prompts that are out of distribution for the training dataset. The CLIP score [16] compares the CLIP image embedding with the embedding of the prompt, indicating how well the text prompt was followed: whether all the relevant elements are represented and whether no extraneous elements were introduced. We also report the OneAlign aesthetics and quality metrics of the generated images [72], which have been shown to align well with human perception, to provide a more quantitative indication of quality.


Fig. 3: Bump map. Similar surface bumps in world space (left) are dissimilar in the UV tangent space (middle) because of the arbitrary UV mapping. Representing the bump map in a tangent space solely dependent on the geometry (right) resolves this issue.

Fig. 4: Rendering function. The dataset is constructed so that the lighting remains constant with respect to the camera, simplifying the rendering function $f_{\mathrm{rgb}}$: notice the similar highlight location.

PBR materials are a compact representation of the bidirectional reflectance distribution function (BRDF), which describes how light is reflected from the surface of an object. We use the popular Cook-Torrance analytical BRDF model [8], using specifically the Disney BRDF Basecolor-Metallic parametrization [3] as it inherently promotes physical correctness. In this parametrization, the BRDF comprises Albedo ($b_a \in \mathbb{R}^3$), Metallic ($b_m \in \mathbb{R}$), and Roughness ($b_r \in \mathbb{R}$) components. To increase realism during rendering beyond the resolution of the underlying geometry, graphics pipelines add small details such as wood grain or grout between tiles by encoding them in an additional bump map ($b_n \in \mathbb{R}^3$). As this bump map is typically defined in a tangent space based on an arbitrary UV-unwrapping, it entangles the surface property with this arbitrary UV mapping. Instead, we propose to predict the bump map defined in a tangent space based solely on the object geometry, disentangling the texture from the UV mapping as shown in Fig. 3. To construct this geometry tangent space for a point $p = [p_x, p_y, p_z]^T$ with geometry normal $n$, we construct the local tangent vector as $t = n \times ([-p_y, p_x, 0]^T \times n)$, corresponding to Blender’s Radial Z geometry tangent. The geometry tangent space is then constructed as $(t/\|t\|,\; n \times t/\|t\|,\; n)^T$.
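The tangent-space construction above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`geometry_tangent_frame`, `to_tangent_space`) are hypothetical, the normal is normalized defensively, and the degenerate case where the radial direction vanishes or is parallel to the normal is simply rejected.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as lists."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length < 1e-12:
        raise ValueError("degenerate vector: cannot normalize")
    return [x / length for x in v]

def geometry_tangent_frame(p, n):
    """Rows are (t/|t|, n x t/|t|, n) for point p with geometry normal n."""
    n = normalize(n)                      # assumption: n may not be unit length
    radial = [-p[1], p[0], 0.0]           # direction circling the Z axis
    # n x (radial x n) projects the radial direction onto the tangent plane.
    t_hat = normalize(cross(n, cross(radial, n)))
    b_hat = cross(n, t_hat)
    return [t_hat, b_hat, n]

def to_tangent_space(frame, v):
    """Express a world-space vector in the geometry tangent frame."""
    return [dot(row, v) for row in frame]

# A point on the +X side of an object whose normal also points along +X.
frame = geometry_tangent_frame([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
local = to_tangent_space(frame, [1.0, 0.0, 0.0])
```

In this frame the unperturbed geometry normal always maps to the third axis, so a flat surface has the same bump value everywhere regardless of how the mesh was UV-unwrapped.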

Diffusion models [18,58] iteratively invert a forward degradation process to generate high-quality images from pure noise (typically white Gaussian noise). Formally, the forward process iteratively degrades images from the data distribution $z_0 \sim p(z)$ to standard-normal samples $z_T \sim \mathcal{N}(0, I)$ over the course of $T$ degradation steps as $z_t \sim \mathcal{N}(\alpha_t z_{t-1}, (1 - \alpha_t) I)$, where $\alpha_t$ denotes the noise schedule for timestep $t$. Practically, the forward process can be condensed into the direct distribution $z_t \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, z_0, (1 - \bar{\alpha}_t) I)$ with the appropriate choice of $\bar{\alpha}_t$. The diffusion model $D$ is trained to sample the stochastic reverse process $D_t(z_t) \sim p(z_{t-1} \mid z_t)$ to iteratively generate $z_0$ from $z_T$.
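As a rough illustration of the condensed forward process, the sketch below draws $z_t$ directly from $z_0$. It assumes the common DDPM convention in which $\bar{\alpha}_t$ is the running product of the per-step $\alpha_s$ (the text above only asks for "the appropriate choice" of $\bar{\alpha}_t$), and the linear schedule with its endpoints is a hypothetical choice made purely for the example.

```python
import math
import random

def linear_alphas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule: alpha_t = 1 - beta_t."""
    if num_steps == 1:
        return [1.0 - beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [1.0 - (beta_start + i * step) for i in range(num_steps)]

def cumulative_products(alphas):
    """alpha_bar_t as the running product of alpha_1..alpha_t (assumed convention)."""
    result, product = [], 1.0
    for alpha in alphas:
        product *= alpha
        result.append(product)
    return result

def sample_z_t(z0, alpha_bar_t, rng):
    """Draw z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I) element-wise."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(max(0.0, 1.0 - alpha_bar_t))
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in z0]

rng = random.Random(0)
alpha_bars = cumulative_products(linear_alphas(1000))
z0 = [2.0] * 8                                   # toy "latent" with 8 entries
z_early = sample_z_t(z0, alpha_bars[9], rng)     # still close to z0
z_late = sample_z_t(z0, alpha_bars[-1], rng)     # close to standard-normal noise
```

Because $\bar{\alpha}_t$ shrinks towards zero over the schedule, early samples stay near the data while late samples are dominated by the noise term.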

We wish to train a PBR diffusion model $D_{\mathrm{pbr}}$ that models the reverse denoising process for PBR images as represented in the latent space of a VAE [50], representing the data distribution $p(z_{\mathrm{pbr}})$. We find that we lack the data required to train this model directly, and instead propose to model $p(z'_{\mathrm{rgb}} := f_{\mathrm{rgb}}(z_{\mathrm{pbr}}), z_{\mathrm{pbr}})$ based on an RGB diffusion model $D_{\mathrm{rgb}}$ for the RGB data distribution $p(z_{\mathrm{rgb}})$; $f_{\mathrm{rgb}}$ is a rendering function that projects the PBR images onto the RGB domain. To motivate this, we split the joint reverse process into two separate processes:
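Written out with the chain rule, this split takes the following form. The display is a reconstruction from the surrounding definitions (an RGB step conditioned on the PBR state, followed by a PBR step that sees the RGB context), so its exact typesetting and numbering as Eq. (1) are assumptions rather than a verbatim copy of the original equation.

```latex
\begin{equation}
p\left(z'_{\mathrm{rgb},t-1},\, z_{\mathrm{pbr},t-1} \,\middle|\, z'_{\mathrm{rgb},t},\, z_{\mathrm{pbr},t}\right)
= p\left(z'_{\mathrm{rgb},t-1} \,\middle|\, z'_{\mathrm{rgb},t},\, z_{\mathrm{pbr},t}\right)
\cdot
p\left(z_{\mathrm{pbr},t-1} \,\middle|\, z'_{\mathrm{rgb},t-1},\, z'_{\mathrm{rgb},t},\, z_{\mathrm{pbr},t}\right)
\tag{1}
\end{equation}
```

The first factor is handled by the frozen RGB model once its hidden states are steered by the PBR branch, and the second factor is the conditional denoising task left to the PBR model.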


The RGB model is implemented based on $D_{\mathrm{rgb}}(z_{\mathrm{rgb},t-1}) \sim p(z_{\mathrm{rgb},t-1} \mid z_{\mathrm{rgb},t})$: by adjusting the internal hidden states based on the PBR model’s state, we align the current RGB sample with the PBR sample and restrict it to $\mathrm{Im}(f)$. To simplify this alignment problem, the rendering function $f_{\mathrm{rgb}}$ uses fixed camera settings and a fixed environment map as shown in Fig. 4. The PBR model now no longer models $p(z_{\mathrm{pbr},t-1} \mid z_{\mathrm{pbr},t})$: it additionally has access to the RGB context $(z'_{\mathrm{rgb},t-1}, z'_{\mathrm{rgb},t})$, which simplifies the problem. The RGB and PBR models are in practice much more intertwined than Eq. (1) implies: this derivation serves mostly as an intuitive indication for why the joint problem is more tractable. Note that $z'_{\mathrm{rgb},t}$ is a degraded version of $z'_{\mathrm{rgb},0}$, and not a rendered version of $z_{\mathrm{pbr},t}$: the PBR model does not learn to do inverse rendering in degraded image space but rather learns to denoise PBR images given additional RGB context.

4.1 Collaborative Control

In summary, our proposed approach comprises two models working in tandem: a pre-trained RGB image model and a new PBR model, tightly linked to one another (see Fig. 5 for a high-level overview of our proposed control scheme). The previous section identifies two tasks for this cross-network communication: aligning the RGB model’s output with both the PBR model’s output and the image of the rendering function $\mathrm{Im}(f)$, and communicating knowledge in the RGB model to the PBR model. ControlNet [82] and ControlNet-XS [10] discuss solutions to the former control problem; the authors conclude that communication from the base model’s encoder to the controlling model’s encoder, and from the controlling model’s decoder to the base model’s decoder, is sufficient. AnimateAnyone [20] addresses the latter problem and concludes that, there, unidirectional communication from the left model to the right model is sufficient. We have found that for our problem, full bidirectional communication between both networks is crucial; we dub this Collaborative Control. See Fig. 5 for a visualization of these control schemes.


Fig. 5: High-level overview of communication paradigms in (a) ControlNet [82], (b) ControlNet-XS [10], (c) AnimateAnyone [20] and (d) our proposed Collaborative Control approach. Blue represents frozen blocks, while orange elements are optimized during training.

We implement the cross-network communication as a connecting layer between the two models after every self-attention module; its inputs are the concatenation of the model states and its outputs are residually distributed to both models again. During training, we only optimize the weights of the PBR model and the cross-network communication links against both models’ outputs, while the RGB model remains fully frozen. By adopting this approach, we safeguard the base RGB model’s weights, and do not risk catastrophic forgetting for that base model. As we discuss in Sec. 5, we have found that a single per-pixel linear layer is sufficient, although we also evaluate the other control schemes from Fig. 5 as well as an attention-based communication layer. Notably, we have also found that disabling the text cross-attention in the PBR model is crucial to out-of-distribution performance; we attribute this to overfitting on the restricted dataset, as this problem worsens with reduced training data. Only allowing prompt attention through the frozen RGB model prevents such overfitting.
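The communication layer described above can be sketched as follows: the two per-pixel state vectors are concatenated, passed through one shared linear map, and the result is split and added back to each branch. This is a toy sketch in plain Python under stated assumptions, not the authors' code: states are lists of per-pixel feature vectors, the function names are hypothetical, and nothing here says how the weights are initialized or trained.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def communicate(rgb_state, pbr_state, weights, bias):
    """One cross-network communication step.

    rgb_state, pbr_state: lists with one feature vector per pixel (same pixel order).
    weights, bias: a single linear layer mapping the concatenated state
                   (len_rgb + len_pbr) back to (len_rgb + len_pbr).
    Returns the residually updated (rgb_state, pbr_state).
    """
    new_rgb, new_pbr = [], []
    for rgb_pixel, pbr_pixel in zip(rgb_state, pbr_state):
        joint = rgb_pixel + pbr_pixel                 # concatenate both states
        delta = linear(weights, bias, joint)          # same layer for every pixel
        d_rgb = delta[:len(rgb_pixel)]
        d_pbr = delta[len(rgb_pixel):]
        new_rgb.append([s + d for s, d in zip(rgb_pixel, d_rgb)])
        new_pbr.append([s + d for s, d in zip(pbr_pixel, d_pbr)])
    return new_rgb, new_pbr

# Toy example: one feature per branch, two pixels, a layer that swaps information.
swap_weights = [[0.0, 1.0],   # RGB update reads the PBR feature
                [1.0, 0.0]]   # PBR update reads the RGB feature
swap_bias = [0.0, 0.0]
rgb_out, pbr_out = communicate([[2.0], [1.0]], [[3.0], [0.5]], swap_weights, swap_bias)
```

Note that the update changes the RGB branch's activations only; the RGB model's own weights are never touched, which is what keeps the base model frozen while still letting the PBR branch steer it.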

4.2 Implementation

Compressing PBR images into latent space. RGB diffusion models benefit immensely from a dedicated VAE to down-sample the images into a lower-dimensional latent space [50]. Existing solutions that generate an alternate modality typically encode that modality with the RGB VAE, but PBR images cannot be compressed into the same latent space due to the higher dimensionality. We could select channel triplets $b_a$, $[b_m, b_r, 0]$, and $b_n$ and process those with the RGB VAE, but we instead choose to train a dedicated PBR VAE: our ablation studies indicate that the distribution mismatch between the PBR channels and the RGB space is too large, and performance suffers. We adopt the VAE architecture and training code from StableDiffusion v1.5 [50], although following Vecchio et al. [67] we set the latent space channel count to 14 for the optimal balance between quality and compression when processing PBR images.

Conditioning on existing geometry. We concatenate the screen-space geometry normals to the PBR model’s inputs to condition the joint output. Referring to Fig. 5, Collaborative Control encapsulates the ControlNet scheme that would typically be used for this conditioning [10]: as we jointly train from scratch, this does not introduce additional cost.

Generating training data. Our dataset for training both the PBR VAE and the Collaborative Control scheme is based on Objaverse [9]: a dataset containing 800,000+ 3D models with annotations for what the models represent (describing both shape and texture). After sanitizing and filtering the dataset we retain roughly 300,000 objects. Each of the objects is rendered with Blender 2.35 from 16 viewpoints encircling the object using a fixed pinhole camera model and a fixed (camera colocated) environment map as in Fig. 4. For the evaluations in Sec. 5, we randomly leave out 2% of the generated images.

Training Collaborative Control. For most of the experiments in Sec. 5, ZeroDiffusion [37,65] is the base RGB model, a zero-terminal-SNR version fine-tuned from StableDiffusion v1.5 [50]. As Collaborative Control is agnostic to the base model, we also illustrate StableDiffusion v1.5 and v2.1 as base models in Sec. 5. We optimize the PBR model’s weights as well as the cross-network communication layers to minimize the training loss for the RGB and PBR denoising jointly, while keeping the RGB model fully frozen. In almost all cases, we train directly on the final output resolution of 512 × 512 for a total of 200,000 update steps with a batch size of 12 and a learning rate of 3e−5 (on one A100 with 80 GB of VRAM, in roughly two days). We evaluate the effect of a larger training budget by training on 8 A100s for the same number of steps, increasing the batch size by a factor of 8 without affecting training time; for environmental and cost purposes, the training budget is kept low for the main ablation study.

Distribution match metrics. As an evaluation of how well the data distribution is modeled, distribution match is considered a proxy to both quality and diversity. The Inception Score (IS [51]), which checks the distribution match against ImageNet, is not relevant in a PBR context as it applies only to RGB images. The Fréchet Inception Distance (FID [17]), which compares the distributions of the last hidden state of the Inceptionv3 [62] network on both real and generated images, has been found to better align to perceptual quality. Finally, the recently introduced CLIP Maximum-Mean Discrepancy (CMMD [23]) compares the distribution of the CLIP embeddings of generated images to that of a reference dataset. It offers significantly improved sample efficiency, and was shown by the authors to be a better indicator of low-level image quality than FID. However, as these metrics are intended for three-channel color images, we evaluate them on PBR images following Chambon et al. [5], by averaging the relevant scores of multiple triplets. We report as PBR distribution match the average of the scores over each of the PBR channels independently, as well as over three additional triplets, as the full set of triplets is prohibitively expensive to compute (the supplementary contains all the constituting scores).
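The averaging scheme can be sketched as below. Everything specific in this example is an assumption made for illustration: the channel names, the choice of groups, the rule that a single channel is repeated three times to form a three-channel image, and above all `toy_score`, which is a trivial stand-in and not FID or CMMD.

```python
def stack_group(images, group):
    """Build three-channel views of each image for one channel group.

    images: list of dicts mapping channel name -> flat list of pixel values.
    group:  tuple of one or three channel names (a single channel is repeated).
    """
    if len(group) == 1:
        names = [group[0]] * 3
    elif len(group) == 3:
        names = list(group)
    else:
        raise ValueError("a group must name one or three channels")
    return [[image[name] for name in names] for image in images]

def toy_score(real_views, generated_views):
    """Stand-in metric (NOT FID/CMMD): absolute difference of mean pixel values."""
    def mean(views):
        values = [v for view in views for channel in view for v in channel]
        return sum(values) / len(values)
    return abs(mean(real_views) - mean(generated_views))

def pbr_distribution_match(real, generated, groups, score_fn):
    """Average a three-channel metric over several channel groups."""
    scores = {tuple(g): score_fn(stack_group(real, g), stack_group(generated, g))
              for g in groups}
    return sum(scores.values()) / len(scores), scores

# Two-pixel toy images with hypothetical channel names.
real = [{"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [0.5, 0.5]}]
generated = [{"a": [1.0, 1.0], "b": [1.0, 1.0], "c": [0.5, 0.5]}]
groups = [("a",), ("b",), ("a", "b", "c")]
average, per_group = pbr_distribution_match(real, generated, groups, toy_score)
```

Swapping `toy_score` for a real FID or CMMD implementation leaves the averaging logic unchanged; only the per-group score function differs.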


Fig. 6: t-SNE visualization of CLIP embeddings of all prompts in Objaverse (blue) and our OOD prompts (orange).

Fig. 7: Despite training on 512 × 512, this PBR model is able to produce full quality results in the trained resolution 768 × 768 of its base StableDiffusion 2.1 RGB model. If that does not suffice, our model also performs well on zoomed-in regions.

Out-of-distribution (OOD) performance metrics indicate the level to which our generator can align to conditioning that it was not trained on. Recent work has introduced the CLIP alignment score [13,16], which estimates the average distance between the text prompt CLIP embedding and the generated image’s CLIP embedding, indicating how faithfully the prompt was followed. Additionally, OneAlign [72] is a neural model that estimates aesthetics and quality of input images, shown to align well with human opinions. For the OOD comparison, we hand-pick a subset of 50 objects from Objaverse and ask ChatGPT4 [1] to provide 5 unlikely appearances for each object. Fig. 6 visualizes the t-SNE projection of the CLIP embeddings of the Objaverse prompts and shows how the generated prompts fall in sparsely covered areas of the CLIP space.
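A prompt-alignment score of this kind boils down to comparing paired embeddings. The sketch below averages the cosine similarity between each prompt embedding and the embedding of the image generated for it; it assumes the embeddings were already computed by a CLIP text and image encoder (not shown), uses hypothetical function names, and reports the raw cosine similarity, whereas published variants may rescale or clip it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def mean_alignment(text_embeddings, image_embeddings):
    """Average similarity over (prompt, generated image) pairs."""
    if len(text_embeddings) != len(image_embeddings) or not text_embeddings:
        raise ValueError("need the same non-zero number of text and image embeddings")
    similarities = [cosine_similarity(t, i)
                    for t, i in zip(text_embeddings, image_embeddings)]
    return sum(similarities) / len(similarities)

# Toy 2-D "embeddings": the first pair is perfectly aligned, the second orthogonal.
texts = [[1.0, 0.0], [0.0, 2.0]]
images = [[3.0, 0.0], [5.0, 0.0]]
score = mean_alignment(texts, images)
```

With 50 objects and 5 unlikely appearances each, such an average would run over 250 prompt and image pairs.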

5.1 Comparisons and Ablations

To the best of our knowledge, there are no published PBR generation models that generate PBR images for entire objects or scenes (only for generation of single materials [67]). Therefore, we perform an extensive ablation study on our design choices, taking care to include typical approaches from techniques that generate other modalities than PBR. Please refer to Fig. 8 for a qualitative comparison between the model variants, while Tab. 1 contains the quantitative results.

Comparison between control paradigms. We compare the performance of the proposed bidirectional cross-network communication layer against two other paradigms: one inspired by ControlNet-XS [10], and one inspired by AnimateAnyone [20]. In the former, dubbed one-way communication, the communication layers receive as input only the RGB model’s internal state, and they only affect the PBR model’s internal state. The latter, dubbed clockwise communication, functions in the same way for the encoder part of the architecture, but reverses the information flow to go from the PBR model to the RGB model for the decoder half of the architecture. We see that one-way communication does not perform well, with lower distribution match scores as well as OOD performance scores; the frozen RGB model cannot realign to the conditional distribution required from it in Eq. (1). Clockwise communication performs significantly better, but is likely still hampered by $z'_{\mathrm{rgb},t-1}$ not being easily available to the PBR model, a similar reasoning as to why the authors of ControlNet-XS included the direct communication link between the base and controlling models’ encoders.
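The three connectivity patterns compared here differ only in which direction information flows in each half of the network, which can be written down as a small table. The sketch below is an illustrative encoding of the description above; the `Links` type, the paradigm keys, and the two helper functions are hypothetical names, not part of any released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Links:
    rgb_to_pbr: bool   # PBR branch reads the RGB model's state
    pbr_to_rgb: bool   # RGB model's state is updated from the PBR branch

# Direction of communication per network half, as described in the text.
PARADIGMS = {
    "one-way":       {"encoder": Links(True, False), "decoder": Links(True, False)},
    "clockwise":     {"encoder": Links(True, False), "decoder": Links(False, True)},
    "bidirectional": {"encoder": Links(True, True),  "decoder": Links(True, True)},
}

def rgb_can_be_realigned(paradigm):
    """True if the frozen RGB model receives any signal from the PBR branch."""
    return any(links.pbr_to_rgb for links in PARADIGMS[paradigm].values())

def pbr_sees_rgb_decoder(paradigm):
    """True if the PBR branch can read the RGB decoder's state."""
    return PARADIGMS[paradigm]["decoder"].rgb_to_pbr

summary = {name: (rgb_can_be_realigned(name), pbr_sees_rgb_decoder(name))
           for name in PARADIGMS}
```

Read this way, one-way communication leaves the RGB model without any realignment signal, clockwise communication cuts the PBR branch off from the RGB decoder, and only the bidirectional scheme satisfies both conditions.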

Comparison between communication types In terms of the type of communication, we compare the proposed single-layer per-pixel communication against a per-pixel MLP-based communication layer, and a global attention layer. The latter performs surprisingly well considering that it lacks pixel correspondences; it is hard to enforce pixel-wise alignment through a global attention layer, which we hypothesize to be the reason for the lower quantitative performance. As Jin et al. [24] discuss, an attention-based architecture is also less robust to resolution changes. The per-pixel MLP, containing four hidden per-pixel linear layers with normalization layers [2] in-between, does not qualitatively perform notably better than the single-layer communication layer, so that we settle for the simpler and more computationally efficient choice.

Comparison against fine-tuning We also compare Collaborative Control against the alternative where we edit the first and last layers of the pre-trained network to match the dimensionality of the PBR images (optionally with the rendered image), and then fine-tune the entire network end-to-end. Although the distribution match scores for these fine-tuning variants are similar to Collaborative Control, the fine-tuning methods strongly overfit to the training data and perform very poorly in a qualitative OOD comparison.

PBR-specific VAE vs RGB VAE We compare the performance of Collaborative Control with a PBR-specific VAE against a version that uses the triplets-based RGB VAE mentioned in Sec. 4 to encode the PBR channels (encoding albedo, roughness+metallic, and bump maps in separate triplets and concatenating their latent representations). The mismatch within the PBR domain is clear, both quantitatively through the worse distribution matching scores, and qualitatively in the produced images.

Impact of the training budget. Comparing the version trained on a single A100 with the version trained with 8 A100s (for eight times the batch size), we see that the latter performs significantly better quantitatively in terms of distribution match, but not quality. Visually, the differences are less clear, although the higher-budget model appears to follow complex prompts slightly better.

Impact of the training resolution. We compare the performance of Collaborative Control with two training resolutions: 256 × 256 and 512 × 512, with an evaluation resolution of 512 × 512 (which is also what the ZeroDiffusion base model was trained on). While the low-resolution model quantitatively performs better, visually it is clear that it does not capture the same level of detail as the high-resolution model. We attribute this to the metrics focusing on the high-level encoding of the images rather than on low-level image quality, and to the lower resolution enabling a larger batch size of 42.


Fig. 8: Generated albedo, roughness/metallic and bump map images from the ablation studies. While significant quality differences are visible, only the fine-tuning approach and the data-sparse regime with PBR prompt cross-attention fail completely. The version that was trained on a smaller resolution does not break but does not result in maximum detail either. Best viewed digitally.


Table 1: Quantitative results for all evaluated variants. The ablation baseline is highlighted in bold, duplicated for easier comparisons within the individual ablations.

Impact of training dataset size Finally, we evaluate the performance of Collaborative Control when trained on decreasing amounts of data. For this purpose, we evaluate models trained on 98%, 20%, 5% and 1% of the 6M training images in our full dataset, showing in both the quantitative and qualitative results that the proposed approach is very data-efficient and performs well even when trained on only a few thousand images. We train these models both with and without text cross-attention in the PBR model: crucially, we observe that it is necessary to disable the text cross-attention layer in the PBR model, and that this effect gets more pronounced with fewer data. We hypothesize that the model overfits to the training data, and that forcing prompt attention to occur through the frozen RGB base model prevents this overfitting.

5.2 Compatibility with other control techniques

As a closing experiment, we illustrate that Collaborative Control is compatible with other control techniques [10,46,76], which drastically expands the practical applications of our proposed method. We demonstrate this specifically with IP-Adapter [76]. IP-Adapter allows us to condition the final output on a style image, by only introducing additional style cross-attention layers to the base model. We can apply an IP-Adapter overlay to the base model without requiring retraining, as illustrated in Fig. 9.


Fig. 9: Our PBR diffusion remains compatible with control techniques trained for the frozen RGB model they are linked to. We illustrate this using StableDiffusion 1.5 as the base model using the publicly available IP-Adapter [76].

Fig. 10: The most common failure case of our proposed model is the absence of content in the roughness, metallic, and bump maps. Prompting a porcelain barrel with intricate designs for two different random seeds illustrates this behaviour.

5.3 Limitations and failure cases

We identify two major failure cases: a lack of detail in the roughness, metallic, and bump maps, and a failure to follow out-of-distribution (OOD) prompts. We attribute the former to the training data: Objaverse contains many objects with constant roughness and metallic properties and without any detail in the surface bump map. This likely biases the model towards such outputs, as illustrated in Fig. 10. Anecdotally, we have found that selecting a different random seed will often succeed where the first generation disappointed. In practice, the model produces very diverse results even for the same prompt and the same conditioning geometry, so we argue that this is either not a significant issue or can be resolved with better training data. A failure to follow OOD prompts happens mostly when structural features in the prompt are incompatible with the conditioning geometry, for example a gilded lion for a table mesh. We hypothesize that the control signal from the PBR model conflicts with the text cross-attention in the frozen RGB model, resulting in lackluster outputs. Different random seeds occasionally resolve this issue, albeit more rarely.
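The seed-retry workaround for the first failure case could be automated along the following lines. This is a hypothetical sketch, not part of the proposed method: the `generate` callable, the dictionary of maps, the flatness threshold `tol`, and the use of standard deviation as a proxy for "absence of content" are all assumptions made for illustration.

```python
from statistics import pstdev


def is_flat(values, tol=1e-3):
    """True if a map (flattened to a list of values in [0, 1]) barely varies."""
    return pstdev(values) < tol


def first_detailed_seed(generate, seeds, tol=1e-3):
    """Try seeds in order; return (seed, maps) for the first non-flat result.

    `generate` is a hypothetical callable mapping a seed to a dict such as
    {"roughness": [...], "metallic": [...], "bump": [...]}.
    Returns None if every seed produces at least one flat map.
    """
    for seed in seeds:
        maps = generate(seed)
        if not any(is_flat(values, tol) for values in maps.values()):
            return seed, maps
    return None
```

A constant map is not always an error, since a uniformly non-metallic object legitimately has a flat metallic map, so a check like this is best read as a heuristic for flagging candidates for regeneration rather than as a correctness test.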

In this work, we have proposed Collaborative Control, a new paradigm for leveraging a pre-trained image-based RGB diffusion model for generating high-quality PBR image content conditioned on object geometry. We have shown that this bi-directional control paradigm is extremely data-efficient while retaining the high quality and expressiveness of the base RGB model, even when faced with text queries completely out of distribution for the PBR training data. The plug-and-play nature of our proposed approach is compatible with existing adaptations of the base RGB model, which we have illustrated with IP-Adapter for style guidance of the PBR content. The availability of high-quality PBR content generation as offered by our proposed approach opens up new avenues for graphics applications, specifically in Text-to-Texture.

1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. stat 1050, 21 (2016)

3. Burley, B.: Physically based shading at disney. In: ACM Transactions on Graphics (SIGGRAPH) (2012)

4. Cao, T., Kreis, K., Fidler, S., Sharp, N., Yin, K.: Texfusion: Synthesizing 3d textures with text-guided image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4169–4181 (2023)

5. Chambon, T., Heitz, E., Belcour, L.: Passing multi-channel material textures to a 3-channel loss. In: ACM SIGGRAPH 2021 Talks, pp. 1–2 (2021)

6. Chen, D.Z., Siddiqui, Y., Lee, H.Y., Tulyakov, S., Nießner, M.: Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396 (2023)

7. Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873 (2023)

8. Cook, R.L., Torrance, K.E.: A reflectance model for computer graphics. ACM Transactions on Graphics (ToG) (1982)

9. Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13142–13153 (2023)

10. Zavadski, D., Feiden, J.F., Rother, C.: Controlnet-xs: Designing an efficient and effective architecture for controlling text-to-image diffusion models (2023)

11. Du, X., Kolkin, N., Shakhnarovich, G., Bhattad, A.: Generative models: What do they know? do they know things? let’s find out! arXiv (2023)

12. Duan, Y., Guo, X., Zhu, Z.: Diffusiondepth: Diffusion denoising approach for monocular depth estimation. arXiv preprint arXiv:2303.05021 (2023)

13. Foong, T.Y., Kotyan, S., Mao, P.Y., Vargas, D.V.: The challenges of image generation models in generating multi-component images. arXiv preprint arXiv:2311.13620 (2023)

14. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)

15. Guo, P., Hao, H., Caccavale, A., Ren, Z., Zhang, E., Shan, Q., Sankar, A., Schwing, A.G., Colburn, A., Ma, F.: Stabledreamer: Taming noisy score distillation sampling for text-to-3d. arXiv preprint arXiv:2312.02189 (2023)

16. Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

17. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)

18. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

19. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

20. Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117 (2023)

21. Huang, T., Zeng, Y., Zhang, Z., Xu, W., Xu, H., Xu, S., Lau, R.W., Zuo, W.: Dreamcontrol: Control-based text-to-3d generation with 3d self-prior. arXiv preprint arXiv:2312.06439 (2023)

22. Huang, Y., Wang, J., Shi, Y., Qi, X., Zha, Z.J., Zhang, L.: Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422 (2023)

23. Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking fid: Towards a better evaluation metric for image generation. arXiv preprint arXiv:2401.09603 (2023)

24. Jin, Z., Shen, X., Li, B., Xue, X.: Training-free diffusion model adaptation for variable-sized text-to-image synthesis. arXiv preprint arXiv:2306.08645 (2023)

25. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

26. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.: Alias-free generative adversarial networks. Advances in Neural Information Processing Systems 34, 852–863 (2021)

27. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)

28. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8110–8119 (2020)

29. Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145 (2023)

30. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

31. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)

32. Knodt, J., Gao, X.: Consistent mesh diffusion. arXiv preprint arXiv:2312.00971 (2023)

33. Le, C., Hetang, C., Cao, A., He, Y.: Euclidreamer: Fast and high-quality texturing for 3d models with stable diffusion depth. arXiv preprint arXiv:2311.15573 (2023)

34. Lee, H.Y., Tseng, H.Y., Yang, M.H.: Exploiting diffusion prior for generalizable pixel-level semantic prediction. arXiv preprint arXiv:2311.18832 (2023)

35. Li, X., Zhang, Q., Kang, D., Cheng, W., Gao, Y., Zhang, J., Liang, Z., Liao, J., Cao, Y.P., Shan, Y.: Advances in 3d generation: A survey. arXiv preprint arXiv:2401.17807 (2024)

36. Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. arXiv preprint arXiv:2311.11284 (2023)

37. Lin, S., Liu, B., Li, J., Yang, X.: Common diffusion noise schedules and sample steps are flawed. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5404–5411 (2024)

38. Liu, F., Wu, D., Wei, Y., Rao, Y., Duan, Y.: Sherpa3d: Boosting high-fidelity text-to-3d generation via coarse 3d prior. arXiv preprint arXiv:2312.06655 (2023)

39. Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9298–9309 (2023)

40. Liu, Y.T., Luo, G., Sun, H., Yin, W., Guo, Y.C., Zhang, S.H.: Pi3d: Efficient text-to-3d generation with pseudo-image diffusion. arXiv preprint arXiv:2312.09069 (2023)

41. Liu, Z., Li, Y., Lin, Y., Yu, X., Peng, S., Cao, Y.P., Qi, X., Huang, X., Liang, D., Ouyang, W.: Unidream: Unifying diffusion priors for relightable text-to-3d generation. arXiv preprint arXiv:2312.08754 (2023)

42. Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008 (2023)

43. Ma, B., Deng, H., Zhou, J., Liu, Y.S., Huang, T., Wang, X.: Geodream: Disentangling 2d and geometric priors for high-fidelity and consistent 3d generation. arXiv preprint arXiv:2311.17971 (2023)

44. Ma, Y., Fan, Y., Ji, J., Wang, H., Sun, X., Jiang, G., Shu, A., Ji, R.: X-dreamer: Creating high-quality 3d content by bridging the domain gap between text-to-2d and text-to-3d generation. arXiv preprint arXiv:2312.00085 (2023)

45. Mercier, A., Nakhli, R., Reddy, M., Yasarla, R., Cai, H., Porikli, F., Berger, G.: Hexagen3d: Stablediffusion is just one step away from fast and diverse text-to-3d generation. arXiv preprint arXiv:2401.07727 (2024)

46. Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)

47. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29 (2016)

48. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)

49. Raj, A., Kaza, S., Poole, B., Niemeyer, M., Ruiz, N., Mildenhall, B., Zada, S., Aberman, K., Rubinstein, M., Barron, J., et al.: Dreambooth3d: Subject-driven text-to-3d generation. arXiv preprint arXiv:2303.13508 (2023)

50. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

51. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. Advances in neural information processing systems 29 (2016)

52. Sarkar, A., Mai, H., Mahapatra, A., Lazebnik, S., Forsyth, D.A., Bhattad, A.: Shadows don’t lie and lines can’t bend! generative models don’t know projective geometry... for now. arXiv preprint arXiv:2311.17138 (2023)

53. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042 (2023)

54. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

55. Sharma, P., Jampani, V., Li, Y., Jia, X., Lagun, D., Durand, F., Freeman, W.T., Matthews, M.: Alchemist: Parametric control of material properties with diffusion models. arXiv preprint arXiv:2312.02970 (2023)

56. Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.: Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110 (2023)

57. Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)

58. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)

59. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

60. Stan, G.B.M., Wofk, D., Fox, S., Redden, A., Saxton, W., Yu, J., Aflalo, E., Tseng, S.Y., Nonato, F., Muller, M., et al.: Ldm3d: Latent diffusion model for 3d. arXiv preprint arXiv:2305.10853 (2023)

61. Subias, J.D., Lagunas, M.: In-the-wild material appearance editing using perceptual attributes. In: Computer Graphics Forum. vol. 42, pp. 333–345. Wiley Online Library (2023)

62. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2818–2826 (2016)

63. Tang, B., Wang, J., Wu, Z., Zhang, L.: Stable score distillation for high-quality 3d generation. arXiv preprint arXiv:2312.09305 (2023)

64. Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)

65. https://huggingface.co/drhead: Huggingface zerodiffusion model weights v0.9. https://huggingface.co/drhead/ZeroDiffusion, accessed: 2024-02-08

66. Van Den Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: International conference on machine learning. pp. 1747–1756. PMLR (2016)

67. Vecchio, G., Martin, R., Roullier, A., Kaiser, A., Rouffet, R., Deschaintre, V., Boubekeur, T.: Controlmat: A controlled generative approach to material capture. arXiv preprint arXiv:2309.01700 (2023)

68. Wan, Z., Paschalidou, D., Huang, I., Liu, H., Shen, B., Xiang, X., Liao, J., Guibas, L.: Cad: Photorealistic 3d generation via adversarial distillation. arXiv preprint arXiv:2312.06663 (2023)

69. Wang, P., Fan, Z., Xu, D., Wang, D., Mohan, S., Iandola, F., Ranjan, R., Li, Y., Liu, Q., Wang, Z., et al.: Steindreamer: Variance reduction for text-to-3d score distillation via stein identity. arXiv preprint arXiv:2401.00604 (2023)

70. Wang, Z., Li, M., Chen, C.: Luciddreaming: Controllable object-centric 3d generation. arXiv preprint arXiv:2312.00588 (2023)

71. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213 (2023)

72. Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., et al.: Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)

73. Wu, T., Li, Z., Yang, S., Zhang, P., Pan, X., Wang, J., Lin, D., Liu, Z.: Hyperdreamer: Hyper-realistic 3d content generation and editing from a single image. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–10 (2023)

74. Wu, Z., Zhou, P., Yi, X., Yuan, X., Zhang, H.: Consistent3d: Towards consistent high-fidelity text-to-3d generation with deterministic sampling prior. arXiv preprint arXiv:2401.09050 (2024)

75. Xu, X., Lyu, Z., Pan, X., Dai, B.: Matlaber: Material-aware text-to-3d via latent brdf auto-encoder. arXiv preprint arXiv:2308.09278 (2023)

76. Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models (2023)

77. Yeh, Y.Y., Huang, J.B., Kim, C., Xiao, L., Nguyen-Phuoc, T., Khan, N., Zhang, C., Chandraker, M., Marshall, C.S., Dong, Z., et al.: Texturedreamer: Image-guided texture synthesis through geometry-aware diffusion. arXiv preprint arXiv:2401.09416 (2024)

78. Youwang, K., Oh, T.H., Pons-Moll, G.: Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering. arXiv preprint arXiv:2312.11360 (2023)

79. Yu, K., Liu, J., Feng, M., Cui, M., Xie, X.: Boosting3d: High-fidelity image-to-3d by boosting 2d diffusion prior to 3d prior with progressive learning. arXiv preprint arXiv:2311.13617 (2023)

80. Zeng, X.: Paint3d: Paint anything 3d with lighting-less texture diffusion models. arXiv preprint arXiv:2312.13913 (2023)

81. Zhang, J., Tang, Z., Pang, Y., Cheng, X., Jin, P., Wei, Y., Yu, W., Ning, M., Yuan, L.: Repaint123: Fast and high-quality one image to 3d generation with progressive controllable 2d repainting. arXiv preprint arXiv:2312.13271 (2023)

82. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

83. Zhao, W., Rao, Y., Liu, Z., Liu, B., Zhou, J., Lu, J.: Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153 (2023)

84. Zhou, L., Shih, A., Meng, C., Ermon, S.: Dreampropeller: Supercharge text-to-3d generation with parallel sampling. arXiv preprint arXiv:2311.17082 (2023)

85. Zhuang, J., Wang, C., Lin, L., Liu, L., Li, G.: Dreameditor: Text-driven 3d scene editing with neural fields. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–10 (2023)
