Virtual Fitting Room: Exploring the Technical Aspect

Owners of online clothing stores face a growing need to adapt to changing consumer habits. The rise of online shopping challenges brands to present their products so that shoppers can easily imagine how the clothes will look on them without physically trying them on.

To address this challenge, R&D Center WINSTARS is building an AI solution that digitizes an entire clothing range and creates 3D models of it. Such a virtual fitting room lets shoppers virtually "try on" clothes, compare different sizes, styles, and colors, and get a more realistic idea of how the clothes will fit. This increases customer satisfaction and reduces the number of product returns, which is a big bonus for site owners.

Therefore, investing in digitization technologies and 3D clothing modeling becomes strategically important for brands that want to stay competitive in today's market environment. But what about the technical side of this solution? This is where Semantic Segmentation comes into play.

Decoding Garments with Semantic Segmentation

There are many different types of clothing, each with unique identifying features. For example, the front of a long-sleeved shirt has different characteristics than the back. Using Computer Vision to analyze a piece of clothing is like cutting along the seams to understand its individual parts.

In the realm of Computer Vision tasks, Semantic Segmentation is particularly noteworthy. It carefully labels each pixel in an image, making it possible to identify different garment sections in great detail, such as the collar, sleeves, or the main body.

Figure 1 shows the original image of a long-sleeved garment (on the left) and its corresponding label mask (on the right). Each color represents a distinct category: the background, main section, sleeves, and collar.

Figure 1 — Example of long sleeve segmentation
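A label mask like the one in Figure 1 can be thought of as a 2D array that stores one class id per pixel. The sketch below is only an illustration: the class ids, the color palette, and the tiny 4x6 mask are assumptions, since the article names the four categories but not how they are encoded.

```python
import numpy as np

# Hypothetical class ids; the article lists the categories but not their encoding.
CLASSES = {0: "background", 1: "main section", 2: "sleeves", 3: "collar"}

# Illustrative RGB color per class id, used only for visualization.
PALETTE = np.array([
    [0, 0, 0],
    [220, 50, 50],
    [50, 120, 220],
    [240, 200, 40],
], dtype=np.uint8)


def colorize(mask):
    """Turn an (H, W) array of class ids into an (H, W, 3) RGB image."""
    return PALETTE[mask]


def class_pixel_counts(mask):
    """Count how many pixels carry each class label."""
    counts = np.bincount(mask.ravel(), minlength=len(CLASSES))
    return {CLASSES[i]: int(n) for i, n in enumerate(counts)}


# A tiny mask: collar on top, sleeves on the sides, main section in the middle.
mask = np.array([
    [0, 0, 3, 3, 0, 0],
    [2, 1, 1, 1, 1, 2],
    [2, 1, 1, 1, 1, 2],
    [0, 1, 1, 1, 1, 0],
])

rgb = colorize(mask)
print(rgb.shape)
print(class_pixel_counts(mask))
```

Storing ids rather than colors keeps the mask compact and makes it easy to compare a predicted mask with the labeled one pixel by pixel.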

This way, the model accurately differentiates the various parts of your product and provides your customers with an unforgettable and user-friendly experience in the virtual fitting room.

Creating a Custom Dataset for Virtual Fitting Room

Our team recognized the lack of open-source datasets for labeled garments. To resolve this issue, we gathered a collection of 2000 images per clothing type from various top brands' websites, with specific criteria for image selection. We ensured each clothing item was flat on a contrasting backdrop and free of any folds.

Figure 2 — Example of squaring images: a) original image; b) squared; c) squared with the same view

We standardized the images to a uniform view because image dimensions varied across online stores. To make the dataset more robust, we used techniques such as shifting and background replacement. By replacing backgrounds and subtly moving the garment's position, we effectively doubled or tripled our dataset, which gave the model a wider range of images to train on.
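The squaring step from Figure 2 could look roughly like the following. This is a minimal sketch, not the team's actual preprocessing code: it assumes the image is a NumPy array and that the padding color is plain white, and the function name `pad_to_square` is made up for illustration.

```python
import numpy as np


def pad_to_square(image, fill=255):
    """Center an (H, W, C) image on a square canvas of side max(H, W)."""
    h, w = image.shape[:2]
    side = max(h, w)
    # Canvas keeps the channel dimension (if any) and the original dtype.
    canvas = np.full((side, side) + image.shape[2:], fill, dtype=image.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas


# A tall 300x200 black rectangle stands in for a product photo.
photo = np.zeros((300, 200, 3), dtype=np.uint8)
squared = pad_to_square(photo)
print(squared.shape)
```

Padding instead of stretching preserves the garment's proportions; a later resize to the training resolution can then be applied uniformly to every image.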

Figure 3 — Extending dataset with background replacing and shifting cloth
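The idea behind Figure 3 can be sketched in a few lines of NumPy. The code below is an illustration under assumptions: the label mask is an array of class ids in which 0 means background, and the helper names `shift_2d` and `augment` are hypothetical. The key point is that the label mask is shifted together with the image so the two stay aligned.

```python
import numpy as np


def shift_2d(array, dx, dy, fill=0):
    """Translate an (H, W, ...) array by dx columns and dy rows, padding with `fill`."""
    h, w = array.shape[:2]
    out = np.full_like(array, fill)
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = array[src_y, src_x]
    return out


def augment(image, labels, background, dx=0, dy=0):
    """Shift the garment, then paste it onto a new background.

    `labels` is an (H, W) array of class ids; 0 is assumed to mean background.
    """
    moved_image = shift_2d(image, dx, dy)
    moved_labels = shift_2d(labels, dx, dy)
    is_garment = moved_labels > 0
    new_image = np.where(is_garment[..., None], moved_image, background)
    return new_image, moved_labels


# Toy example: a 2x2 "garment" (value 200) on a 4x4 image, moved two pixels right.
image = np.zeros((4, 4, 3), dtype=np.uint8)
labels = np.zeros((4, 4), dtype=np.int64)
image[1:3, 0:2] = 200
labels[1:3, 0:2] = 1
background = np.full((4, 4, 3), 30, dtype=np.uint8)

new_image, new_labels = augment(image, labels, background, dx=2, dy=0)
print(new_labels)
```

Each combination of background and offset yields a new training pair from the same labeled garment, which is how a dataset can grow without additional labeling work.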

The Model Architecture

To train and evaluate our model reliably, we divided our dataset into three parts: 70% for training, 15% for validation, and 15% for testing. Given hardware memory limitations, we implemented a data loader that feeds data to the model in batches during training.
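A 70/15/15 split and batch-wise feeding can be sketched with the standard library alone. The file names, the fixed seed, and the helper names below are assumptions for illustration; the batch size of 8 matches the hyperparameters listed later in the article.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=42):
    """Shuffle once, then cut into train / validation / test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def iter_batches(items, batch_size=8):
    """Yield consecutive batches so only one batch needs to be loaded at a time."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names for one clothing type with 2000 images.
paths = [f"shirt_{i:04d}.png" for i in range(2000)]
train_set, val_set, test_set = split_dataset(paths)
print(len(train_set), len(val_set), len(test_set))  # 1400 300 300
```

Splitting file paths rather than loaded images keeps memory use low: the images themselves are read only when a batch is requested.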

To improve the model's ability to generalize, we used the Albumentations library for augmentation, which introduced alterations such as color scaling and rotation. This enriched our dataset and enabled our model to recognize garments based on their features rather than just colors.
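One detail matters for segmentation: geometric changes must be applied to the image and its label mask together, while color changes apply to the image only. The sketch below shows that principle in plain NumPy. It is a stand-in, not the Albumentations API, and it simplifies rotation to quarter turns; the scaling range and the function name `augment_pair` are assumptions.

```python
import numpy as np


def augment_pair(image, mask, rng):
    """Rotate image and mask identically, then rescale the image colors only."""
    k = int(rng.integers(0, 4))                # number of 90-degree turns
    image = np.rot90(image, k, axes=(0, 1))    # same geometric change ...
    mask = np.rot90(mask, k, axes=(0, 1))      # ... for the labels
    scale = rng.uniform(0.7, 1.3, size=3)      # per-channel color scaling
    image = np.clip(image.astype(np.float32) * scale, 0, 255).astype(np.uint8)
    return image, mask


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
mask = rng.integers(0, 4, size=(4, 6))

aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```

Because the mask only ever changes geometrically, the set of labeled pixels is preserved while the model sees the same garment in different orientations and color casts.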

The U-Net architecture was the cornerstone of our project. It is a well-known Computer Vision architecture built around an encoder-decoder structure with skip connections, and it performs well even with limited training data.

Figure 4 — U-Net architecture
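To make the encoder-decoder structure and skip connections concrete, here is a toy data-flow sketch in NumPy. It is deliberately not a trainable U-Net: the convolutions that a real U-Net applies at every level are omitted, and the depth and input size are arbitrary assumptions. It shows only how feature maps are downsampled, upsampled, and concatenated with the saved encoder outputs.

```python
import numpy as np


def downsample(x):
    """2x2 max pooling on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def upsample(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def toy_unet_flow(x, depth=3):
    """Run the U-shaped data flow without any learned layers."""
    skips = []
    for _ in range(depth):                       # encoder: save, then shrink
        skips.append(x)
        x = downsample(x)
    for skip in reversed(skips):                 # decoder: grow, then merge
        x = upsample(x)
        x = np.concatenate([skip, x], axis=0)    # skip connection
    return x


x = np.random.default_rng(0).random((3, 64, 64))
out = toy_unet_flow(x)
print(out.shape)  # (12, 64, 64)
```

The skip connections carry fine spatial detail from the encoder straight to the decoder, which is what lets the output recover sharp boundaries such as the edge between a sleeve and the main body.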

Training and Outstanding Results

While accuracy is a commonly used metric for many machine learning tasks, it is less informative for semantic segmentation, where pixel accuracy can be dominated by large regions such as the background. Therefore, we decided to use the Dice score as our primary metric. This score measures the overlap between the predicted and ground-truth masks: twice the size of their intersection divided by the sum of their sizes. We trained the model with the BCEDiceLoss loss function, which combines binary cross-entropy with a Dice term, so that minimizing the loss also pushes the Dice score up.
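For reference, the Dice score and one common form of a combined BCE and Dice loss can be written in NumPy as follows. This is a sketch under assumptions: the article does not state how its BCEDiceLoss weights the two terms, so an unweighted sum is used here, and the smoothing constant `eps` is an illustrative choice.

```python
import numpy as np


def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |intersection| / (|pred| + |target|), for values in [0, 1]."""
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)


def bce_dice_loss(prob, target, eps=1e-7):
    """Binary cross-entropy plus (1 - Dice), both computed on probabilities."""
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    return bce + (1.0 - dice_score(prob, target))


target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]], dtype=np.float64)
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]], dtype=np.float64)

# Intersection is 3 pixels, each mask has 4 pixels: 2 * 3 / (4 + 4) = 0.75
print(dice_score(pred, target))
print(bce_dice_loss(pred, target))
```

For a multi-class mask, a score like this is typically computed per class on one-hot masks and then averaged.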

Our hyperparameters were:

Image size: (512, 512)
Batch size: 8
Optimizer: Adam
Activation function: softmax
Metrics: DiceScore
Loss: BCEDiceLoss
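With a softmax activation, the network produces one probability per class for every pixel, and the predicted label mask is the class with the highest probability. The snippet below illustrates this on made-up logits; the channel-first layout (classes, H, W) and the four-class setup are assumptions.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Hypothetical raw outputs for 4 classes on a 2x2 image: shape (classes, H, W).
logits = np.zeros((4, 2, 2))
logits[0, 0, 0] = 5.0   # pixel (0, 0) leans towards class 0
logits[1, 0, 1] = 5.0   # pixel (0, 1) leans towards class 1
logits[2, 1, 0] = 5.0   # pixel (1, 0) leans towards class 2
logits[3, 1, 1] = 5.0   # pixel (1, 1) leans towards class 3

probs = softmax(logits)          # per-pixel probabilities, summing to 1 over classes
labels = probs.argmax(axis=0)    # (H, W) array of predicted class ids
print(labels)
```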

After fine-tuning our hyperparameters, our model achieved an average Dice score of 0.981537 on the test set. Examples of segmented test images are shown in Figure 5.

Figure 5 — Results of image segmentation

Conclusion

This exploration of Semantic Segmentation has created new opportunities for the future of online fashion shopping, making it more engaging and precise. By harnessing the U-Net architecture and fine-tuning hyperparameters, the R&D Center WINSTARS team achieved strong segmentation results, as the Dice score shows. Behind this project's technical aspects lies great potential for the e-commerce industry.

With 3D virtual fitting rooms, customers can enjoy a personalized shopping experience from the comfort of their screens. This can lead to more satisfied and loyal customers, a larger customer base, and, ultimately, higher revenue. As the lines between the physical and digital worlds blur further, solutions like these will be instrumental in bridging the gap and setting new benchmarks in online retail.