This research evaluates image processing techniques to enhance 3D reconstruction quality from single-camera inputs, prioritizing computational efficiency for deployment on Raspberry Pi Zero 2 W. The pipeline targets object segmentation as a critical preprocessing step for AI-based 3D generation using Tencent’s Hunyuan API.

Image Processing for 3D Computer Vision

Feature Extraction and Matching Algorithm

Traditional 3D reconstruction methods follow a structured pipeline beginning with Structure from Motion (SfM), which extracts feature points—such as SIFT keypoints—along with their descriptors from input images. These features capture scale- and rotation-invariant characteristics essential for robust matching. The process then advances to Multi-View Stereo (MVS), where corresponding features are matched across multiple images to estimate depth, enabling subsequent stages like mesh reconstruction. This first part of the study specifically focuses on analyzing the SfM and MVS components, as they are critical initial steps in transforming 2D images into 3D representations.

I developed algorithms based on online open-source resources, specifically implementing the SIFT (Scale-Invariant Feature Transform) and Template Matching techniques. These algorithms extract key feature points and match them effectively from image sets taken by binocular devices.

[v1.1] SIFT (sfm1.py)

A key feature of this algorithm is its use of OpenCV's built-in SIFT implementation. The implementation is concise, at about 20 lines, and efficient for feature extraction and matching. Its low computational requirements make it well suited to resource-constrained devices such as the Raspberry Pi Zero 2 W or ESP32.

The number of matched points gives a rough indication of matching quality, although a higher number of matches does not necessarily guarantee greater accuracy. In general, SIFT performs well in feature extraction and matching. However, when images contain noisy backgrounds, additional steps such as main-object detection and segmentation are required to improve performance.

[v1.2] SIFT + Template Matching (sfm2.py)

Due to hardware constraints requiring maximum portability for child-friendly use, we abandoned the binocular camera approach in favor of a single-camera solution. The next part explores using two images taken from different angles for feature processing. Template Matching is introduced as an add-on to the SIFT algorithm.

This code performs feature-based template matching using SIFT descriptors by sliding the template over the query image, computing SIFT features for each region of interest (ROI), and comparing them to the template's descriptors via a brute-force matcher. The similarity scores are stored in a similarity map, with the best match identified as the peak value, making the approach more robust to scale and rotation than traditional pixel-wise methods.

Compared with pure Template Matching, SIFT-enabled Template Matching is more accurate in the matching process. However, it takes more time (~120 s) and requires more computational power.

Meanwhile, we discovered and decided to use Tencent's Hunyuan model, an advanced open API that generates 3D models from single 2D images using AI-powered reconstruction. This simplified the workflow to require only one image without perspective constraints, so my research shifted to optimizing image preprocessing techniques for enhancing 3D models generated from children's photographs. The study subsequently explored how strategic image segmentation and enhancement could improve input quality for single-image 3D reconstruction systems.

Object Segmentation Algorithm

Our investigation revealed that children primarily want to generate 3D models of discrete objects, simplifying our image processing requirements to single-object segmentation. This focus led us to evaluate several computationally efficient segmentation algorithms suitable for the Raspberry Pi Zero 2 W's limited resources.

I implemented and compared four approaches: basic Threshold Segmentation, Otsu's Binary Thresholding, Watershed-based segmentation, and Multi-layer Thresholding. All implementations leveraged OpenCV for core image processing, NumPy for efficient array operations, and Matplotlib for visualization.

[v2.1] Multi-layer Thresholding (os1.py)

The current implementation of my multi-layer thresholding algorithm employs a mean-based adaptive threshold combined with vectorized multi-level segmentation to achieve intensity-tiered region classification.
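As an illustration of what a mean-based, vectorized multi-level segmentation could look like, here is a minimal NumPy sketch. It is not the code in os1.py: the function name multilayer_threshold and the choice of thresholds at 0.5, 1.0, and 1.5 times the image mean are assumptions.

```python
import numpy as np


def multilayer_threshold(gray, factors=(0.5, 1.0, 1.5)):
    """Split a grayscale image into intensity tiers around its mean.

    `factors` are assumed multipliers of the mean intensity; each one defines
    a threshold. Returns (layers, thresholds), where `layers` holds a tier
    index from 0 to len(factors) for every pixel.
    """
    mean = float(gray.mean())
    thresholds = np.clip(np.asarray(factors, dtype=np.float32) * mean, 0, 255)
    # np.digitize assigns every pixel to a tier in one vectorized pass.
    layers = np.digitize(gray, thresholds).astype(np.uint8)
    return layers, thresholds
```

For a 2x2 image with intensities 0, 50, 100, and 200, the mean is 87.5, the thresholds are 43.75, 87.5, and 131.25, and the four pixels fall into tiers 0, 1, 2, and 3 respectively.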

This approach demonstrates promising results for object segmentation, particularly in identifying regions with the greatest intensity diversity, as shown in version 2.2 of the implementation. The method's effectiveness suggests its potential as a reliable technique for target object isolation within constrained computational environments.

[v2.2] Otsu Binary Thresholding + Watershed Method

This hybrid approach combines Otsu's thresholding with Watershed segmentation to achieve robust foreground-background separation. Otsu's method automatically determines the optimal global threshold for initial segmentation, while the Watershed algorithm effectively handles overlapping regions and boundary refinement in the identified foreground.

[v2.3] Combined Multi-layering and Otsu Binary Thresholding (os3.py)

This hybrid thresholding method combines mean-based global intensity classification with local entropy-driven texture analysis to achieve robust image segmentation. The approach demonstrates particular effectiveness on images with homogeneous backgrounds. In addition, the balanced integration maintains computational efficiency while producing clean segmentation results suitable for resource-constrained implementations.

However, when I pre-processed an image with v2.3 before passing it to Hunyuan AI, the results were worse than those obtained by sending the original image to Hunyuan directly.

After I examined the object segmentation unit in Hunyuan’s open-source code, the reason became clear:

(https://huggingface.co/spaces/tencent/Hunyuan3D-2/tree/main, preprocessors.py)

They utilized instance segmentation alongside alpha channel (mask) extraction and composition. Depth-aware masks are also utilized to produce more accurate image segmentation results from single images. This also explains why my pre-processed images failed: by converting images to grayscale, the depth information channel is given up, resulting in less robust segmentation results.
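To make the idea of alpha-mask composition concrete, the following is a generic NumPy sketch of compositing an RGBA cut-out over a plain background. It is not taken from Hunyuan's preprocessors.py; the function name composite_on_background and the white default background are assumptions for illustration only.

```python
import numpy as np


def composite_on_background(rgba, background=(255, 255, 255)):
    """Blend an RGBA image over a solid background using its alpha mask.

    `rgba` is an (H, W, 4) uint8 array whose fourth channel is the mask.
    Returns an (H, W, 3) uint8 image.
    """
    rgb = rgba[..., :3].astype(np.float32)
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0  # scale mask to [0, 1]
    bg = np.asarray(background, dtype=np.float32)
    # Standard "over" operator: object where alpha is 1, background where 0.
    out = alpha * rgb + (1.0 - alpha) * bg
    return np.rint(out).astype(np.uint8)
```

The point of the sketch is that the mask travels alongside the full-color image, so the color channels stay intact, in contrast to a pipeline that reduces the image to grayscale before segmenting.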

Reflection

While these experiments ultimately weren’t integrated into our final project, the investigation proved invaluable for both technical growth and conceptual understanding. By systematically evaluating different approaches—from classical image processing (Otsu thresholding, multi-layer segmentation) to modern 3D reconstruction pipelines (depth-aware masking, alpha compositing)—I gained hands-on experience with critical computer vision paradigms.

Source Code:

https://github.com/InX-de-sign/ISDN2002IndependentStudy

References:

Inclusive & Assistive Products

ISDN2001/2002: Second Year Design Project