E-Commerce in Large Format: How a Software Engineer Sorts Millions of Chaotic Product Attributes

Most debates about e-commerce scaling revolve around headline topics: distributed search systems, live inventory management, recommendation algorithms. But lurking behind them is a quieter, more persistent problem: managing attribute values. It’s technical noise present in every large online store.

The Silent Problem: Why Attribute Values Complicate Everything

Product attributes are fundamental to the customer experience. They drive filters, comparisons, and search rankings. In theory, it sounds simple. In reality: raw values are chaotic.

A simple set might look like: “XL”, “Small”, “12cm”, “Large”, “M”, “S”. Colors? “RAL 3020”, “Crimson”, “Red”, “Dark Red”. Material? “Steel”, “Carbon Steel”, “Stainless”, “Stainless Steel”.

Individually, these inconsistencies seem harmless. But multiply them across 3 million SKUs, each with dozens of attributes, and the problem becomes systemic. Filters behave unpredictably. Search engines lose relevance. Customers experience slower, more frustrating browsing. And backend teams drown in manual data cleaning.

A software engineer at Zoro faced exactly this challenge: a problem easy to overlook but impacting every product page.

The Path to Intelligent Automation Without Losing Control

The first principle was clear: no black-box AI. Such systems are hard to trust, debug, or scale.

Instead, a hybrid pipeline was developed that:

  • remains explainable
  • works predictably
  • truly scales
  • is controllable by humans

The result combined the contextual thinking of modern language models with fixed rules and controls. AI with guardrails, not AI out of control.

Architecture Overview: How It All Fits

The entire process runs in offline background jobs, not in real-time. This was not a compromise – it was architecturally necessary.

Real-time pipelines may sound tempting, but lead to:

  • Unpredictable latency
  • Fragile dependencies
  • Costly compute peaks
  • Operational fragility

Offline processing enables:

  • High throughput: massive data volumes without affecting live systems
  • Resilience: errors never impact customer traffic
  • Cost control: schedule computations during low-traffic times
  • Isolation: language model latency never affects product pages
  • Consistency: updates are atomic and predictable

The architecture works as follows:

  1. Product data comes from the PIM system
  2. An extraction job pulls raw values and context
  3. These go to an AI sorting service
  4. Updated documents land in MongoDB
  5. Outbound synchronization updates the original system
  6. Elasticsearch and Vespa sync the sorted data
  7. APIs connect everything to the customer interface
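The batch flow above can be sketched in a few lines. Everything here is illustrative: the function and field names are assumptions, not the real service API, and the sorting step is stubbed with a plain sort where the AI service would be called.

```python
def extract_attributes(product):
    # Step 2: pull raw values plus category context out of the PIM document.
    # Field names ("raw_values", "category") are hypothetical.
    return {"values": product["raw_values"], "category": product["category"]}

def sort_attributes(payload):
    # Step 3: stand-in for the AI sorting service; here just a plain sort.
    return sorted(payload["values"])

def run_sort_job(products):
    # Steps 4-6 (MongoDB write, outbound sync, search-index sync) are
    # represented here by simply returning the updated documents.
    updated = []
    for product in products:
        payload = extract_attributes(product)
        updated.append({**product, "sorted_values": sort_attributes(payload)})
    return updated
```

Because the job owns the whole batch, retries and human review can happen between any two steps without touching live traffic.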

The Four Layers of the Solution

Layer 1: Data Preparation

Before applying intelligence, a clear preprocessing step was performed. Trimming whitespace. Deduplicating values. Contextualizing category breadcrumbs into structured strings. Removing empty entries.

This may seem fundamental, but it significantly improved AI performance. Garbage in, garbage out – at this scale, small errors can cause big problems later.
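A minimal preprocessing sketch, covering the steps named above (trimming, deduplication, empty removal, breadcrumb flattening); the function name and the ">"-joined breadcrumb format are assumptions for illustration.

```python
def prepare_values(raw_values, breadcrumbs):
    """Preprocess raw attribute values before any model call.

    Trims whitespace, drops empties, deduplicates case-insensitively
    while preserving first-seen order, and flattens category
    breadcrumbs into one structured context string.
    """
    seen, cleaned = set(), []
    for value in raw_values:
        value = value.strip()
        if value and value.lower() not in seen:
            seen.add(value.lower())
            cleaned.append(value)
    context = " > ".join(b.strip() for b in breadcrumbs if b.strip())
    return cleaned, context
```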

Layer 2: Intelligent Sorting with Context

The language model was not just a sorting tool. It reasoned about the values.

The service received:

  • Cleaned attribute values
  • Category metadata
  • Attribute definitions

With this context, the model could understand:

  • That “Voltage” in power tools should be numeric
  • That “Size” in clothing follows a known progression
  • That “Color” may follow RAL standards
  • That “Material” has semantic relations

The model returned:

  • Ordered values in logical sequence
  • Refined attribute names
  • A decision: deterministic or contextual sorting
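The request and response contracts described above might look like the following. The actual schema is not public, so every field name here is an assumption.

```python
import json

def build_sort_request(attribute, values, category, definition):
    # Illustrative request shape: cleaned values plus category metadata
    # and the attribute definition, so the model has context to reason with.
    return {
        "attribute": attribute,
        "category": category,
        "definition": definition,
        "values": values,
    }

def parse_sort_response(raw_json):
    # Illustrative response shape: ordered values, a refined attribute
    # name, and the chosen mode ("deterministic" or "contextual").
    data = json.loads(raw_json)
    return data["ordered_values"], data["refined_name"], data["mode"]
```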

Layer 3: Deterministic Fallbacks

Not every attribute needs intelligence. Numeric ranges, unit-based values, and simple sets benefit from:

  • Faster processing
  • Predictable output
  • Lower costs
  • Zero ambiguity

The pipeline automatically recognized these cases and used deterministic logic. This kept the system efficient and avoided unnecessary LLM calls.
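One way to recognize such cases is a sketch like this: if every value is a number with one shared unit suffix, sort numerically; otherwise signal that the contextual path should take over. The pattern and function are illustrative, not the production logic.

```python
import re

# Matches values like "12", "2.5cm", "20 %": a number plus optional unit.
NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")

def try_deterministic_sort(values):
    """Sort purely numeric or unit-suffixed values without a model call.

    Returns the sorted list when all values are numeric with one shared
    unit; returns None to hand the attribute to the contextual path.
    """
    parsed = []
    for value in values:
        m = NUMERIC.match(value)
        if not m:
            return None  # non-numeric value: needs contextual sorting
        parsed.append((float(m.group(1)), m.group(2).lower(), value))
    if len({unit for _, unit, _ in parsed}) > 1:
        return None  # mixed units (e.g. cm vs in) need contextual handling
    return [original for _, _, original in sorted(parsed, key=lambda p: p[0])]
```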

Layer 4: Human Override

Each category could be tagged as:

  • LLM_SORT: The model decides
  • MANUAL_SORT: Humans define the order

This dual system allowed humans to make final decisions while intelligence handled the heavy lifting. It also built trust – merchants could override the model at any time.
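The tag dispatch can be sketched as below; the tag names come from the article, while the function signature and the rule of appending unknown values after a manual order are assumptions.

```python
def sort_for_category(values, tag, manual_order=None, llm_sort=sorted):
    """Dispatch on the category's sort tag.

    MANUAL_SORT: keep the human-defined order, appending any values
    the merchant has not ranked yet. LLM_SORT: delegate to the
    model-backed sorter (stubbed here with plain `sorted`).
    """
    if tag == "MANUAL_SORT" and manual_order:
        known = [v for v in manual_order if v in values]
        unknown = [v for v in values if v not in manual_order]
        return known + unknown
    return llm_sort(values)
```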

From Chaos to Clarity: Practical Results

The pipeline transformed chaotic raw data:

Attribute | Input Values                                    | Sorted Output
----------|-------------------------------------------------|--------------------------------------
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, RAL 3020
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm

These examples show how combining contextual understanding with clear rules works.

Persistence and Control Across the Entire Chain

All results were stored directly in a Product MongoDB. MongoDB became the single source of truth for:

  • Sorted attribute values
  • Refined attribute names
  • Category-specific sort tags
  • Product-specific sort orders

This simplified reviews, overrides, reprocessing categories, and synchronization with other systems.
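A stored document covering those four pieces of state might look like this; the field names are assumptions for illustration, not Zoro's actual schema.

```python
# Illustrative shape of one per-category sort document in MongoDB.
category_sort_doc = {
    "category_id": "power-tools",        # hypothetical identifier
    "sort_tag": "LLM_SORT",              # or "MANUAL_SORT" (human override)
    "attributes": {
        "Voltage": {
            "refined_name": "Voltage (V)",
            "sorted_values": ["12V", "18V", "20V"],
            "mode": "deterministic",     # which sorting path produced this
        },
    },
}
```

Keeping all of this in one document is what makes category-level review, override, and reprocessing a single read-modify-write.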

After sorting, values flowed into:

  • Elasticsearch for keyword-based search
  • Vespa for semantic and vector-based search

This ensured filters displayed in logical order, product pages showed consistent attributes, and search engines ranked products more accurately.

Why Not Just Use Real-Time?

Real-time processing would mean:

  • Unpredictable latency for live requests
  • Higher compute costs for instant results
  • Fragile dependencies between systems
  • Operational complexity and potential errors during customer traffic

Offline jobs offered:

  • Scalability over millions of products
  • Asynchronous LLM calls without affecting live performance
  • Robust retry logic
  • Windows for human review
  • Predictable compute costs

The trade-off was a slight delay between data ingestion and display. The benefit was consistency at scale – which customers value much more.

Measurable Impact

The solution delivered:

  • Consistent attribute sorting across 3M+ SKUs
  • Predictable numeric order via deterministic fallbacks
  • Business control through manual tagging
  • Cleaner product pages and more intuitive filters
  • Improved search relevance and ranking
  • Increased customer trust and better conversion rates

This was not just a technical win – it was also a victory for user experience and business results.

Key Takeaways for E-Commerce Software Engineers

  • Hybrid pipelines outperform pure AI at scale. Intelligence needs guardrails.
  • Context dramatically improves language model accuracy.
  • Offline jobs are essential for throughput and resilience.
  • Human override mechanisms build trust and acceptance.
  • Clean inputs are the foundation for reliable outputs.

Conclusion

Sorting attribute values sounds simple. But when it involves millions of products, it becomes a real challenge.

By combining language model intelligence with clear rules, contextual understanding, and human control, a complex, hidden problem was transformed into a clean, scalable system.

It reminds us that some of the greatest successes come from solving boring problems – those that are easy to overlook but appear on every product page.
