E-Commerce in Large Format: How a Software Engineer Sorts Millions of Chaotic Product Attributes

Most debates about e-commerce scaling revolve around headline topics: distributed search systems, live inventory management, recommendation algorithms. But lurking behind them is a quieter, more persistent problem: managing attribute values. It’s technical noise present in every large online store.

The Silent Problem: Why Attribute Values Complicate Everything

Product attributes are fundamental to the customer experience. They drive filters, comparisons, and search rankings. In theory, it sounds simple. In reality: raw values are chaotic.

A simple set might look like: “XL”, “Small”, “12cm”, “Large”, “M”, “S”. Colors? “RAL 3020”, “Crimson”, “Red”, “Dark Red”. Material? “Steel”, “Carbon Steel”, “Stainless”, “Stainless Steel”.

Individually, these inconsistencies seem harmless. But multiply them across 3 million SKUs, each with dozens of attributes, and the problem becomes systemic. Filters behave unpredictably. Search engines lose relevance. Customers experience slower, more frustrating browsing. And backend teams drown in manual data cleaning.

A software engineer at Zoro faced exactly this challenge: a problem easy to overlook but impacting every product page.

The Path to Intelligent Automation Without Losing Control

The first principle was clear: no black-box AI. Such systems are hard to trust, debug, or scale.

Instead, a hybrid pipeline was developed that:

  • remains explainable
  • works predictably
  • truly scales
  • is controllable by humans

The result combined the contextual thinking of modern language models with fixed rules and controls. AI with guardrails, not AI out of control.

Architecture Overview: How It All Fits

The entire process runs in offline background jobs, not in real-time. This was not a compromise – it was architecturally necessary.

Real-time pipelines may sound tempting, but lead to:

  • Unpredictable latency
  • Fragile dependencies
  • Costly compute peaks
  • Operational fragility

Offline processing enables:

  • High throughput: massive data volumes without affecting live systems
  • Resilience: errors never impact customer traffic
  • Cost control: schedule computations during low-traffic times
  • Isolation: language model latency never affects product pages
  • Consistency: updates are atomic and predictable

The architecture works as follows:

  1. Product data comes from the PIM system
  2. An extraction job pulls raw values and context
  3. These go to an AI sorting service
  4. Updated documents land in MongoDB
  5. Outbound synchronization updates the original system
  6. Elasticsearch and Vespa sync the sorted data
  7. APIs connect everything to the customer interface
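The batch flow above can be sketched in a few lines. Everything here is illustrative: the function and field names are assumptions, not the real service API, and the sorting step is stubbed with a plain sort where the AI service would be called.

```python
def extract_attributes(product):
    # Step 2: pull raw values plus category context out of the PIM document.
    # Field names ("raw_values", "category") are hypothetical.
    return {"values": product["raw_values"], "category": product["category"]}

def sort_attributes(payload):
    # Step 3: stand-in for the AI sorting service; here just a plain sort.
    return sorted(payload["values"])

def run_sort_job(products):
    # Steps 4-6 (MongoDB write, outbound sync, search-index sync) are
    # represented here by simply returning the updated documents.
    updated = []
    for product in products:
        payload = extract_attributes(product)
        updated.append({**product, "sorted_values": sort_attributes(payload)})
    return updated
```

Because the job owns the whole batch, retries and human review can happen between any two steps without touching live traffic.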

The Four Layers of the Solution

Layer 1: Data Preparation

Before applying intelligence, a clear preprocessing step was performed. Trimming whitespace. Deduplicating values. Contextualizing category breadcrumbs into structured strings. Removing empty entries.

This may seem fundamental, but it significantly improved AI performance. Garbage in, garbage out – at this scale, small errors can cause big problems later.
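A minimal preprocessing sketch, covering the steps named above (trimming, deduplication, empty removal, breadcrumb flattening); the function name and the ">"-joined breadcrumb format are assumptions for illustration.

```python
def prepare_values(raw_values, breadcrumbs):
    """Preprocess raw attribute values before any model call.

    Trims whitespace, drops empties, deduplicates case-insensitively
    while preserving first-seen order, and flattens category
    breadcrumbs into one structured context string.
    """
    seen, cleaned = set(), []
    for value in raw_values:
        value = value.strip()
        if value and value.lower() not in seen:
            seen.add(value.lower())
            cleaned.append(value)
    context = " > ".join(b.strip() for b in breadcrumbs if b.strip())
    return cleaned, context
```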

Layer 2: Intelligent Sorting with Context

The language model was not just a sorting tool. It reasoned about the values.

The service received:

  • Cleaned attribute values
  • Category metadata
  • Attribute definitions

With this context, the model could understand:

  • That “Voltage” in power tools should be numeric
  • That “Size” in clothing follows a known progression
  • That “Color” may follow RAL standards
  • That “Material” has semantic relations

The model returned:

  • Ordered values in logical sequence
  • Refined attribute names
  • A decision: deterministic or contextual sorting
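The request and response contracts described above might look like the following. The actual schema is not public, so every field name here is an assumption.

```python
import json

def build_sort_request(attribute, values, category, definition):
    # Illustrative request shape: cleaned values plus category metadata
    # and the attribute definition, so the model has context to reason with.
    return {
        "attribute": attribute,
        "category": category,
        "definition": definition,
        "values": values,
    }

def parse_sort_response(raw_json):
    # Illustrative response shape: ordered values, a refined attribute
    # name, and the chosen mode ("deterministic" or "contextual").
    data = json.loads(raw_json)
    return data["ordered_values"], data["refined_name"], data["mode"]
```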

Layer 3: Deterministic Fallbacks

Not every attribute needs intelligence. Numeric ranges, unit-based values, and simple sets benefit from:

  • Faster processing
  • Predictable output
  • Lower costs
  • Zero ambiguity

The pipeline automatically recognized these cases and used deterministic logic. This kept the system efficient and avoided unnecessary LLM calls.
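One way to recognize such cases is a sketch like this: if every value is a number with one shared unit suffix, sort numerically; otherwise signal that the contextual path should take over. The pattern and function are illustrative, not the production logic.

```python
import re

# Matches values like "12", "2.5cm", "20 %": a number plus optional unit.
NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")

def try_deterministic_sort(values):
    """Sort purely numeric or unit-suffixed values without a model call.

    Returns the sorted list when all values are numeric with one shared
    unit; returns None to hand the attribute to the contextual path.
    """
    parsed = []
    for value in values:
        m = NUMERIC.match(value)
        if not m:
            return None  # non-numeric value: needs contextual sorting
        parsed.append((float(m.group(1)), m.group(2).lower(), value))
    if len({unit for _, unit, _ in parsed}) > 1:
        return None  # mixed units (e.g. cm vs in) need contextual handling
    return [original for _, _, original in sorted(parsed, key=lambda p: p[0])]
```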

Layer 4: Human Override

Each category could be tagged as:

  • LLM_SORT: The model decides
  • MANUAL_SORT: Humans define the order

This dual system allowed humans to make final decisions while intelligence handled the heavy lifting. It also built trust – merchants could override the model at any time.
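The tag dispatch can be sketched as below; the tag names come from the article, while the function signature and the rule of appending unknown values after a manual order are assumptions.

```python
def sort_for_category(values, tag, manual_order=None, llm_sort=sorted):
    """Dispatch on the category's sort tag.

    MANUAL_SORT: keep the human-defined order, appending any values
    the merchant has not ranked yet. LLM_SORT: delegate to the
    model-backed sorter (stubbed here with plain `sorted`).
    """
    if tag == "MANUAL_SORT" and manual_order:
        known = [v for v in manual_order if v in values]
        unknown = [v for v in values if v not in manual_order]
        return known + unknown
    return llm_sort(values)
```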

From Chaos to Clarity: Practical Results

The pipeline transformed chaotic raw data:

Attribute | Input Values                                    | Sorted Output
----------|-------------------------------------------------|--------------------------------------
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, RAL 3020
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm

These examples show how combining contextual understanding with clear rules works.

Persistence and Control Across the Entire Chain

All results were stored directly in a Product MongoDB. MongoDB became the single source of truth for:

  • Sorted attribute values
  • Refined attribute names
  • Category-specific sort tags
  • Product-specific sort orders

This simplified reviews, overrides, reprocessing categories, and synchronization with other systems.
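A stored document covering those four pieces of state might look like this; the field names are assumptions for illustration, not Zoro's actual schema.

```python
# Illustrative shape of one per-category sort document in MongoDB.
category_sort_doc = {
    "category_id": "power-tools",        # hypothetical identifier
    "sort_tag": "LLM_SORT",              # or "MANUAL_SORT" (human override)
    "attributes": {
        "Voltage": {
            "refined_name": "Voltage (V)",
            "sorted_values": ["12V", "18V", "20V"],
            "mode": "deterministic",     # which sorting path produced this
        },
    },
}
```

Keeping all of this in one document is what makes category-level review, override, and reprocessing a single read-modify-write.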

After sorting, values flowed into:

  • Elasticsearch for keyword-based search
  • Vespa for semantic and vector-based search

This ensured filters displayed in logical order, product pages showed consistent attributes, and search engines ranked products more accurately.

Why Not Just Use Real-Time?

Real-time processing would mean:

  • Unpredictable latency for live requests
  • Higher compute costs for instant results
  • Fragile dependencies between systems
  • Operational complexity and potential errors during customer traffic

Offline jobs offered:

  • Scalability over millions of products
  • Asynchronous LLM calls without affecting live performance
  • Robust retry logic
  • Windows for human review
  • Predictable compute costs

The trade-off was a slight delay between data ingestion and display. The benefit was consistency at scale – which customers value much more.

Measurable Impact

The solution delivered:

  • Consistent attribute sorting across 3M+ SKUs
  • Predictable numeric order via deterministic fallbacks
  • Business control through manual tagging
  • Cleaner product pages and more intuitive filters
  • Improved search relevance and ranking
  • Increased customer trust and better conversion rates

This was not just a technical win – it was also a victory for user experience and business results.

Key Takeaways for E-Commerce Software Engineers

  • Hybrid pipelines outperform pure AI at scale. Intelligence needs guardrails.
  • Context dramatically improves language model accuracy.
  • Offline jobs are essential for throughput and resilience.
  • Human override mechanisms build trust and acceptance.
  • Clean inputs are the foundation for reliable outputs.

Conclusion

Sorting attribute values sounds simple. But when it involves millions of products, it becomes a real challenge.

By combining language model intelligence with clear rules, contextual understanding, and human control, a complex, hidden problem was transformed into a clean, scalable system.

It reminds us that some of the greatest successes come from solving boring problems – those that are easy to overlook but appear on every product page.
