Most debates about e-commerce scaling revolve around the glamorous topics: distributed search systems, live inventory management, recommendation algorithms. But lurking behind them is a quieter, more persistent problem: managing attribute values, the kind of technical noise present in every large online store.
The Silent Problem: Why Attribute Values Complicate Everything
Product attributes are fundamental to the customer experience. They drive filters, comparisons, and search rankings. In theory, this sounds simple. In reality, raw values are chaotic. A single size attribute might arrive as "XL", "Small", "12cm", "Large", "M", "S". Colors? "RAL 3020", "Crimson", "Red", "Dark Red". Material? "Steel", "Carbon Steel", "Stainless", "Stainless Steel".
Individually, these inconsistencies seem harmless. But multiplied across 3 million SKUs, each with dozens of attributes, the problem becomes systemic. Filters behave unpredictably. Search engines lose relevance. Customers experience slower, more frustrating browsing. And backend teams drown in manual data cleaning.
A software engineer at Zoro faced exactly this challenge: a problem easy to overlook but impacting every product page.
The Path to Intelligent Automation Without Losing Control
The first principle was clear: no black-box AI. Such systems are hard to trust, debug, or scale.
Instead, a hybrid pipeline was developed that:
remains explainable
works predictably
truly scales
is controllable by humans
The result combined the contextual thinking of modern language models with fixed rules and controls. AI with guardrails, not AI out of control.
Architecture Overview: How It All Fits
The entire process runs in offline background jobs, not in real-time. This was not a compromise – it was architecturally necessary.
Real-time pipelines may sound tempting, but lead to:
Unpredictable latency
Fragile dependencies
Costly compute peaks
Operational fragility
Offline processing enables:
High throughput: massive data volumes without affecting live systems
Resilience: errors never impact customer traffic
Cost control: schedule computations during low-traffic times
Isolation: language model latency never affects product pages
Consistency: updates are atomic and predictable
The architecture works as follows:
Product data comes from the PIM system
An extraction job pulls raw values and context
These go to an AI sorting service
Updated documents land in MongoDB
Outbound synchronization updates the original system
Elasticsearch and Vespa sync the sorted data
APIs connect everything to the customer interface
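The article does not publish code, but a minimal Python sketch of this offline flow could look like the following. The pim_client and sorting_service interfaces, the collection names, and the field names are illustrative assumptions, not the actual implementation.

```python
from pymongo import MongoClient

# Hypothetical orchestration of the offline sorting job. The PIM client,
# sorting service, collection names, and field names are assumptions.

def run_sorting_job(pim_client, sorting_service, mongo_uri="mongodb://localhost:27017"):
    db = MongoClient(mongo_uri)["products"]

    # 1-2. Pull raw attribute values and category context from the PIM.
    for product in pim_client.extract_products():
        context = {
            "category_breadcrumb": product["category_breadcrumb"],
            "attribute_definitions": product["attribute_definitions"],
        }

        # 3. Ask the sorting service (LLM + deterministic rules) for an order.
        sorted_values = sorting_service.sort(product["attribute_values"], context)

        # 4. Persist the result; MongoDB acts as the single source of truth.
        db.sorted_attributes.update_one(
            {"sku": product["sku"]},
            {"$set": {"sorted_values": sorted_values}},
            upsert=True,
        )

    # 5-7. Separate jobs then sync MongoDB outward to the PIM,
    # Elasticsearch, Vespa, and the customer-facing APIs.
```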
The Four Layers of the Solution
Layer 1: Data Preparation
Before applying any intelligence, a dedicated preprocessing step was performed: trimming whitespace, deduplicating values, flattening category breadcrumbs into structured context strings, and removing empty entries.
This may seem fundamental, but it significantly improved AI performance. Garbage in, garbage out – at this scale, small errors can cause big problems later.
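As an illustration, a minimal preprocessing sketch might look like this; it assumes values arrive as plain strings and breadcrumbs as lists, which the article does not specify.

```python
def prepare_attribute_values(raw_values, breadcrumb):
    """Trim, deduplicate, and drop empty values; flatten the category
    breadcrumb into a single context string for the model."""
    cleaned = []
    seen = set()
    for value in raw_values:
        value = value.strip()
        if not value:
            continue                      # remove empty entries
        key = value.casefold()
        if key in seen:
            continue                      # deduplicate case-insensitively
        seen.add(key)
        cleaned.append(value)

    category_context = " > ".join(part.strip() for part in breadcrumb)
    return cleaned, category_context


# Example:
# prepare_attribute_values([" XL", "xl", "", "Small"], ["Apparel", "Shirts"])
# -> (["XL", "Small"], "Apparel > Shirts")
```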
Layer 2: Intelligent Sorting with Context
The language model was not just a sorting tool. It reasoned about the values.
The service received:
Cleaned attribute values
Category metadata
Attribute definitions
With this context, the model could understand:
That “Voltage” in power tools should be numeric
That “Size” in clothing follows a known progression
That “Color” may follow RAL standards
That “Material” has semantic relations
The model returned:
Ordered values in logical sequence
Refined attribute names
A decision: deterministic or contextual sorting
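A hedged sketch of such a service call is shown below. The call_llm client is a placeholder, and the prompt wording and response field names are assumptions, since the actual schema is not published.

```python
import json

def build_sorting_prompt(values, category_context, attribute_definition):
    """Assemble the context the model needs to reason about the ordering."""
    return (
        "You sort e-commerce attribute values into the order a shopper expects.\n"
        f"Category: {category_context}\n"
        f"Attribute: {attribute_definition['name']} ({attribute_definition['type']})\n"
        f"Values: {json.dumps(values)}\n"
        "Return JSON with keys: sorted_values, refined_name, "
        "sort_mode ('deterministic' or 'contextual')."
    )

def sort_with_llm(values, category_context, attribute_definition, call_llm):
    # call_llm stands in for whatever model client the team actually uses.
    raw = call_llm(build_sorting_prompt(values, category_context, attribute_definition))
    result = json.loads(raw)
    # Guardrail: the model may only reorder values, never invent or drop them.
    if sorted(result["sorted_values"]) != sorted(values):
        raise ValueError("Model changed the value set; route to review instead")
    return result
```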
Layer 3: Deterministic Fallbacks
Not every attribute needs intelligence. Numeric ranges, unit-based values, and simple sets benefit from:
Faster processing
Predictable output
Lower costs
Zero ambiguity
The pipeline automatically recognized these cases and used deterministic logic. This kept the system efficient and avoided unnecessary LLM calls.
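One possible shape for that deterministic path is sketched below. The regex and unit handling are assumptions, and units are not normalized against each other, so it only covers values that share a unit.

```python
import re

# Matches values like "12cm", "2.5 mm", "20kg" (an assumed unit format).
_UNIT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z%]*)\s*$")

def try_deterministic_sort(values):
    """Sort purely numeric or unit-based values without calling the LLM.
    Returns None when the values need contextual reasoning instead."""
    parsed = []
    for value in values:
        match = _UNIT_PATTERN.match(value)
        if not match:
            return None                   # mixed/semantic values -> LLM path
        parsed.append((float(match.group(1)), value))
    return [original for _, original in sorted(parsed)]


# try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"]) -> ["2cm", "5cm", "12cm", "20cm"]
# try_deterministic_sort(["Red", "Crimson"]) -> None
```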
Layer 4: Human Override
Each category could be tagged as:
LLM_SORT: The model decides
MANUAL_SORT: Humans define the order
This dual system allowed humans to make final decisions while intelligence handled the heavy lifting. It also built trust – merchants could override the model at any time.
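A small dispatch sketch shows how such tags might gate the pipeline; field names like sort_tag and manual_order are illustrative assumptions.

```python
def sort_attribute(values, category_config, deterministic_sorter, llm_sorter):
    """Respect the human override before any automated ordering runs."""
    if category_config["sort_tag"] == "MANUAL_SORT":
        # Merchants own the order: keep their curated list and append
        # any values not yet covered by it at the end.
        curated = category_config["manual_order"]
        return [v for v in curated if v in values] + [v for v in values if v not in curated]

    # LLM_SORT: try the cheap deterministic path first, then fall back to the model.
    return deterministic_sorter(values) or llm_sorter(values)
```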
From Chaos to Clarity: Practical Results
The pipeline transformed chaotic raw data:
| Attribute | Input Values | Sorted Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
These examples show how contextual understanding and clear rules work together.
Persistence and Control Across the Entire Chain
All results were stored directly in a product MongoDB database, which became the single source of truth for:
Sorted attribute values
Refined attribute names
Category-specific sort tags
Product-specific sort orders
This simplified reviews, overrides, reprocessing categories, and synchronization with other systems.
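The article does not show the stored document shape; one plausible layout covering the fields above (all field names are assumptions) could be:

```python
# A plausible MongoDB document per attribute (field names are assumptions):
sorted_attribute_doc = {
    "sku": "ABC-123",
    "category_id": "power-tools/drills",
    "attribute_name": "Voltage",          # refined attribute name from the model
    "raw_values": ["18V", "12 V", "20V"],
    "sorted_values": ["12 V", "18V", "20V"],
    "sort_tag": "LLM_SORT",               # category-specific tag: LLM_SORT or MANUAL_SORT
    "sort_mode": "deterministic",         # how this particular order was produced
    "reviewed_by": None,                  # set when a human overrides the result
}
```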
After sorting, values flowed into:
Elasticsearch for keyword-based search
Vespa for semantic and vector-based search
This ensured filters displayed in logical order, product pages showed consistent attributes, and search engines ranked products more accurately.
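A sketch of the keyword-search sync, assuming the official Elasticsearch Python client (8.x) and an illustrative index name; the Vespa feed would follow the same read-from-MongoDB pattern.

```python
from elasticsearch import Elasticsearch  # assumes the official 8.x Python client

def sync_sorted_attributes(mongo_docs, es_url="http://localhost:9200"):
    """Push the MongoDB-approved sort order into the keyword search index."""
    es = Elasticsearch(es_url)
    for doc in mongo_docs:
        es.index(
            index="product-attributes",   # illustrative index name
            id=f'{doc["sku"]}:{doc["attribute_name"]}',
            document={
                "sku": doc["sku"],
                "attribute": doc["attribute_name"],
                # Keep the list order: the UI renders filters in this sequence.
                "values": doc["sorted_values"],
            },
        )
```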
Why Not Just Use Real-Time?
Real-time processing would mean:
Unpredictable latency for live requests
Higher compute costs for instant results
Fragile dependencies between systems
Operational complexity and potential errors during customer traffic
Offline jobs offered:
Scalability over millions of products
Asynchronous LLM calls without affecting live performance
Robust retry logic
Windows for human review
Predictable compute costs
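For example, the retry logic around LLM and service calls could be as simple as exponential backoff with jitter; this is a generic sketch, not the actual code.

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0):
    """Retry transient LLM/service failures with exponential backoff and jitter.
    Acceptable here only because the job runs offline, away from customer traffic."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```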
The trade-off was a slight delay between data ingestion and display. The benefit was consistency at scale – which customers value much more.
Measurable Impact
The solution delivered:
Consistent attribute sorting across 3M+ SKUs
Predictable numeric order via deterministic fallbacks
Business control through manual tagging
Cleaner product pages and more intuitive filters
Improved search relevance and ranking
Increased customer trust and better conversion rates
This was not just a technical win – it was also a victory for user experience and business results.
Key Takeaways for E-Commerce Software Engineers
Hybrid pipelines outperform pure AI at scale. Intelligence needs guardrails.
Context dramatically improves language model accuracy.
Offline jobs are essential for throughput and resilience.
Human override mechanisms build trust and acceptance.
Clean inputs are the foundation for reliable outputs.
Conclusion
Sorting attribute values sounds simple. But when it involves millions of products, it becomes a real challenge.
By combining language model intelligence with clear rules, contextual understanding, and human control, a complex, hidden problem was transformed into a clean, scalable system.
It reminds us that some of the greatest successes come from solving boring problems – those that are easy to overlook but appear on every product page.