Specification-Driven Extraction Engineering:
Design and maintain declarative extraction specifications—using Pydantic models, JSON schemas, or domain-specific languages—that describe exactly which fields to capture, their types, and validation rules.
Implement pipelines that translate these specifications into executable extraction plans, leveraging both classical (Scrapy, Playwright) and AI-augmented (LLM-based semantic parsing) backends.
Build reusable specification libraries for recurring data types (product prices, tariff codes, regulatory texts) to accelerate onboarding of new sources.
Autonomous & Self-Healing Systems:
Deploy self-healing spiders that automatically detect website layout changes and repair themselves using Model Context Protocol (MCP) servers (e.g., Scrapy MCP Server, Playwright MCP).
Integrate semantic extraction (Scrapy-LLM, custom LLM pipelines) to eliminate selector brittleness—spiders rely on field description...