Research Engineer | Smallest.ai
Job Description
Company Overview
Smallest.ai is an AI research lab pioneering the future of compact, powerful models. We build low-latency, high-accuracy speech-to-text (STT), text-to-speech (TTS), speech-to-speech (S2S), and small language models (SLMs) that power voice and multimodal AI applications across 100+ industries.
Our platform runs with enterprise-grade security, supports on-prem and private cloud deployments, and is fully SOC2, GDPR, HIPAA, and PCI compliant, making it suitable for regulated and high-trust environments.
Job Summary
This role focuses on transforming messy, real-world data into high-quality inputs that machine learning models can learn from. You will work extensively with speech, language, and real-time systems across multiple languages, where data quality and data systems directly drive model performance.
Responsibilities
- Data Pipelines: Build high-throughput pipelines for processing audio, text, and multimodal data in both real-time and batch modes.
- Data Quality & Curation: Clean, filter, deduplicate, and normalize data across various formats (e.g., numbers, emails, code-mixed text).
- Multilingual Data Systems: Handle data from 50+ languages and accents, with a focus on language-aware normalization and segmentation.
- Training Data Engine: Develop pipelines that continuously generate improved training data from production data via active learning loops and smart data selection strategies.
- Evaluation & Benchmarking Pipelines: Create scalable evaluation datasets and automate quality tracking for systems such as automatic speech recognition (ASR) and text-to-speech (TTS).
- Data Infrastructure for Research: Collaborate closely with the research team to enable rapid experimentation and shorten iteration cycles.
Qualifications
- Strong fundamentals in data structures, systems, and pipelines.
- Experience with large-scale data processing, with a preference for audio and text data.
- Ability to work with messy, unstructured real-world data.
- Strong coding skills in Python; systems experience is a plus.
- Understanding of machine learning and data pipelines, including training and evaluation processes.
- Excellent data curation skills.
Preferred Skills
- Experience with speech/audio data (ASR or TTS).
- Familiarity with multilingual datasets.
- Experience with streaming systems, such as Kafka.
- Exposure to data-centric AI and data quality frameworks.
Experience
Minimum experience details are not specified.
Environment
Work setting and location details are not specified.
Salary
Salary information is not specified.
Growth Opportunities
Career advancement opportunities are not specified.
Benefits
Details on offered benefits are not specified.
