Kourosh Meshgi, PhD
Senior Applied Scientist & Technical Lead
I build AI systems that work, from research prototype to production deployment. With a PhD from Kyoto University and 10+ years spanning academic research, government-funded projects, and industry leadership, I specialize in the intersection of computer vision, natural language processing, and multimodal generative AI.
Based in Arlington, VA US Permanent Resident
Expertise
What I Do
Generative AI & LLM Systems
Fine-tuning, red-teaming, RAG pipelines, multimodal LLMs, agentic workflows (A2A, MCP). From model evaluation to production-ready deployment.
Computer Vision & Perception
Object tracking, detection, segmentation, scene understanding, video analysis. 15+ papers on robust visual tracking under real-world conditions published in CVPR, ICIP, ACCV.
NLP, Speech & Multimodal Systems
Text classification, multi-task learning, named entity recognition, ASR-based systems, cross-lingual knowledge. Published in ACL and Interspeech.
Portfolio
Featured Projects
Automation-from-Demonstration (AfD)
Converts raw screen recordings into executable automation plans through self-corrective imitation learning, the system watches a human demonstrate a task once, infers the underlying intent, and synthesizes a replicable policy without manual scripting. UI understanding fuses OCR, DOM, SOM (visual element detection and indexing), VLM, and transformer-based template matching into a unified perceptual layer. These signals are unified through graph-grounded UI reasoning, where screen elements become nodes in a relational graph and actions are resolved as edges, making the planner robust to layout changes and dynamic content. Context-aware trajectory synthesis then generates action sequences that adapt to state at each step rather than replaying a fixed script, and adaptive policy refinement continuously corrects drift using execution feedback, closing the loop between observation and action.
↳ Enables non-technical users to automate complex UI workflows without writing a single line of code.
Multimodal Video RAG System
A full-stack multimodal RAG pipeline that extracts every computable signal from video, visual, speech, textual, and semantic, and unifies them into a single timestamp-aligned knowledge structure. Scene boundaries are detected automatically; keyframes are captioned by a VLM; speech is transcribed and force-aligned at the word level; on-screen text is recovered via EAST detection and OCR; speakers are diarized without supervision and identified by name through LLM reasoning over transcript, title, and context signals. CLIP embeddings indexed in FAISS enable sub-second semantic frame retrieval by text or image query, while a hybrid search layer fuses transcript and OCR hits. An LLM-powered query planner routes natural language questions to the right analysis path, returning answers, timestamps, or assembled highlight reels.
↳ Turns any unstructured video into a fully queryable knowledge base, ask a question and get a moment.
Adaptive Agentic AI Simulations
Orchestrated multi-agent systems for personalized user behavior modeling, building behavior-driven simulations that support realistic agentic task execution and long-horizon planning. Agents communicate via the A2A protocol, each instantiated with a distinct persona and curated information context. Evaluation coverage is comprehensive: from replicating user perception and modeling decision-making behavior and reasoning, to identifying pain points and surfacing concrete improvement avenues, closing the loop between simulation and actionable product insight. A core addition: structured internal dialogue between agents, governed by a dedicated moderator agent that detects and corrects persona drift in real time, preventing the homogenization that collapses multi-agent discussions into a single voice and steering participants toward complementary, synergistic contributions.
↳ Full-spectrum evaluation: perception → behavior → reasoning → pain points → improvement, grounded in realistic persona-driven simulation at scale.
World Map Auto-Generation for Robots
Led a 7-person team building vision-based SLAM systems for urban autonomous navigation, producing enriched context-aware world models that go well beyond traditional navigation layers. The multi-modal perception pipeline fused camera-based scene understanding, LiDAR, IMU, and photogrammetry-driven 3D point cloud reconstruction with semantic environment parsing, enabling detection and reasoning over traffic dynamics, pedestrian behavior, urban structures, occlusions, road hazards, traffic signal patterns, and risk-prone intersections. External signals (crash statistics, environmental conditions, behavioral traffic patterns) were incorporated to model real-world operational risk directly in the map. A specialized focus on occlusion reasoning and perception failure analysis identified scenarios where both human drivers and autonomous systems fail to detect hidden objects or hazardous trajectories, embedding that safety awareness into the mapping framework. Robustness was validated across diverse traffic densities, weather, lighting, and urban layouts, including complex Japanese road systems and left-side driving, and extended by large-scale simulation environments covering globally diverse edge cases.
↳ Spatial AI at the intersection of computer vision, sensor fusion, and human safety. Context-aware maps that make autonomous navigation interpretable and failure-resistant in real-world urban complexity.
SHINRA: Multilingual Wikipedia Knowledge Structuring
Existing structured knowledge bases (Wikidata, DBpedia, Freebase) are notoriously noisy, schema mismatches, ambiguous attributes, and sparse coverage across languages. SHINRA tackled this from both ends: a top-down Extended Named Entity (ENE) ontology of 219 fine-grained categories with 10–30 typed attributes each, combined with bottom-up population through a collaborative evaluation framework where research teams worldwide run their information extraction systems on the full Wikipedia and contribute outputs that are ensemble-merged. Two task tracks ran annually: the Japanese Attribute Extraction task (extracting structured attribute values from Wikipedia pages across 45+ entity categories, with 15+ systems and 40+ committee members) and the Multilingual Categorization task (classifying Wikipedia pages in 30 languages into all 219 ENE categories). Ensemble learning over participant outputs consistently outperformed every individual system, e.g., Airport attribute extraction jumped from 72 (best single system) to 87 F1 through ensemble merging.
↳ 30 languages, 219 entity categories, millions of Wikipedia pages structured. Global participation across 10 countries; 40+ PC members from Cambridge, UIUC, NII, Tohoku, and beyond.
Robotic Arm Trajectory Planning for Confined Spaces
Designed optimal trajectory planning and collision avoidance for industrial robotic arms operating in confined spaces alongside humans. Integrated structured-light 3D sensing for real-time obstacle and human detection, built predictive collision models to intercept hazards before they occurred, and implemented a digital kill switch, an emergency stop mechanism that instantly overrides all robot motion when human proximity is detected. Validated across a wide range of simulated and controlled scenarios per industrial safety standards.
↳ Safety-critical system; earned the NEDO Prize from Japan's Ministry of Economy, Trade & Industry.
Object Tracking under Real-World Uncertainty
A government-funded R&D project in Japan, a public transportation scenario demanding a tracker that simultaneously handles conflicting challenges: occlusion, clutter, scale change, motion blur, and low resolution, all within a near real-time processing budget. No single classifier handles the full range, and high-performance models are too slow. The core insight was asymmetric co-tracking: a fast but naive classifier handles the incoming stream continuously, and when it encounters uncertainty it queries a slower but more knowledgeable classifier for guidance, active learning as the bridge between speed and accuracy, modified to respect tight latency constraints.
Follow-on funding unlocked a sequence of extensions, each addressing a remaining failure mode. The fast classifier became an ensemble where each member processed overlapping data windows in a boosting arrangement, spreading the version space more effectively. A mixture of long and short-term memories gave the tracker persistence through extended occlusions. Adversarially generated training samples hardened the model against domain shift. Q-learning replaced hand-tuned heuristics to balance when to consolidate long-term memory versus adapt short-term. An active critic mechanism generated maximally informative samples to promote tighter collaboration between co-learners. Later work embedded tacit knowledge from intermediate CNN layers and applied reinforcement learning to adapt correlation filter parameters on the fly.
↳ Delivered on time, results covered in the press. Funded extensions across two government projects spanning six years and 14 publications.
Smart ASR Captioning for Language Learners
Started with a meta-analysis that uncovered a key empirical overlap: ASR errors and L2 listening difficulties co-occur on the same speech segments, phonetically dense clusters, fast speech, and low-frequency vocabulary trip up both speech recognizers and second-language listeners alike, making ASR error rate a proxy signal for human listening difficulty without any learner annotations. That finding grounded the core system: PSC, Partial and Synchronized Caption, where forced alignment positions each word in time and the system selectively reveals only the hard segments, synchronized to the audio, scaffolding without dependency. From there, each study added a layer: exploiting ASR errors to choose which words to show, then learner-adaptive personalization through click feedback, then sentence complexity as an additional difficulty signal, then simulated annealing to auto-tune difficulty thresholds per proficiency level, and finally self-regulation training that gradually reduces caption density as learners improve.
↳ Comprehension and retention improved significantly (p<0.001) vs. no-caption control. Widely cited and praised in the CALL community. 12 papers, 100+ citations.
LLM Red-Teaming & Guardrail Alignment
Fine-tuned LLMs for production deployment and conducted systematic red-teaming to surface safety failures and misalignment. Built evaluation pipelines using LLM-as-a-Judge, RAGA relevancy scoring, and rubric-based assessments. Designed and implemented guardrail alignment layers, prompt injection defense, output filtering, and behavioral constraints, ensuring robust, safe deployment at scale.
↳ Responsible AI at production scale, bridging research-quality safety practices with real deployment requirements.
Career
Experience
May 2025 – Present
Tech Lead & Chief Scientist
SelfMinds AI · Arlington, VA
Founded and leading AI R&D. Building agentic AI systems, multimodal RAG pipelines, and automation-from-demonstration frameworks.
Apr 2019 – Dec 2025
Senior Research Scientist
RIKEN National Research Institute (AIP) · Tokyo, Japan / Remote
Led research on generative AI, multimodal LLMs, multi-task learning, and SHINRA. Published in ACL, Interspeech. Distinguished Reviewer for ACL, AAAI, CVPR, ICML, ECCV.
Mar 2020 – Mar 2021
Computer Vision Lead, R&D
Yodayoda Co. Ltd. · Kyoto, Japan
Led 7-person team building autonomous map generation systems for robots and self-driving vehicles. Secured government funding.
Nov 2015 – Mar 2019
Postdoctoral Researcher
Kyoto University · Kyoto, Japan
Three concurrent government-funded projects. Led a team of 8. Won JSPS Kakenhi grant and Kyoto University ICT Innovation Award.
Apr – Oct 2015
R&D Engineer
Kyoto Robotics · Kusatsu, Japan
Path planning, collision detection, and grasp planning for industrial robotic arms in confined spaces. Won NEDO Prize from Japan Ministry of Economy.
2012 – 2014
Researcher & Research Assistant
Kyoto University · Kyoto, Japan
Object tracking with RGB-D data; face reconstruction; bioimaging and cell signaling ML applications.
2004 – 2011
Early Career
Various · Tehran, Iran
Robotics research (RoboCup 3rd place, 2005), telecom systems, software QA, R&D management, and teaching assistant roles.
Publications
Research Snapshot
44
Conference Papers
7
Journal Papers
1
Book Chapter
500+
Citations
Venues
Efficient Diverse Ensemble for Discriminative Co-Tracking
Built DEDT, an ensemble tracker with artificial diversity generation and active learning. Outperformed state-of-the-art on OTB50, OTB100, and VOT2015.
Q-Learning Scheduler for Multi-Task Learning
Applied reinforcement learning to MTL training scheduling, outperformed 11 baseline schedulers across classification, tagging, and translation tasks.
Adaptive Listening Difficulty Detection for L2 Learners
Used ASR error patterns as a proxy for human listening difficulty, enabling smart adaptive captions that significantly improved comprehension (p<0.001).
Full list on Google Scholar. Explore papers in depth →
Visual Tracking under Adversity
Robust object tracking in challenging real-world conditions, occlusion, scale change, illumination variation, background clutter, low resolution. Developed ensemble methods, active learning, Q-learning memory balancing, and adversarial domain adaptation for reliable multi-domain tracking.
Multi-Task & Transfer Learning for NLP
Learning shared representations across diverse NLP tasks without catastrophic forgetting. Explored reinforcement learning for task scheduling, uncertainty-guided regularization, and fine-grained conversational intent classification for implicit attack detection.
AI for Language Learning
Adaptive ASR-powered captioning systems for second-language learners. A decade-long program combining speech processing, active learning, and educational technology, producing real-time partial synchronized captions that measurably improve L2 listening comprehension.
Deep Dive
Research in Depth
16 selected papers across four research threads. Drag or use arrow keys to navigate. Click any card for the full STAR breakdown.
Datasets
Recognition
Awards & Honors
JSPS Kakenhi: Grants-in-Aid for Scientific Research, Japan Society for the Promotion of Science
ICT Innovation Award: Kyoto University
IEEE Best Paper Award: ICSIPA'17 (Efficient Asymmetric Co-Tracking)
NEDO Prize: Japan Ministry of Economy, Trade and Industry
MEXT Scholarship: Monbukagakusho, Ministry of Education, Japan (2011–2014)
Exceptional Talent: Amirkabir University of Technology (Ranked 1st, AI Dept.)
3rd Place: RoboCup International, Soccer Simulation League, Osaka, Japan
Reviewer & Service
Distinguished Reviewer & Organizing Committee: ACL, AAAI, COLING, CVPR, ICML, ECCV, ACCV, Interspeech. Top Reviewer of 2019 (Publons). IEEE Member since 2009.
Get in Touch
Let's work together.
I'm currently open to Senior Applied Scientist, Staff Scientist, and Technical Lead roles, particularly in generative AI, multimodal systems, or computer vision. I work well in collaborative environments where I can combine hands-on technical depth with team leadership.
Based in Arlington, VA. US Permanent Resident (Green Card). Open to remote, hybrid, or on-site (DC metro area preferred).
