Open to Work

Kourosh Meshgi, PhD

Senior Applied Scientist & Technical Lead

ML · Computer Vision · NLP · Generative AI

I build AI systems that work, from research prototype to production deployment. With a PhD from Kyoto University and 10+ years spanning academic research, government-funded projects, and industry leadership, I specialize in the intersection of computer vision, natural language processing, and multimodal generative AI.

Based in Arlington, VA US Permanent Resident

What I Do

Generative AI & LLM Systems

Fine-tuning, red-teaming, RAG pipelines, multimodal LLMs, agentic workflows (A2A, MCP). From model evaluation to production-ready deployment.

GPTLLaVAQwen-VLCLIPLoRADPORAG

Computer Vision & Perception

Object tracking, detection, segmentation, scene understanding, video analysis. 15+ papers on robust visual tracking under real-world conditions published in CVPR, ICIP, ACCV.

Object TrackingYOLOFlorenceOpenCVPyTorchVideo Analysis

NLP, Speech & Multimodal Systems

Text classification, multi-task learning, named entity recognition, ASR-based systems, cross-lingual knowledge. Published in ACL and Interspeech.

BERTTransformersASRMTLNERMultilingual

Featured Projects

SelfMinds AI · 2025–Present

Automation-from-Demonstration (AfD)

Converts raw screen recordings into executable automation plans through self-corrective imitation learning, the system watches a human demonstrate a task once, infers the underlying intent, and synthesizes a replicable policy without manual scripting. UI understanding fuses OCR, DOM, SOM (visual element detection and indexing), VLM, and transformer-based template matching into a unified perceptual layer. These signals are unified through graph-grounded UI reasoning, where screen elements become nodes in a relational graph and actions are resolved as edges, making the planner robust to layout changes and dynamic content. Context-aware trajectory synthesis then generates action sequences that adapt to state at each step rather than replaying a fixed script, and adaptive policy refinement continuously corrects drift using execution feedback, closing the loop between observation and action.

↳ Enables non-technical users to automate complex UI workflows without writing a single line of code.

VLMSOMOCRDOM ParsingGraph ReasoningImitation LearningAgentic AITransformer

SelfMinds AI · 2025–Present

Multimodal Video RAG System

A full-stack multimodal RAG pipeline that extracts every computable signal from video, visual, speech, textual, and semantic, and unifies them into a single timestamp-aligned knowledge structure. Scene boundaries are detected automatically; keyframes are captioned by a VLM; speech is transcribed and force-aligned at the word level; on-screen text is recovered via EAST detection and OCR; speakers are diarized without supervision and identified by name through LLM reasoning over transcript, title, and context signals. CLIP embeddings indexed in FAISS enable sub-second semantic frame retrieval by text or image query, while a hybrid search layer fuses transcript and OCR hits. An LLM-powered query planner routes natural language questions to the right analysis path, returning answers, timestamps, or assembled highlight reels.

↳ Turns any unstructured video into a fully queryable knowledge base, ask a question and get a moment.

RAGCLIPFAISSVLMWhisperSpeaker DiarizationOCRLLM Query Planning

SelfMinds AI · 2025–Present

Adaptive Agentic AI Simulations

Orchestrated multi-agent systems for personalized user behavior modeling, building behavior-driven simulations that support realistic agentic task execution and long-horizon planning. Agents communicate via the A2A protocol, each instantiated with a distinct persona and curated information context. Evaluation coverage is comprehensive: from replicating user perception and modeling decision-making behavior and reasoning, to identifying pain points and surfacing concrete improvement avenues, closing the loop between simulation and actionable product insight. A core addition: structured internal dialogue between agents, governed by a dedicated moderator agent that detects and corrects persona drift in real time, preventing the homogenization that collapses multi-agent discussions into a single voice and steering participants toward complementary, synergistic contributions.

↳ Full-spectrum evaluation: perception → behavior → reasoning → pain points → improvement, grounded in realistic persona-driven simulation at scale.

Multi-AgentAgent-to-AgentPersona SimulationBehavior ModelingAgent ModerationLong-horizon Planning

Yodayoda · 2020–2021

World Map Auto-Generation for Robots

Led a 7-person team building vision-based SLAM systems for urban autonomous navigation, producing enriched context-aware world models that go well beyond traditional navigation layers. The multi-modal perception pipeline fused camera-based scene understanding, LiDAR, IMU, and photogrammetry-driven 3D point cloud reconstruction with semantic environment parsing, enabling detection and reasoning over traffic dynamics, pedestrian behavior, urban structures, occlusions, road hazards, traffic signal patterns, and risk-prone intersections. External signals (crash statistics, environmental conditions, behavioral traffic patterns) were incorporated to model real-world operational risk directly in the map. A specialized focus on occlusion reasoning and perception failure analysis identified scenarios where both human drivers and autonomous systems fail to detect hidden objects or hazardous trajectories, embedding that safety awareness into the mapping framework. Robustness was validated across diverse traffic densities, weather, lighting, and urban layouts, including complex Japanese road systems and left-side driving, and extended by large-scale simulation environments covering globally diverse edge cases.

↳ Spatial AI at the intersection of computer vision, sensor fusion, and human safety. Context-aware maps that make autonomous navigation interpretable and failure-resistant in real-world urban complexity.

SLAMLiDAR / IMU Fusion3D Point CloudObject DetectionScene UnderstandingOcclusion ReasoningAutonomous VehiclesTeam Lead

RIKEN AIP · 2019–2023

SHINRA: Multilingual Wikipedia Knowledge Structuring

Existing structured knowledge bases (Wikidata, DBpedia, Freebase) are notoriously noisy, schema mismatches, ambiguous attributes, and sparse coverage across languages. SHINRA tackled this from both ends: a top-down Extended Named Entity (ENE) ontology of 219 fine-grained categories with 10–30 typed attributes each, combined with bottom-up population through a collaborative evaluation framework where research teams worldwide run their information extraction systems on the full Wikipedia and contribute outputs that are ensemble-merged. Two task tracks ran annually: the Japanese Attribute Extraction task (extracting structured attribute values from Wikipedia pages across 45+ entity categories, with 15+ systems and 40+ committee members) and the Multilingual Categorization task (classifying Wikipedia pages in 30 languages into all 219 ENE categories). Ensemble learning over participant outputs consistently outperformed every individual system, e.g., Airport attribute extraction jumped from 72 (best single system) to 87 F1 through ensemble merging.

↳ 30 languages, 219 entity categories, millions of Wikipedia pages structured. Global participation across 10 countries; 40+ PC members from Cambridge, UIUC, NII, Tohoku, and beyond.

NERMultilingual NLPKnowledge GraphsInformation ExtractionEnsemble LearningCross-lingual

Kyoto Robotics · 2015

Robotic Arm Trajectory Planning for Confined Spaces

Designed optimal trajectory planning and collision avoidance for industrial robotic arms operating in confined spaces alongside humans. Integrated structured-light 3D sensing for real-time obstacle and human detection, built predictive collision models to intercept hazards before they occurred, and implemented a digital kill switch, an emergency stop mechanism that instantly overrides all robot motion when human proximity is detected. Validated across a wide range of simulated and controlled scenarios per industrial safety standards.

↳ Safety-critical system; earned the NEDO Prize from Japan's Ministry of Economy, Trade & Industry.

Path Planning3D SensingCollision AvoidanceStructured LightPredictive ModelingHuman-Robot Safety

Kyoto University / RIKEN · 2015–2021 · Post-Kei & Samurai Projects

Object Tracking under Real-World Uncertainty

A government-funded R&D project in Japan, a public transportation scenario demanding a tracker that simultaneously handles conflicting challenges: occlusion, clutter, scale change, motion blur, and low resolution, all within a near real-time processing budget. No single classifier handles the full range, and high-performance models are too slow. The core insight was asymmetric co-tracking: a fast but naive classifier handles the incoming stream continuously, and when it encounters uncertainty it queries a slower but more knowledgeable classifier for guidance, active learning as the bridge between speed and accuracy, modified to respect tight latency constraints.

Follow-on funding unlocked a sequence of extensions, each addressing a remaining failure mode. The fast classifier became an ensemble where each member processed overlapping data windows in a boosting arrangement, spreading the version space more effectively. A mixture of long and short-term memories gave the tracker persistence through extended occlusions. Adversarially generated training samples hardened the model against domain shift. Q-learning replaced hand-tuned heuristics to balance when to consolidate long-term memory versus adapt short-term. An active critic mechanism generated maximally informative samples to promote tighter collaboration between co-learners. Later work embedded tacit knowledge from intermediate CNN layers and applied reinforcement learning to adapt correlation filter parameters on the fly.

↳ Delivered on time, results covered in the press. Funded extensions across two government projects spanning six years and 14 publications.

Visual TrackingActive LearningCo-TrackingEnsemble MethodsQ-LearningAdversarial TrainingReinforcement LearningComputer Vision

Kyoto University / RIKEN · 2015–2022

Smart ASR Captioning for Language Learners

Started with a meta-analysis that uncovered a key empirical overlap: ASR errors and L2 listening difficulties co-occur on the same speech segments, phonetically dense clusters, fast speech, and low-frequency vocabulary trip up both speech recognizers and second-language listeners alike, making ASR error rate a proxy signal for human listening difficulty without any learner annotations. That finding grounded the core system: PSC, Partial and Synchronized Caption, where forced alignment positions each word in time and the system selectively reveals only the hard segments, synchronized to the audio, scaffolding without dependency. From there, each study added a layer: exploiting ASR errors to choose which words to show, then learner-adaptive personalization through click feedback, then sentence complexity as an additional difficulty signal, then simulated annealing to auto-tune difficulty thresholds per proficiency level, and finally self-regulation training that gradually reduces caption density as learners improve.

↳ Comprehension and retention improved significantly (p<0.001) vs. no-caption control. Widely cited and praised in the CALL community. 12 papers, 100+ citations.

ASRForced AlignmentError AnalysisAdaptive CaptionsL2 ListeningActive LearningNLP

RIKEN AIP · 2023–2024

LLM Red-Teaming & Guardrail Alignment

Fine-tuned LLMs for production deployment and conducted systematic red-teaming to surface safety failures and misalignment. Built evaluation pipelines using LLM-as-a-Judge, RAGA relevancy scoring, and rubric-based assessments. Designed and implemented guardrail alignment layers, prompt injection defense, output filtering, and behavioral constraints, ensuring robust, safe deployment at scale.

↳ Responsible AI at production scale, bridging research-quality safety practices with real deployment requirements.

Red-TeamingLLM Fine-TuningGuardrailsLLM-as-JudgeRAGATrust & SafetyAlignment

Experience

Startup · Founder

May 2025 – Present

Tech Lead & Chief Scientist

SelfMinds AI · Arlington, VA

Founded and leading AI R&D. Building agentic AI systems, multimodal RAG pipelines, and automation-from-demonstration frameworks.

Academic R&D

Apr 2019 – Dec 2025

Senior Research Scientist

RIKEN National Research Institute (AIP) · Tokyo, Japan / Remote

Led research on generative AI, multimodal LLMs, multi-task learning, and SHINRA. Published in ACL, Interspeech. Distinguished Reviewer for ACL, AAAI, CVPR, ICML, ECCV.

Industry R&D

Mar 2020 – Mar 2021

Computer Vision Lead, R&D

Yodayoda Co. Ltd. · Kyoto, Japan

Led 7-person team building autonomous map generation systems for robots and self-driving vehicles. Secured government funding.

Postdoctoral Research

Nov 2015 – Mar 2019

Postdoctoral Researcher

Kyoto University · Kyoto, Japan

Three concurrent government-funded projects. Led a team of 8. Won JSPS Kakenhi grant and Kyoto University ICT Innovation Award.

Industry

Apr – Oct 2015

R&D Engineer

Kyoto Robotics · Kusatsu, Japan

Path planning, collision detection, and grasp planning for industrial robotic arms in confined spaces. Won NEDO Prize from Japan Ministry of Economy.

Academic Research

2012 – 2014

Researcher & Research Assistant

Kyoto University · Kyoto, Japan

Object tracking with RGB-D data; face reconstruction; bioimaging and cell signaling ML applications.

Diverse Roles

2004 – 2011

Early Career

Various · Tehran, Iran

Robotics research (RoboCup 3rd place, 2005), telecom systems, software QA, R&D management, and teaching assistant roles.

Research Snapshot

44

Conference Papers

7

Journal Papers

1

Book Chapter

500+

Citations

ACLCVPRInterspeechICIPACCVWWWMVACRV
CVPR 2018

Efficient Diverse Ensemble for Discriminative Co-Tracking

Built DEDT, an ensemble tracker with artificial diversity generation and active learning. Outperformed state-of-the-art on OTB50, OTB100, and VOT2015.

Computer VisionActive LearningObject Tracking
ACL 2022

Q-Learning Scheduler for Multi-Task Learning

Applied reinforcement learning to MTL training scheduling, outperformed 11 baseline schedulers across classification, tagging, and translation tasks.

NLPMulti-Task LearningReinforcement Learning
Interspeech 2021

Adaptive Listening Difficulty Detection for L2 Learners

Used ASR error patterns as a proxy for human listening difficulty, enabling smart adaptive captions that significantly improved comprehension (p<0.001).

ASRSpeechLanguage Learning

Full list on Google Scholar. Explore papers in depth →

Awards & Honors

2017

JSPS Kakenhi: Grants-in-Aid for Scientific Research, Japan Society for the Promotion of Science

2017

ICT Innovation Award: Kyoto University

2017

IEEE Best Paper Award: ICSIPA'17 (Efficient Asymmetric Co-Tracking)

2015

NEDO Prize: Japan Ministry of Economy, Trade and Industry

2011

MEXT Scholarship: Monbukagakusho, Ministry of Education, Japan (2011–2014)

2010

Exceptional Talent: Amirkabir University of Technology (Ranked 1st, AI Dept.)

2005

3rd Place: RoboCup International, Soccer Simulation League, Osaka, Japan

Reviewer & Service

Distinguished Reviewer & Organizing Committee: ACL, AAAI, COLING, CVPR, ICML, ECCV, ACCV, Interspeech. Top Reviewer of 2019 (Publons). IEEE Member since 2009.

Let's work together.

I'm currently open to Senior Applied Scientist, Staff Scientist, and Technical Lead roles, particularly in generative AI, multimodal systems, or computer vision. I work well in collaborative environments where I can combine hands-on technical depth with team leadership.

Based in Arlington, VA. US Permanent Resident (Green Card). Open to remote, hybrid, or on-site (DC metro area preferred).