Hehai Lin

Hehai Lin is an incoming PhD student at The Hong Kong University of Science and Technology (Guangzhou). Before that, he received his Bachelor's degree from the School of Artificial Intelligence at Sun Yat-sen University, where he was supervised by Prof. Zhenhui Peng and Prof. Xiaobin Chang. He also works closely with Prof. Wenya Wang at the College of Computing and Data Science, Nanyang Technological University. His current research interests include multimodal learning and reasoning, especially the self-evolution ability of large vision-language models (LVLMs).


Education
  • Sun Yat-sen University

    B.S. in Artificial Intelligence, Sep. 2020 - Jun. 2024

Honors & Awards
  • JinDao Scholarship 2024
  • Third-class Scholarship of Sun Yat-sen University 2023
  • Honorable Mention in the ECV2023 Competition 2023
  • Fourth Place in the Second Intelligent Network Competition 2022
  • Second-class Scholarship of Sun Yat-sen University 2022
  • First-class Scholarship of Sun Yat-sen University 2021
Experience
  • Nanyang Technological University

    Research Assistant (supervised by Prof. Wenya Wang), Jun. 2024 - Nov. 2024

Service
  • Reviewer of COLING 2025 and WWW 2025
Selected Publications († corresponding author, * equal contribution)
Multi-view Analysis for Modality Bias in Multimodal Misinformation Benchmarks

Hehai Lin, Hui Liu, Shilei Cao, Haoliang Li, Wenya Wang

Under review 2024

Numerous multimodal misinformation benchmarks exhibit bias toward specific modalities, allowing detectors to make predictions based solely on one modality. Training detectors on such datasets can significantly degrade performance in real-world applications. While previous research has quantified modality bias at the dataset level or manually identified spurious correlations between modalities and labels, these approaches lack meaningful insights at the sample level and struggle to scale to the vast amount of online information. In this paper, we investigate how to automatically recognize modality bias at the sample level. Specifically, we introduce three views, namely modality benefit, modality flow, and modality causal effect, to quantify each sample's modality contribution based on different theories. To verify their effectiveness and discover patterns of bias, we conduct a human evaluation on two benchmarks, Fakeddit and MMFakeBench, and compare the performance of each view and their ensemble, multi-view analysis. The experimental results indicate that multi-view analysis yields the highest performance and aligns with human judgment on most samples. We further discuss the sensitivity and consistency of each view.
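
A minimal, hypothetical sketch of the ensembling step described in the abstract: each view (modality benefit, modality flow, modality causal effect) is assumed to emit a per-sample judgment of which modality dominates, and the multi-view label is taken by majority vote. The function name, label set, and tie-breaking rule are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from typing import Dict

def ensemble_views(view_labels: Dict[str, str]) -> str:
    """Combine per-view judgments for one sample into a multi-view label.

    view_labels maps each view ("benefit", "flow", "causal_effect")
    to the modality it judges dominant for this sample.
    """
    counts = Counter(view_labels.values())
    label, count = counts.most_common(1)[0]
    # Fall back to "balanced" when the three views all disagree.
    return label if count >= 2 else "balanced"

# Example: two of three views flag a text-only shortcut for this sample.
sample_views = {"benefit": "text", "flow": "text", "causal_effect": "image"}
print(ensemble_views(sample_views))  # -> "text"
```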

Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

Jiayi He*, Hehai Lin*, Qingyun Wang, Yi Fung, Heng Ji

Under review 2024

While Vision-Language Models (VLMs) have shown remarkable abilities in visual and language reasoning tasks, they invariably generate flawed responses. Self-correction, which instructs models to refine their outputs, presents a promising solution to this issue. Previous studies have mainly concentrated on Large Language Models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both the inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Specifically, we collect preferred and disfavored samples based on the correctness of the initial and refined responses, which are obtained through two-turn self-correction with VLMs during the inference stage. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their self-generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance the reasoning abilities of models through additional training, enabling them to generate high-quality responses directly without further refinement.
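
The preference-pair construction sketched below is an illustrative reading of the abstract, not the paper's implementation: after a two-turn self-correction rollout, a sample becomes a DPO training pair only when exactly one of the two responses is correct. All names (PreferencePair, build_pair, and the field names) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response (the correct one)
    rejected: str  # disfavored response (the incorrect one)

def build_pair(prompt: str,
               initial: str, initial_correct: bool,
               refined: str, refined_correct: bool) -> Optional[PreferencePair]:
    """Keep a pair only when exactly one of the two turns is correct."""
    if initial_correct and not refined_correct:
        return PreferencePair(prompt, chosen=initial, rejected=refined)
    if refined_correct and not initial_correct:
        return PreferencePair(prompt, chosen=refined, rejected=initial)
    return None  # both correct or both wrong: no preference signal

# The surviving pairs would then feed DPO-style preference fine-tuning
# of the VLM, without any external feedback model.
```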
