- Qwen-VL: A Versatile Vision-Language Model for Understanding . . .
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond - OpenReview
The overall network architecture of Qwen-VL consists of three components; the details of the model parameters are shown in Table 1. Large Language Model: Qwen-VL adopts a large language model as its foundation component. The model is initialized with pre-trained weights from Qwen-7B (Qwen, 2023)
- LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8%, using merely 0.3% of the training data and 23% trainable parameters. The results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for developing efficient MLLMs.
- Qwen2 Technical Report - OpenReview
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned
- Alleviating Hallucination in Large Vision-Language Models with. . .
To assess the capability of our proposed ARA model in reducing hallucination, we employ three widely used LVLM models (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks. Our empirical observations suggest that by utilizing fitting retrieval mechanisms and timing the retrieval judiciously, we can effectively mitigate the hallucination
- Qwen2.5 Technical Report - OpenReview
In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages.
- Visual CoT: Advancing Multi-Modal Language Models with a. . .
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high or when the interested region that could provide key information for answering the question is
- You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method.