Benchmark/comparison against MiniGPT-4? #44

152334H · 2023-04-18T05:35:53Z

Or just thoughts about which architecture should be expected to perform better with the same training data

ChunyuanLI (Collaborator) · 2023-04-19T20:18:27Z

It is great to see the community's recognition of and excitement about this direction. The two pieces of work were carried out independently during the same period.

IMHO, LLaVA is unique in three respects; see below.

More rigorous results: LLaVA has rigorous quantitative results, including the level of similarity with Visual Chat and GPT-4, SoTA accuracy on Science QA, and ablation studies on data iteration and model design. MiniGPT-4, on the other hand, lacks quantitative results.
Quality of chat demo: LLaVA can reproduce the results for the visual reasoning examples in the GPT-4 paper and has strong OCR capabilities. These features are impressive and unique, making it possibly the closest demo to Multimodal GPT-4. Check the results: https://llava-vl.github.io
Lastly, it should be clarified that the focus of this line of work is data-centric, not model-centric. As the differences between models diminish, data quality has a greater impact on results. We released our multimodal instruction-following data to help replicate Multimodal GPT-4. High-quality data is all you need; compared with that, the architecture is secondary.

1 reply

Pilot-LH Apr 26, 2023

I agree with the author that LLaVA is better than MiniGPT-4 in terms of demo quality and comprehensive analysis.
Regarding the last point, I attempted to fine-tune the BLIP-2 model (based on Flan-T5) using the high-quality data provided here, but did not get outputs as interesting as those of LLaVA or MiniGPT-4. While it's possible that I didn't execute the process properly, I'm curious whether architecture is indeed a secondary factor. Can Vicuna be substituted with the original LLaMA or other open-source models to yield comparable results? Such a development could result in a fully open-source model with greater impact, in my opinion.

ChunyuanLI (Collaborator) · 2023-04-19T22:15:53Z

Comparison of LLaVA and mini-GPT4 on the "Extreme Ironing" example from the OpenAI GPT-4 technical report.

LLaVA

mini-GPT4

Run 1:

Run 2:

3 replies

xdevfaheem May 8, 2023

Can you show the configs you set to get this result?

I really like LLaVA, and I really want LLaVA to perform better than mini GPT-4.

ldfandian Jul 9, 2023

I find it hard to believe that LLaVA is better.

miniGPT-4 uses "BLIP-2 Q-Former + projection layer", whereas LLaVA uses "purely a projection layer".
Technically, miniGPT-4 should be able to handle more sophisticated scenarios.

kyugorithm Jan 1, 2024

Regarding the vision encoder, it is true that the BLIP-2 Q-Former significantly outperforms CLIP. But LLaVA fine-tunes the LLM (Vicuna). Moreover, I believe LLaVA employs a more effective data-acquisition approach.

xdevfaheem · 2023-05-08T03:13:52Z

How will we mitigate the hallucination issue?

1 reply

Glavin001 Jul 10, 2023

My understanding of hallucinations is that training datasets often include answers the LLM has no way of deriving, such as a fact the LLM doesn't actually know, which trains it to make up reasonable-looking information instead of accurate information.
In the case of vision models, I wonder whether the visual encoding doesn't provide enough detail for all of the examples, so the LLM learns "make up something reasonable in this scene" instead of "I see this from the visual input".

mostafa-adel · 2023-12-09T21:02:50Z

0 replies

kinthaiofficial · 2026-04-29T00:53:01Z

Benchmarking vision-language models for agent use cases requires a different evaluation lens than the standard VQA/captioning benchmarks. The question isn't just "can it describe the image" but "can it reliably extract structured information that an agent can act on?"

A few dimensions that matter for agent-oriented VLM evaluation:

Structured extraction accuracy: given a screenshot, invoice, or document image, can the model produce valid JSON that matches a target schema? This is the practical use case for most agent integrations. Benchmark: schema conformance rate + field accuracy, not just text similarity.

Tool call correctness from visual context: "look at this chart and call the appropriate API with the data". The VLM needs to bridge vision → structured output → action. This isn't captured in standard VQA.

Spatial reasoning for UI interaction: "click the button that says Submit" requires both text recognition and spatial localization. This is different from document understanding.

Consistency across similar inputs: important for agents. If the same UI is rendered at different resolutions or zoom levels, does the VLM produce consistent structured output? Inconsistency breaks agent logic.

Cost per reliable extraction: a cheaper model that's 90% reliable may be better than an expensive one that's 95% reliable if its 10% of failure cases can be detected and handled by retry.
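
As a rough illustration of the first and last dimensions, the sketch below scores one model output for schema conformance and field accuracy, and estimates the expected cost of one successful extraction when failures are retried. The function names, the invoice fields, and the prices are all hypothetical, and "schema" is simplified here to a list of required top-level fields. The cost formula assumes failures are detectable and that retries are independent, which may not hold if a model fails consistently on the same input.

```python
import json


def score_extraction(raw_output, target, required_fields):
    """Return (conforms, field_accuracy) for a single model output.

    conforms: the output parses as a JSON object that contains every
        required field (a simplified stand-in for full schema validation).
    field_accuracy: fraction of required fields whose value equals the
        value in the reference `target` dict.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, 0.0
    if not isinstance(parsed, dict):
        return False, 0.0
    conforms = all(field in parsed for field in required_fields)
    correct = sum(
        1 for field in required_fields
        if field in parsed and parsed[field] == target[field]
    )
    return conforms, correct / len(required_fields)


def expected_cost_per_success(cost_per_call, reliability):
    """Expected spend until the first success when failures are retried.

    Assumes independent attempts, so the number of attempts is geometric
    with mean 1 / reliability.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return cost_per_call / reliability


# Hypothetical reference answer and three hypothetical model outputs.
target = {"invoice_id": "A-17", "total": 42.5}
required = ["invoice_id", "total"]
outputs = [
    '{"invoice_id": "A-17", "total": 42.5}',  # valid and correct
    '{"invoice_id": "A-17"}',                 # valid JSON, missing a field
    "not json",                               # unparseable
]
scores = [score_extraction(o, target, required) for o in outputs]
conformance_rate = sum(1 for ok, _ in scores if ok) / len(scores)
mean_field_accuracy = sum(acc for _, acc in scores) / len(scores)

# Hypothetical prices: 1 unit per call (90% reliable) vs 3 units (95%).
cheap = expected_cost_per_success(1.0, 0.90)
expensive = expected_cost_per_success(3.0, 0.95)
```

With these made-up prices, the cheaper model costs about 1.11 units per successful extraction against about 3.16 for the expensive one, so the comparison depends on the price ratio as much as on the reliability gap.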

For computer-use agents specifically, the relevant benchmark is: given N screenshots of common UI patterns, how often does the model correctly identify the interactive element to click? MiniGPT-4 and LLaVA serve different points on the cost/accuracy frontier here.
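
One simple way to score that benchmark, sketched under assumptions: treat a prediction as correct when the predicted click point falls inside the bounding box of the target element. The function names, the `(left, top, right, bottom)` box format, and the sample coordinates are hypothetical, and a missing prediction is counted as a miss.

```python
def click_hit(pred_xy, target_box):
    """True if the predicted click point lies inside the target box.

    pred_xy: (x, y) in pixels, or None if the model gave no coordinates.
    target_box: (left, top, right, bottom) in the same pixel coordinates.
    """
    if pred_xy is None:
        return False
    x, y = pred_xy
    left, top, right, bottom = target_box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions, target_boxes):
    """Fraction of screenshots where the click lands on the target element."""
    if len(predictions) != len(target_boxes):
        raise ValueError("predictions and target_boxes must have equal length")
    if not predictions:
        raise ValueError("need at least one screenshot")
    hits = sum(click_hit(p, b) for p, b in zip(predictions, target_boxes))
    return hits / len(predictions)


# Three hypothetical screenshots: a hit, a miss, and no prediction at all.
predictions = [(50, 20), (300, 300), None]
target_boxes = [(40, 10, 120, 40), (10, 10, 60, 30), (0, 0, 10, 10)]
accuracy = click_accuracy(predictions, target_boxes)
```

To compare renderings at different resolutions or zoom levels, coordinates would need to be normalized to the image size before scoring.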

We're integrating vision models as perception layers in KinthAI's agent execution environment: https://blog.kinthai.ai/openclaw-multi-tenancy-why-vm-per-user-doesnt-scale. The multi-modal agent patterns are new territory.

What's the primary visual input modality you're evaluating for — natural images, documents, or UI screenshots?

0 replies
