Disappointed: Vision feature is unusable (Outputs infinite <pad> tokens)
Before detailing the main issue, I want to note that the standard text/conversation mode works fine. In fact, it can even be jailbroken quite easily (for example, getting it to generate product serial numbers, regardless of whether they actually work).
However, I was really excited to test the vision capabilities of this model, and I'm quite disappointed to find that it doesn't seem to work at all for image recognition.
Whenever I attempt to process an image, the model gets stuck in an endless generation loop. I have tried adjusting the generation parameters to force more deterministic output, specifically by setting temperature=0 and completely disabling the thinking feature, but the result is exactly the same.
Output:
<pad> <pad> <pad> <pad> <pad> <pad> <pad> ... [repeats infinitely]
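In case it helps anyone reproduce or triage this, here is a minimal sketch of how the failure could be flagged automatically instead of by eye. It only inspects the decoded text and does not call the model. The helper names `pad_fraction` and `is_pad_loop` and the 0.9 threshold are my own assumptions, not part of any library.

```python
# Hypothetical helpers for spotting the failure described above.
# Assumption: the pad token appears in the decoded text as the literal "<pad>".
PAD_TOKEN = "<pad>"


def pad_fraction(decoded: str, pad_token: str = PAD_TOKEN) -> float:
    """Return the share of whitespace-separated tokens equal to pad_token."""
    tokens = decoded.split()
    if not tokens:
        return 0.0
    return sum(t == pad_token for t in tokens) / len(tokens)


def is_pad_loop(decoded: str, pad_token: str = PAD_TOKEN,
                threshold: float = 0.9) -> bool:
    """Flag outputs that consist almost entirely of pad tokens.

    The threshold is an arbitrary choice for illustration.
    """
    return pad_fraction(decoded, pad_token) >= threshold


if __name__ == "__main__":
    sample = "<pad> <pad> <pad> <pad> <pad> <pad> <pad>"
    print(is_pad_loop(sample))
```

Running the same check on a text-only prompt and on an image prompt would show whether the problem is limited to the vision path, which matches what I am seeing.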