Can generative AI write contextual text descriptions?

Posted on Monday, 24 March 2025 by Craig Abbott in Design and development

Tags: Artificial Intelligence, Assistive Technology

In 2025, Artificial Intelligence (AI) and Large Language Models (LLM) like ChatGPT, Gemini, Claude, and DeepSeek are being used for everything. Writing emails. Generating code. Even applying for jobs. But, can they write good text descriptions for images?

Many people rely on text descriptions, especially people who use screen readers. But, they’re often missing or poorly written. Given how widely AI is being used for writing content, we wanted to see how well an LLM can handle the task.

A good text description should help somebody to understand the image within the context of the surrounding text and why it was chosen. For example, if you had a photo of the Pyramids of Giza, you might describe it differently depending on where you use it.

On a:

Travel blog, you might focus on the location, terrain and cultural significance
History website, you might focus on the construction and historical background
Photography portfolio, you might focus on the lighting, composition and mood

We wanted to see if an AI can do this. So, we put some of the most popular LLM to the test, asking them to describe images with and without additional context, to try and understand:

Are AI-generated text descriptions good enough to use as they are?
Can they at least provide a strong starting point for a human to edit later?
Does providing additional context improve the quality of AI-generated text descriptions?

What we did

The LLM we tested

We set out to test the top five most popular LLM, which are:

OpenAI ChatGPT
Anthropic Claude
Meta Llama
Google Gemini
DeepSeek

However, we could not get DeepSeek to process images, so we included a popular open-source LLM called MiniCPM V instead.

MiniCPM claims to have better image recognition abilities than ChatGPT, Claude and Gemini. With such a bold claim, it felt like an interesting one to include!

The prompts we used

We had the LLM describe each photo twice.

First, we provided the image on its own with no context, using the prompt:

Generate alternative text for this photo. Use plain language and 10 to 20 words to highlight the most important information for non-sighted people.

Then, we provided the image along with a paragraph to provide additional context, using the prompt:

Generate alternative text for this photo using the provided context. Use plain language and 10 to 20 words to highlight the most important information for non-sighted people. Provided context: "This line was replaced with a paragraph for context."
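As a minimal sketch, the two prompts above can be produced from one template so the wording stays identical between runs. The function name `build_prompt` and the constant `BASE_INSTRUCTION` are hypothetical, chosen for illustration only. How the prompt and image are then sent to each LLM is not shown here.

```python
from typing import Optional

# Shared instruction, copied from the two prompts used in the tests.
BASE_INSTRUCTION = (
    "Use plain language and 10 to 20 words to highlight the most "
    "important information for non-sighted people."
)


def build_prompt(context: Optional[str] = None) -> str:
    """Return the image-only prompt, or the context prompt if context is given."""
    if context is None:
        return f"Generate alternative text for this photo. {BASE_INSTRUCTION}"
    return (
        "Generate alternative text for this photo using the provided context. "
        f'{BASE_INSTRUCTION} Provided context: "{context}"'
    )
```

Calling `build_prompt()` gives the first prompt, and `build_prompt("The image will be used on a political blog.")` gives the second with that paragraph in place.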

Before giving the images to the LLM, we removed any metadata associated with the image file, as we figured the LLM might read it and use it for clues. This is not necessarily a bad thing. For example, if it reads the location where the photo was taken, then perhaps it might provide a better description. However, metadata is never guaranteed to be in any photo, especially as a lot of tools and platforms strip it out when processing or uploading images. So, we chose to remove it.
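The post does not say which tool was used to remove the metadata. As one possible sketch, assuming the photos are JPEG files, the segments that usually hold metadata can be dropped using only the Python standard library. The function name `strip_jpeg_metadata` is hypothetical. It removes APP1 (Exif and XMP), APP13 (IPTC) and comment segments, and leaves everything else alone, so it is a rough illustration and no substitute for a dedicated tool.

```python
import struct

# Segment markers that commonly carry metadata (an assumption for this sketch):
# 0xE1 = APP1 (Exif, XMP), 0xED = APP13 (IPTC), 0xFE = COM (comments).
METADATA_MARKERS = {0xE1, 0xED, 0xFE}
START_OF_SCAN = 0xDA


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return the JPEG bytes with common metadata segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a segment marker at byte {pos}")
        marker = data[pos + 1]
        if marker == START_OF_SCAN:
            # Everything from here on is compressed image data: copy it as-is.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        if length < 2:
            raise ValueError(f"invalid segment length at byte {pos}")
        segment_end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            out += data[pos:segment_end]
        pos = segment_end
    raise ValueError("no start-of-scan marker found")
```

A typical use would be to read a file with `open(path, "rb")`, pass the bytes through the function, and write the result to a new file before uploading it.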

The images we chose

Image 1: The "Friends Apartment" Building

I took this photo of my wife walking in front of the "Friends Apartment" on Bedford Street, in New York.

We chose it because it's a famous building that attracts thousands of tourists, yet it looks unassuming and reasonably generic.

We wanted to see if the LLM could recognise the cultural significance of the building or its location.

The additional context we provided to the LLM:

The image will be used on a fan website for "Friends", a popular American sitcom that aired from 1994 to 2004. Based in New York, it followed the lives and careers of six friends: Rachel, Ross, Monica, Chandler, Joey, and Phoebe. Friends is known for its witty humour, relatable characters, and iconic catchphrases. It quickly became a cultural phenomenon, influencing fashion, language, and sitcom storytelling. Decades after it finished, it remains one of the most rewatched series in television history.

Image 2: The Statue of Liberty Replica

I took this photo on a recent trip to Las Vegas in front of the New York-New York Hotel and Casino.

We chose this photo because the Statue of Liberty is one of the most instantly recognisable landmarks in the world. But, this isn't actually the Statue of Liberty.

We wanted to understand if the LLM could reason with what it found in the photo, such as the roller coaster, and come to the conclusion that this is not the actual Statue of Liberty.

The additional context we provided the LLM:

The image will be used on a travel website for "Las Vegas", a vibrant city famous for its huge casinos, luxury hotels, and themed attractions. The New York-New York Hotel and Casino stands out, with its striking replica of the New York City skyline, including the Statue of Liberty and the Brooklyn Bridge, which are intertwined with "The Big Apple" rollercoaster.

Image 3: The Gender Assumption

A head and shoulders portrait of a person on a bright street. They are black, with short black hair and wearing a black t-shirt. They have piercings, wearing a silver earring in their right ear and a silver U-shaped ring through the septum of their nose. They are smiling, looking off camera to their left. The background is blurred, but sunlit buildings, cars and palm trees are still visible. — Image credit, Aaron Amat / Alamy Stock Photo

This stock image was listed as: "Young African-American transgender, smiling confident at street."

We chose this photo because it's important to use inclusive language when describing images, and this includes making appropriate choices about people's pronouns or gender.

The person in the photo is described as being transgender, but there are no specifics about how they identify. So, we wanted to understand if the historical data used to train the LLM would influence the language they choose or assumptions they make.

The additional context we provided the LLM:

The image will be used on a social media post to promote inclusive language and respect for people's pronouns, which is essential for fostering dignity, respect, and belonging. We should acknowledge peoples identities and reduce discrimination for transgender and non-binary people. Misgendering can cause harm. By using inclusive language in professional, educational, and social settings, we promote equity, ensuring everyone is seen and valued for who they are.

Image 4: The 3 Finger Salute

A large crowd of people demonstrating in Thailand. They're wearing masks and have their right hand above their heads making a three finger salute, symbolising revolution. Their thumb is crossed over their palm, holding down their smallest finger, and their middle three fingers are together, pointing straight up. In the centre is a sign being held which reads: 'It is the music of a people who will not be slaves again.' — Image credit, SOPA Images Limited / Alamy Stock Photo

This stock image was listed as: "Protesters flash a three-finger-salute during the demonstration at the Democracy Monument."

This photo is from recent protests in Thailand, where people gathered at the Democracy Monument to demand change from their government. The three finger salute was taken from the Hunger Games, a book by Suzanne Collins. In the book, the salute is used as a gesture for revolution.

We chose this photo because without context, it's very difficult to understand what's going on. You can't see any facial expressions and the salute is very similar to other known gestures, like "the scouts promise".

The only real clue this is a protest is the sign being held up in the centre of the frame, which reads: "It is the music of a people who will not be slaves again."

We wanted to understand if the LLM could describe the image with enough context to make it meaningful. Can it understand the location, what is happening, or the mood of the people?

The additional context we provided the LLM:

The image will be used on a political blog. The three-finger salute, a symbol of revolution inspired by "The Hunger Games", became a powerful gesture during protests at the Democracy Monument in Thailand. It signifies defiance against authoritarian rule, unity among protesters, and a call for freedom, justice, and equality. Despite crackdowns, the salute remains an enduring emblem of resistance in Thailand’s fight for democracy.

Results and observations

Image 1

This is how each LLM described the Friends Apartment Building, with and without context.

LLM	Without context	With additional context
OpenAI ChatGPT (4o)	A historic brown-brick corner building with fire escapes. A red-and-blue café occupies the ground floor, with outdoor seating, trees, and street signs in a city setting.	The iconic Friends apartment building in New York, with its red-brick exterior and corner café, featured in the sitcom.
Anthropic Claude (3.5 Sonnet)	A historic apartment building in New York City with a red and blue corner cafe at street level. Trees line the sidewalk.	The iconic corner building from "Friends" featuring the red and blue storefront of Central Perk café.
Meta Llama (3.2 Vision)	A corner of an apartment building in New York City, showcasing its brown brick facade with ornate details.	The exterior of the apartment building featured in the popular American sitcom "Friends," which aired from 1994 to 2004.
Google Gemini (2.0 Flash)	The building exterior used in the TV show "Friends" as the apartment building, located in the West Village of New York City, with a red awning covering a cafe on the ground floor.	New York City apartment building exterior, iconic from the TV show "Friends."
MiniCPM V	A red brick apartment building with fire escapes stands near an urban street corner under a partly cloudy sky.	An iconic New York City building where "Friends," a popular American comedy show that aired between 1994-2004 was set.

Even without any additional context, Claude, Llama and Gemini all realised it was New York City. But, only Gemini managed to link it to the show, Friends.

When we added the additional context, all five LLM used it to link the building to the show. However, Gemini, for some reason, gave a worse response when it had access to more information, dropping the West Village location and the café.

The text descriptions without additional context were all fairly generic, and most LLM seemed drawn to the colour of the café.

Image 2

This is how each LLM described the Statue of Liberty Replica in Las Vegas:

LLM	Without context	With additional context
OpenAI ChatGPT (4o)	A large green Statue of Liberty replica stands on a stone pedestal against a clear blue sky. A bird is perched on the statue's torch. A red roller coaster and buildings are visible in the background.	Replica of the Statue of Liberty at the New York-New York Hotel in Las Vegas, with a rollercoaster looping behind.
Anthropic Claude (3.5 Sonnet)	A replica of the Statue of Liberty stands outside the New York-New York Hotel in Las Vegas, with a red roller coaster visible behind it.	A replica of the Statue of Liberty stands tall against blue sky at Las Vegas' New York-New York casino.
Meta Llama (3.2 Vision)	The iconic Statue of Liberty replica situated outside New York-New York Hotel & Casino in Las Vegas.	A striking replica of the Statue of Liberty situated in front of the New York-New York Hotel and Casino in Las Vegas.
Google Gemini (2.0 Flash)	A tourist snapshot of the Statue of Liberty taken from below, showing its full figure and torch against a clear blue sky, highlighting the monument's scale and iconic presence.	Statue of Liberty replica at New York-New York Hotel in Las Vegas with blue sky.
MiniCPM V	An amusement park statue resembling Lady Liberty with surrounding buildings including one with red roller coaster tracks behind it.	New York-New York Hotel & Casino features New York City landmarks like The Statue of Liberty and Brooklyn Bridge, intertwined within "The Big Apple" rollercoaster in Las Vegas.

Although Gemini was one of the better LLM in the first test, this time it was the only one not to realise it was a replica, incorrectly describing the actual Statue of Liberty.

ChatGPT did OK, but described a bird that isn't there. Though, we can see why it perhaps thought the flames of the torch were the back-end of a pigeon.

Claude and Llama were able to correctly identify the location, while the others fell back to just describing objects in the image, such as the roller coaster.

When we provided additional context, most of the LLM generated a more focused response.

Claude, with access to more context, dropped the rollercoaster from its description and described the sky instead, which is arguably worse.

Image 3

This is how each LLM described the photo of the transgender person:

LLM	Without context	With additional context
OpenAI ChatGPT (4o)	A smiling person with short hair, a nose ring, and earrings wears a black T-shirt outdoors in a sunny urban setting with palm trees and blurred buildings in the background.	A smiling person with short hair and piercings stands outdoors in a sunny urban setting. Using inclusive language and respecting pronouns fosters dignity and belonging for all.
Anthropic Claude (3.5 Sonnet)	A person with a warm, genuine smile wearing a black t-shirt stands outdoors on a sunny day, with buildings and palm trees visible behind them.	A person with short dark hair and a nose piercing smiles warmly outdoors, wearing a black t-shirt.
Meta Llama (3.2 Vision)	Two young Black men standing together in front of a blurred background.	The image features two young Black individuals standing in front of a blurred background.
Google Gemini (2.0 Flash)	Sorry, I can't help with images of people yet.	Sorry, I can't help with images of people yet.
MiniCPM V	An African American woman wearing earrings with a smile; she is standing outdoors in what looks like a town setting or street with buildings behind her that are out of focus.	A person with short cropped hair smiles happily on the street while wearing earrings and rings in their nose, nostril, upper lip, and earlobe.

Three of the LLM failed this test.

Gemini just refused to caption it at all
MiniCPM described the person in the photo as a woman and used "her" pronouns
Llama hallucinated, describing two men in the photo

When we provided more context, all of the LLM used more neutral and inclusive language. But, Llama was still convinced there were two people for some reason, and Claude again removed details, which made its response arguably worse.

Image 4

This is how each LLM described the photo of the protest with the three finger salute:

LLM	Without context	With additional context
OpenAI ChatGPT (4o)	A large crowd of protesters raises three-finger salutes. A central sign reads, "IT IS THE MUSIC OF A PEOPLE WHO WILL NOT BE SLAVES AGAIN." Many wear masks.	A large crowd of protesters in Thailand raise three-finger salutes, a symbol of resistance against authoritarian rule. A sign in the foreground reads: "IT IS THE MUSIC OF A PEOPLE WHO WILL NOT BE SLAVES AGAIN." Many wear masks, showing unity in their fight for democracy.
Anthropic Claude (3.5 Sonnet)	A protest crowd with raised hands, many wearing masks. A cardboard sign reads "IT IS THE MUSIC OF A PEOPLE WHO WILL NOT BE SLAVES AGAIN."	Crowd raises three-finger salutes with protest sign reading "It is the music of a people who will not be slaves again"
Meta Llama (3.2 Vision)	A large crowd of people gathered together, with many hands raised in protest.	A large crowd of protesters holding up their hands in a three-finger salute, with one person in the center holding a sign that reads "IT IS THE MUSIC OF A PEOPLE WHO WILL NOT BE SLAVES AGAIN.
Google Gemini (2.0 Flash)	Sorry, I can't help with images of people yet.	Sorry, I can't help with images of people yet.
MiniCPM V	The crowd is raising their hands with signs expressing messages of solidarity and resistance against enslavement or oppression as indicated by one sign that says "IT IS THE MUSIC OF A PEOPLE WHO WILL NOT BE SLAVERY AGAIN.	In this photo from a protest at Democracy Monument in Bangkok, Thailand, people raise their hands to make the three-finger salute "a symbol of defiance against authoritarian rule inspired by The Hunger Games" and hold up signs like "It is the music of a People who will not be slaves again."

Gemini, again, refused to caption the image.

ChatGPT was the only LLM to correctly identify the three finger salute but was unable to link it to Thailand or the Hunger Games. The others just described "raised hands", which isn't particularly useful.

Of the LLM that did provide a description, all managed to figure out it was a protest. Our hunch is that they used the text from the signage, as Claude and ChatGPT demonstrated their character recognition skills by including it in the description. MiniCPM gave it a good go, but incorrectly identified the word "slaves" as "slavery".
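A mistake like "slavery" for "slaves" is easy to miss by eye. As a small sketch, assuming a human has already transcribed the sign, a description can be checked to see whether it quotes the sign text correctly. The names `normalise` and `quotes_sign_correctly` are hypothetical, and the check ignores case and punctuation.

```python
import re

# The sign text as transcribed by a human from the photo.
SIGN_TEXT = "It is the music of a people who will not be slaves again."


def normalise(text: str) -> str:
    """Lowercase the text, drop punctuation, and collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def quotes_sign_correctly(description: str, sign_text: str = SIGN_TEXT) -> bool:
    """Return True if the description contains the sign text word for word."""
    return normalise(sign_text) in normalise(description)
```

Run against the responses above, the ChatGPT description passes this check, while the MiniCPM description without context fails because of the word "SLAVERY".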

When we provided additional context, with the exception of Gemini, all of the LLM changed their response and correctly described the three finger salute. However, only ChatGPT and MiniCPM mentioned that it was Thailand.

Conclusions and thoughts

There is no clear winner

Claude seems to be the most predictable when providing additional context. It consistently gave more focused and more concise descriptions on the second test, but perhaps over-simplified the text descriptions at times.

ChatGPT was a strong performer when provided with just the image. Although it could perhaps be overly descriptive, focusing on small details that aren't really important.

Although there wasn't a clear winner, we have to rank Gemini as the worst performer. Across four photos and eight tests, it refused to describe half of them. The ones it did attempt were pretty good, though it was the only LLM that fell into the trap of describing the actual Statue of Liberty.

AI struggles without context

When given just the image, all five LLM could definitely describe things in it, but they didn't really know how to focus their attention on the correct details.

Without context, LLM tend to describe surface-level details like colours and generic objects, but they lack the depth needed to make a text description add to the narrative of the surrounding content.

Context definitely helps

In most cases, providing additional context to an LLM results in much more focused descriptions.

ChatGPT, Claude and MiniCPM used the context well, correctly adjusting their responses to include key details, like the Friends Apartment and the three finger salute.

Both Claude and Gemini had situations where they performed worse when provided with more context, so it doesn't look like a solution is to just provide more and more information.

Llama didn't really do much with the additional context at all. With the exception of the Friends apartment, it gave very similar responses with both prompts.

All the LLM make mistakes

All the LLM performed quite well at a glance. It's clear they can churn out text descriptions at pace, but they all made mistakes.

Some were minor. It probably doesn't matter that ChatGPT described a bird on the statue that wasn't there. But when two of the LLM potentially misgendered a person, that's alarming.

It's clear that LLM can, and will, write text descriptions that are potentially harmful. So for us, this reinforces the need for human review.
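One way to support that review is to flag descriptions that assume a gender, so a person can check them first. The sketch below is a deliberately simple word match. The word list and the function name `gendered_terms_in` are assumptions made for illustration, and a flag only means "look at this one", not that the description is wrong.

```python
import re
from typing import List

# A short, incomplete list of words that assume a person's gender.
GENDERED_TERMS = {
    "man", "men", "woman", "women", "boy", "girl",
    "he", "him", "his", "she", "her", "hers",
}


def gendered_terms_in(description: str) -> List[str]:
    """Return the gendered words found in the description, sorted A to Z."""
    words = re.findall(r"[a-z]+", description.lower())
    return sorted(set(words) & GENDERED_TERMS)
```

For the Image 3 responses without context, this flags "her", "she" and "woman" in the MiniCPM description and "men" in the Llama description, and flags nothing in the Claude description.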

Hallucinations are real

LLM are known to hallucinate. They write facts that aren't true and describe things that aren't there. Nobody really knows how often they do it, but even just running these 40 tests, we saw it happen.
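The figure of 40 comes from five LLM, four images and two prompts each. As a quick sketch, the full set of test runs can be listed with the Python standard library. The variable names and the short image labels are hypothetical.

```python
from itertools import product

MODELS = ["ChatGPT", "Claude", "Llama", "Gemini", "MiniCPM V"]
IMAGES = [
    "Friends Apartment",
    "Statue of Liberty replica",
    "Gender assumption",
    "Three finger salute",
]
PROMPTS = ["without context", "with additional context"]

# Every combination of model, image and prompt: 5 x 4 x 2 = 40 runs.
TEST_RUNS = list(product(MODELS, IMAGES, PROMPTS))
```

Each entry is a tuple such as `("Claude", "Friends Apartment", "without context")`, which makes it easy to record a result against every run and spot any that were missed.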

The bird on the statue is probably just an honest mistake. The flame on the torch does look a bit like a pigeon. But, when Llama described two people in a photo, when there is clearly only one, that's definitely a hallucination. It would make it almost impossible for somebody to get an accurate understanding of that image using a screen reader.

AI-generated text descriptions are probably better than nothing

Humans have a really strong track record for writing bad text descriptions, or just leaving them out altogether. So, the bar isn't particularly high for an LLM.

In our tests, we did observe many of the LLM do a reasonable job of describing each photo and taking the context into account. So, as long as they're accurate, vague AI-generated text descriptions are probably still better than nothing at all.

Unfortunately, no LLM did a good enough job every time, so we wouldn't trust it to provide text descriptions in bulk without checking each one.

AI-generated text descriptions are a good starting point

This looks like the most promising use case for AI-generated text descriptions at the moment - to give you something to work from.

Often, the blank page is the hardest part of writing anything. Humans could definitely use an LLM to quickly generate a basic text description, use additional prompts to refine it, and then manually check it for accuracy, context and clarity.

Léonie Watson explores this idea in practice in the post "Adventures with BeMyAI".

You can give BeMyAI a photo, then chat back and forth with it to understand it. You can ask it questions, for more detail, or to focus on different elements in the frame.

Instead of just accepting the first response, this iterative, interactive approach makes it far more likely you'll actually get something useful that meets your needs.

Final thoughts

It's clear AI can help with writing text descriptions, but it’s not a perfect solution. If you're going to use AI, we think that, right now, the best approach is human-AI collaboration.

If you can prompt an LLM effectively, and you understand the current limitations of the technology, AI is definitely a useful tool.

So, if you want to use AI to assist you in writing text descriptions, we recommend these five tips:

Be specific: Tell the LLM what the description is for and how it will be used
Keep it concise: Aim for 10–20 words and focus on the most important details
Provide context: Give the LLM supporting information alongside your prompt, as it will use it
Be sceptical: Manually check responses for mistakes, hallucinations, and missing information
Make it your own: Use AI as a tool, but apply your own judgment and writing style to refine any response
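The 10 to 20 word target is the easiest of these tips to check automatically. A minimal sketch, counting words by splitting on whitespace (the function names are hypothetical):

```python
def word_count(description: str) -> int:
    """Count whitespace-separated words; 'three-finger' counts as one word."""
    return len(description.split())


def within_word_limit(
    description: str, minimum: int = 10, maximum: int = 20
) -> bool:
    """Return True if the description has between minimum and maximum words."""
    return minimum <= word_count(description) <= maximum
```

By this count, several responses in the results tables ran over the limit even though the prompt asked for 10 to 20 words. For example, the ChatGPT description of the Friends Apartment without context is 26 words long.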

Next steps

Find out more about inclusive XR: accessible 3D experiences or browse our training courses and training programmes.
