How we use Gemini to map the chaos of real estate listings in Costa Rica

Scraping real estate data is easy; figuring out where the houses actually are is the hard part. We tested ChatGPT, DeepSeek, and Gemini against each other to solve our location mapping problem. Here is why we went all-in on Gemini 2.0 and how we improved our automated accuracy from 64% to 98%.

If you’ve ever tried to build a centralized platform by scraping data from more than 50 different sources, you know the pain. At propiedades.cr, we scrape listings from major players like Century21, Neocasa, LX Costa Rica and Inhaus. The scraping part is easy. The hard part? Making sense of where these houses actually are.

See, real estate agents are great at selling houses, but they aren’t always great at geography. In Costa Rica, we have a strict hierarchy: Province > Canton > District > Condo > Neighborhood. But an agent listing a house might just say, "Located in Lindora."

Here’s the problem: "Lindora" isn't a district. It’s a zone inside the district of Pozos. If our search engine expects a district, and the listing only gives a vague region, that property falls into a black hole.
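To make the gap concrete, here is a minimal sketch of an alias table that maps an informal zone name to the official hierarchy. The `ZONE_ALIASES` table and the `resolve_zone` helper are hypothetical names used for illustration, not our production code; the only entry shown is the Lindora example from above.

```python
from typing import Optional

# Hypothetical alias table: informal zone name -> official hierarchy.
# Lindora sits in the district of Pozos (canton of Santa Ana, San José).
ZONE_ALIASES = {
    "lindora": {"province": "San José", "canton": "Santa Ana", "district": "Pozos"},
}


def resolve_zone(text: str) -> Optional[dict]:
    """Return the official location for the first known zone found in the text."""
    lowered = text.lower()
    for zone, location in ZONE_ALIASES.items():
        if zone in lowered:
            return {"zone": zone.title(), **location}
    return None  # unknown zone: the listing still needs another way to be mapped


print(resolve_zone("Located in Lindora."))
```

A static table like this only covers zones someone has already written down, which is why free-text descriptions need something more flexible.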

That’s where Gemini comes in.

The LLM Wars: Why We Chose Gemini

Back in October 2025, we ran tests to solve this. We benchmarked ChatGPT, DeepSeek, and Gemini to see who could best understand Costa Rican geography.

The results were surprising. ChatGPT, despite being the most expensive, was actually the least successful at determining specific Tico locations. It just didn't get the nuances of our addresses. It came down to a tight race between DeepSeek and Gemini. At the time, they were neck-and-neck on metrics, with DeepSeek being slightly cheaper.

But fast forward to December, and Gemini started pulling ahead. The latency dropped, and the accuracy improved. We made the switch and haven't looked back. Right now, we are running everything on Gemini 2.0 Flash-Lite, and it is a beast.

The Process

Here is how our pipeline works today.

  • The Fetch: Our Python scraper grabs a property.
  • The Analysis: We feed the unstructured description into Gemini.
  • The Verdict: The model returns a structured JSON object with a "Confidence Rating."
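The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not our actual code: `build_prompt`, `analyze_listing`, and the prompt wording are hypothetical, and the model call is passed in as a plain function so the sketch does not depend on any particular SDK. The `fake_model` stand-in returns a canned reply so the example runs offline.

```python
import json
from typing import Callable


def build_prompt(description: str) -> str:
    """Assemble the instruction sent to the model (wording is illustrative)."""
    return (
        "Extract the condo, neighborhood, and province/canton/district "
        "from this Costa Rican listing. Reply with JSON only and include a "
        "Confidence of High, Medium, or Low for each part.\n\n"
        f"Listing description:\n{description}"
    )


def analyze_listing(description: str, ask_model: Callable[[str], str]) -> dict:
    """Send the unstructured description to the model and parse its JSON reply."""
    raw_reply = ask_model(build_prompt(description))
    return json.loads(raw_reply)


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call; always returns the same canned reply."""
    return json.dumps(
        {"Location": {"Province": "San José", "Canton": "San José",
                      "District": "Mata Redonda", "Confidence": "Low"}}
    )


result = analyze_listing("Apartamento en Rohrmoser, QBO Skyhomes", fake_model)
print(result["Location"]["District"])
```

Keeping the model behind a simple callable also makes it cheap to swap providers, which matters when you are benchmarking several of them.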

The whole process—from the moment the scraper touches the listing to the moment it’s fully analyzed and mapped—takes about 12 seconds.

If Gemini returns a High confidence rating, we automatically publish the location. If it’s Low, we flag it for manual review.
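As a rough sketch, that routing rule fits in a few lines. The `route` function and its return values are hypothetical, and sending anything that is not High (including Medium or a missing rating) to manual review is an assumption made here to keep the example conservative.

```python
def route(location: dict) -> str:
    """Decide what happens to a location based on the model's confidence."""
    if location.get("Confidence") == "High":
        return "publish"
    # Assumption: Medium, Low, and missing ratings all go to a human.
    return "manual_review"


print(route({"District": "Pozos", "Confidence": "High"}))
print(route({"District": "Mata Redonda", "Confidence": "Low"}))
```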

Take a look at this actual data from our logs. We fed it a listing description that was a bit all over the place. Gemini analyzed it and spit this out:

```json
{
  "Condo": {
    "Name": "Edificio Q-BO Skyhomes",
    "Confidence": "High",
    "Source Field": "description",
    "Reasoning": "The description explicitly mentions 'QBO Skyhomes', which is present in the condo list."
  },
  "Neighborhood": {
    "Name": "Rohrmoser",
    "Confidence": "Medium",
    "Source Field": "description",
    "Reasoning": "The description explicitly mentions Rohrmoser as the location."
  },
  "Location": {
    "Province": "San José",
    "Canton": "San José",
    "District": "Mata Redonda",
    "Confidence": "Low"
  }
}
```

Even though the confidence on the specific district was "Low" (likely because the agent didn't write "Mata Redonda"), Gemini successfully identified the building ("Q-BO Skyhomes") and the neighborhood ("Rohrmoser"). Because our internal database knows that Q-BO is in Mata Redonda, we can still map it perfectly.
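A minimal sketch of that fallback, assuming a lookup table keyed by condo name. The `KNOWN_CONDOS` table, the `resolve_location` helper, and the "Resolved Via" field are hypothetical; the values come from the log entry above.

```python
# Hypothetical internal table: condo name -> its known official location.
KNOWN_CONDOS = {
    "Edificio Q-BO Skyhomes": {
        "Province": "San José",
        "Canton": "San José",
        "District": "Mata Redonda",
    },
}


def resolve_location(result: dict) -> dict:
    """Prefer the location of a confidently identified condo over the model's guess."""
    condo = result.get("Condo", {})
    if condo.get("Confidence") == "High" and condo.get("Name") in KNOWN_CONDOS:
        return {**KNOWN_CONDOS[condo["Name"]], "Confidence": "High", "Resolved Via": "condo"}
    # No trusted condo match: keep whatever the model said about the location.
    return {**result.get("Location", {}), "Resolved Via": "model"}


model_output = {
    "Condo": {"Name": "Edificio Q-BO Skyhomes", "Confidence": "High"},
    "Location": {"Province": "San José", "Canton": "San José",
                 "District": "Mata Redonda", "Confidence": "Low"},
}
print(resolve_location(model_output))
```

The point of the design is that a High match on a specific building is stronger evidence than a Low guess about the district, so the building wins.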

From 64% to 98%

The improvement has been insane. When we first started this in October, before we tuned our prompts, we had a success rate of about 64%. That meant we were still manually fixing nearly 4 out of every 10 listings. It was better than nothing, but still a headache.

Today, running on Gemini 2.0 with our optimized prompts, our "High Confidence" hit rate is 98%.
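To put those two numbers side by side, here is the back-of-the-envelope arithmetic, assuming every listing that is not mapped automatically ends up in manual review (the `manual_share` helper is just for illustration).

```python
def manual_share(auto_rate_pct: int) -> int:
    """Percentage of listings left for manual review at a given automation rate."""
    return 100 - auto_rate_pct


before = manual_share(64)  # 36% of listings needed a human in October
after = manual_share(98)   # 2% need a human today
print(before, after, before / after)  # 36 2 18.0
```

Under that assumption, the manual workload per listing dropped by a factor of about 18.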

Just this morning, I was looking at the logs for a property in Condominio Bali. The description was messy, mixing up amenities with location data. Gemini parsed it, matched "Condominio Bali" to our database ID (bS9RPYgb8MlySQxt01Ii), identified the neighborhood as "Guachipelín de Escazú," and updated the record automatically.
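Matching a name from a description to a database record usually needs some normalization first, since agents write "QBO" where the database says "Q-BO". The sketch below is one simple way to do it, not necessarily how our matcher works: `normalize`, `find_condo_id`, the `GENERIC_WORDS` list, and the `CONDO_IDS` table are hypothetical, and only the Condominio Bali ID comes from our logs.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical list of generic words that carry no identifying information.
GENERIC_WORDS = {"edificio", "condominio", "condo", "residencial"}


def normalize(name: str) -> str:
    """Lowercase, strip accents and hyphens, and drop generic words."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    joined = without_accents.lower().replace("-", "")  # "Q-BO" -> "qbo"
    words = re.findall(r"[a-z0-9]+", joined)
    return " ".join(word for word in words if word not in GENERIC_WORDS)


# Hypothetical lookup table: normalized condo name -> database ID.
CONDO_IDS = {normalize("Condominio Bali"): "bS9RPYgb8MlySQxt01Ii"}


def find_condo_id(name: str) -> Optional[str]:
    """Return the database ID for a condo name, or None if it is unknown."""
    return CONDO_IDS.get(normalize(name))


print(find_condo_id("condominio BALI"))
print(normalize("Edificio Q-BO Skyhomes") == normalize("QBO Skyhomes"))
```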

It’s satisfying to watch the logs roll by and see the system cleaning up the mess around the clock, checking for updates every 48 hours. It feels like we finally taught the machine how to read a Tico address.
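For the curious, a 48-hour recheck rule can be as small as the sketch below. The `is_due` helper and the `RECHECK_INTERVAL` constant are hypothetical, and tracking a per-listing "last checked" timestamp is an assumption about how such a schedule might be stored.

```python
from datetime import datetime, timedelta, timezone

RECHECK_INTERVAL = timedelta(hours=48)


def is_due(last_checked: datetime, now: datetime) -> bool:
    """True when at least 48 hours have passed since the last check."""
    return now - last_checked >= RECHECK_INTERVAL


last_checked = datetime(2025, 12, 1, 8, 0, tzinfo=timezone.utc)
print(is_due(last_checked, datetime(2025, 12, 2, 8, 0, tzinfo=timezone.utc)))
print(is_due(last_checked, datetime(2025, 12, 3, 8, 0, tzinfo=timezone.utc)))
```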

Want to see it in action? Head to www.propiedades.cr and search for properties using the search bar or our AI assistant.