Building Smarter 311 Systems with Vision Models

Tags: Government, AI, LMM

Author: Hersh Gupta

Published: December 14, 2024

Every day, residents of any given city might submit hundreds of 311 requests for everything from broken sidewalks to illegal parking. These systems are how a city receives and responds to service requests.

The process seems simple: a resident reports an issue, and the appropriate department responds. But behind the scenes, 311 employees have to constantly recategorize misclassified requests and merge duplicate reports about the same problem.

A pothole might be reported as “street damage” by one resident and “road hazard” by another, requiring manual intervention to connect these related cases.

Artificial intelligence—specifically large multimodal models (LMMs)—could transform 311 operations.

Why use AI?

It’s not just about technological efficiency—it’s about creating tangible improvements in city services. When a broken traffic signal is reported, accurate categorization means faster response times. When multiple residents report the same fallen tree, automatic merging ensures resources aren’t split across duplicate tickets.

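Implementing LMMs in 311 systems would work like a phone’s autocorrect, but for service requests. As users submit requests, it would analyze the text and any attached photo, suggest a clearer description, and recommend the best-fitting service category.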

Ideally, users would be able to verify the suggestions before submitting their requests, so there’s less danger of AI slop ending up in 311 systems.

How does this work?

A big problem with integrating AI systems is getting consistent outputs. That’s why the folks at StanfordNLP have built DSPy – a framework for writing programs for language model workflows.

Using DSPy, we can create a reliable pipeline for processing 311 requests that takes the request description text and images, and generates an updated request:

original_request: list -> updated_request: list

I’ve created an example implementation that demonstrates these concepts, available in the Smart 311 System repository. The repository includes both a simple standalone example and a more complete implementation showing how to structure a production-ready system.

Here’s how it works in practice. First, we configure the LMM – here, I’m using a local instance of LLaVA, an open-source model that can understand both text and images:

import dspy
from typing import Literal

# Configure our local LLaVA model
lm = dspy.LM("ollama/llava:7b", api_base="http://127.0.0.1:11434/", api_key="")
dspy.configure(lm=lm)

Next, we define all possible service categories. This ensures our system can only assign requests to real departments:

ServiceNameLiteral = Literal[
    "Outdoor Dining", "Streetlights", "Litter",
    "Broken Park Equipment", "Damaged Sign",
    # ... and so on for all services
]
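One way to enforce that constraint outside the model is to check any predicted category against the values in the `Literal`. The sketch below is a minimal illustration, not the repository's code: `validate_category` and its `None` fallback are assumptions, and the category list is shortened to the five names shown above.

```python
from typing import Literal, Optional, get_args

# Shortened list; the real one would enumerate every service
ServiceNameLiteral = Literal[
    "Outdoor Dining", "Streetlights", "Litter",
    "Broken Park Equipment", "Damaged Sign",
]

# get_args unpacks the Literal into a tuple of its allowed strings
ALLOWED_SERVICES = frozenset(get_args(ServiceNameLiteral))


def validate_category(predicted: str) -> Optional[str]:
    """Return the category if it is a known service, else None (hypothetical helper)."""
    cleaned = predicted.strip()
    return cleaned if cleaned in ALLOWED_SERVICES else None


valid = validate_category(" Litter ")
invalid = validate_category("Pothole Wizardry")
```

Here `valid` holds the cleaned category name, while `invalid` is `None` and could be sent back to the resident or to staff to pick a category by hand.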

The heart of the system is its understanding of how to process requests. We provide clear instructions through a system prompt:

SYSTEM_PROMPT = """You are a municipal services assistant. Analyze the request
description and image to create a detailed description for city workers.
GUIDELINES:
- Use professional, factual language
- Include details from both text and images
- Remove subjective or emotional language
- Focus on observable facts and specific details
- Include measurements or quantities when visible
- Note any accessibility or safety impacts
"""

Now for the clever part. We create a Signature that tells DSPy exactly what we expect as input and output:

class EnhancerSignature(dspy.Signature):
    """Improve the description based on the original description and image."""
    # Type annotations tell DSPy how to handle each field
    image: dspy.Image = dspy.InputField(desc="The image showing the issue")
    original_description: str = dspy.InputField(desc="The original description text")
    enhanced_description: str = dspy.OutputField(desc="Enhanced actionable description")
    # Constrain the category to the allowed service names defined earlier
    recommended_service_category: ServiceNameLiteral = dspy.OutputField(desc="Recommended service category")

Finally, we can process actual 311 requests:

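# Initialize our predictor
predictor = dspy.Predict(EnhancerSignature)

# Process a new request; from_url expects an image URL
# (dspy.Image.from_file is the counterpart for a local path)
image = dspy.Image.from_url("path_to_image.jpg")
description = "Garbage on sidewalk"
result = predictor(
    image=image,
    original_description=description
)

# The output fields are available as attributes on the result
print(result.enhanced_description)
print(result.recommended_service_category)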

When a resident submits a request about garbage on the sidewalk and includes a photo, our system:

  1. Analyzes both the text description and the image
  2. Creates a detailed, professional description for city workers
  3. Automatically selects the correct service category
  4. Returns all this information in a structured format
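To make step 4 concrete, here is a minimal sketch of packaging the predictor's two output fields into a structured record. `EnhancedRequest` and `to_record` are hypothetical names, and a `SimpleNamespace` stands in for the DSPy prediction so the example runs without a model.

```python
import json
from dataclasses import asdict, dataclass
from types import SimpleNamespace


@dataclass
class EnhancedRequest:
    original_description: str
    enhanced_description: str
    recommended_service_category: str


def to_record(original: str, prediction) -> EnhancedRequest:
    """Copy the signature's output fields off a prediction-like object."""
    return EnhancedRequest(
        original_description=original,
        enhanced_description=prediction.enhanced_description,
        recommended_service_category=prediction.recommended_service_category,
    )


# Stand-in for the object returned by predictor(...); values are illustrative
fake_prediction = SimpleNamespace(
    enhanced_description="Loose trash and debris scattered on the sidewalk.",
    recommended_service_category="Litter",
)
record = to_record("Garbage on sidewalk", fake_prediction)
payload = json.dumps(asdict(record), indent=2)
```

Keeping the original description next to the enhanced one means staff can always see what the resident actually wrote.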

The full implementation, along with example code and detailed documentation, is available in the Smart 311 System repository.

What’s great about DSPy is its flexibility and consistency. It’s able to handle complex prompt engineering and chain-of-thought reasoning behind the scenes, letting us build a reliable system for 311 users.

A brief example

To demonstrate this concept, we can use a real-world example. Here’s a request from my city’s 311 system: https://311.boston.gov/tickets/101005820254

This is what the unformatted request looks like (i.e. what is received before it turns into a nice webpage):

[
  {
    "service_request_id": "101005820254",
    "service_name": "Requests for Street Cleaning",
    "service_code": "Public Works Department:Street Cleaning:Requests for Street Cleaning",
    "description": "Garbage on sidewalk",
    "address": "19 Worcester St, Roxbury, Ma, 02118",
    "media_url": "https://spot-boston-res.cloudinary.com/image/upload/v1734138056/boston/production/x1vhcizd7nvzb7xdzkfe.jpg#spot=e8762677-4564-41c3-b58b-3ddd206801cb"
  }
]
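Before the request reaches the model, the text and image URL have to be pulled out of that payload. The following is a minimal sketch using only field names visible in the JSON above; `extract_inputs` is a hypothetical helper, and dropping the `#spot=...` fragment is an assumption about what an image fetcher would want.

```python
import json
from urllib.parse import urldefrag

RAW_REQUEST = """
[
  {
    "service_request_id": "101005820254",
    "service_name": "Requests for Street Cleaning",
    "description": "Garbage on sidewalk",
    "address": "19 Worcester St, Roxbury, Ma, 02118",
    "media_url": "https://spot-boston-res.cloudinary.com/image/upload/v1734138056/boston/production/x1vhcizd7nvzb7xdzkfe.jpg#spot=e8762677-4564-41c3-b58b-3ddd206801cb"
  }
]
"""


def extract_inputs(raw: str) -> list[dict]:
    """Pull out the fields the model needs from each request in the payload."""
    inputs = []
    for request in json.loads(raw):
        media_url = request.get("media_url")
        # urldefrag splits "url#fragment" into (url, fragment)
        image_url = urldefrag(media_url).url if media_url else None
        inputs.append({
            "service_request_id": request["service_request_id"],
            "description": request.get("description", ""),
            "image_url": image_url,
        })
    return inputs


model_inputs = extract_inputs(RAW_REQUEST)
```

Requests without a photo come through with `image_url` set to `None`, so a caller can decide whether to fall back to a text-only path.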

The model workflow takes that request, considers the image and text, and returns the following:

{
  "original_service": "Requests for Street Cleaning",
  "recommended_service": "Requests for Street Cleaning",
  "updated_description": "The area is littered with trash and debris. A blue box is present on the sidewalk by 19 Worcester St, Roxbury, Ma, 02118, which needs to be removed",
  "emergency": false,
  "image_verified": true,
  "confidence": 0.85,
  "rationale": "The image shows a sidewalk with litter scattered around, which is consistent with the current service classification of \"Requests for Street Cleaning\". The presence of trash on the sidewalk indicates that it requires attention from the city's street cleaning services to maintain cleanliness and hygiene in public spaces.",
  "service_request_id": "101005820254",
  "address": "19 Worcester St, Roxbury, Ma, 02118"
}

It verified the correct classification, updated the description, and correctly identified that the issue was not an emergency. It also provided a detailed rationale and rated its confidence in the updated classification and description.
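Fields like `confidence` and `emergency` make it possible to decide what happens next with a result. The sketch below is one hedged way to do that; the `route` function, the 0.8 threshold, and the three outcome labels are assumptions for illustration, not part of the repository.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; a city would tune this


def route(result: dict) -> str:
    """Decide how to handle a model result (hypothetical policy)."""
    if result.get("emergency"):
        return "escalate"          # always put emergencies in front of a person
    confident = result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD
    if confident and result.get("image_verified"):
        return "suggest_to_user"   # offer the update for the resident to confirm
    return "staff_review"          # low confidence or unverified image


result = {
    "original_service": "Requests for Street Cleaning",
    "recommended_service": "Requests for Street Cleaning",
    "emergency": False,
    "image_verified": True,
    "confidence": 0.85,
}
decision = route(result)
```

With the example output above, the result clears the assumed threshold and would be offered to the resident for confirmation. Nothing is applied automatically.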

Here’s what to consider

As with any AI-powered technology, cities need to approach this carefully. Public trust is essential, and any system that uses AI needs to be accurate and transparent. Bias audits and performance reviews should be part of any implementation plan.

Also, while compute costs continue to drop, cities should weigh the benefits of implementing something like this against waiting for vendors to integrate these features into standard 311 offerings.

The goal, of course, isn’t to replace human judgment but to augment it, allowing 311 staff to focus on what matters most: responding to requests from residents.

By weighing the tradeoffs of AI thoughtfully, cities can build systems that are more efficient and responsive to community needs.