AI in Law Enforcement: What Crime Data APIs Can Actually Support

“AI in law enforcement” is doing a lot of work as a phrase. It covers facial recognition, automated license plate readers, gunshot detection, predictive deployment, report drafting, and chatbots that answer records requests — technologies with wildly different evidence bases and risk profiles. This post narrows the question to one thing developers can reason about precisely: what an AI system built on top of a descriptive crime data API can and cannot support. The answer is more modest than the marketing, and the boundary is worth drawing carefully.

Start with what the data actually is

A descriptive crime data API returns a record of reported incidents: an offense category, a timestamp, a geocoded location (usually block-level rather than exact address), and sometimes a disposition. It is a log of what was reported, normalized across jurisdictions. It is not a measure of all crime that occurred, not a prediction, and not a risk score until someone deliberately computes one. That distinction — descriptive versus predictive — is the single most important thing to hold onto when evaluating any AI layer placed on top, and we have written about it at length elsewhere.

The reported-versus-occurred gap is not a footnote. Reporting rates vary by offense type, by neighborhood, by trust in police, and over time. The FBI's own transition to the National Incident-Based Reporting System (NIBRS) left coverage gaps that, at the 2021 changeover, affected a large share of the US population as agencies dropped in and out of reporting. Any model trained on reported-incident data inherits every one of these biases. An AI system does not see crime; it sees the paperwork that crime generated, filtered through who chose to call and which agency chose to publish.

The national baseline, for grounding

Before discussing what AI changes, it helps to fix the numbers AI would be operating on. According to USAFacts' compilation of FBI Uniform Crime Reporting data, the US violent crime rate in 2024 was approximately 359 per 100,000 residents and the property crime rate roughly 1,760 per 100,000 — year-over-year declines of about 5.4% and 9% respectively. Measured against 2001, overall crime is down on the order of 49%. State-level dispersion is wide: Alaska posted the highest violent crime rate (around 724 per 100,000) while Maine posted the lowest (around 100 per 100,000).

Why the baseline matters for AI

When base rates are low and falling, the prior probability that any given alert reflects a real, actionable event is low, and even a classifier with a small false-positive rate will produce far more false alerts than true ones. This is base-rate arithmetic, not pessimism. An AI alert layer built on 2024-era data is, by construction, hunting for rare events.
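The arithmetic is short enough to write down. A minimal sketch using Bayes' rule; the prevalence, sensitivity, and false-positive rate below are illustrative assumptions, not measurements of any real alerting product:

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              false_positive_rate: float) -> float:
    """P(event is real | alert fired), computed with Bayes' rule."""
    true_alerts = prevalence * sensitivity
    false_alerts = (1.0 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Assumed, illustrative inputs: 1 in 1,000 monitored signals is a real event,
# the classifier catches 95% of real events, and it misfires on 1% of non-events.
ppv = positive_predictive_value(prevalence=0.001,
                                sensitivity=0.95,
                                false_positive_rate=0.01)
false_per_true = (1.0 - ppv) / ppv

print(f"Share of alerts that are real: {ppv:.3f}")
print(f"False alerts per true alert:   {false_per_true:.1f}")
```

Under those assumed inputs, fewer than one alert in ten is real, and roughly ten false alerts fire for every true one, even though the classifier looks accurate on paper.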

Where AI plus crime data genuinely helps

None of the above means AI is useless here. There are tasks where a language model or a conventional statistical model, fed clean descriptive data, does real work. The common thread is that these tasks summarize or organize the past rather than assert the future.

Retrospective analysis and reporting. Producing a weekly CompStat-style summary (counts by category, by beat, by time-of-day, with this-week-versus-last-year comparisons) is mechanical, repetitive, and exactly the kind of thing that should be automated. The crime analyst Andy Wheeler's walkthrough at crimede-coder.com shows how far a few hundred lines of pandas, matplotlib, and SQL go toward replacing manual report assembly. An AI layer on top of an API can draft the narrative around those tables. The numbers come from deterministic queries; the model writes the prose. That division of labor is the safe one.
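A minimal sketch of the deterministic half of that split, using only the Python standard library rather than pandas. The record shape (category, date) and the sample incidents are hypothetical stand-ins for whatever a crime data API actually returns:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical records; a real feed would carry more fields (beat, time, etc.).
incidents = [
    {"category": "burglary", "date": date(2025, 3, 4)},
    {"category": "burglary", "date": date(2025, 3, 6)},
    {"category": "theft", "date": date(2025, 3, 5)},
    {"category": "burglary", "date": date(2024, 3, 5)},
    {"category": "theft", "date": date(2024, 3, 6)},
    {"category": "theft", "date": date(2024, 3, 7)},
]


def counts_in_window(records, start, days=7):
    """Count incidents per category with start <= date < start + days."""
    end = start + timedelta(days=days)
    return Counter(r["category"] for r in records if start <= r["date"] < end)


def compare_weeks(records, this_start, prior_start):
    """Build a this-week-versus-prior-year table, one row per category."""
    current = counts_in_window(records, this_start)
    prior = counts_in_window(records, prior_start)
    return [
        {
            "category": category,
            "current": current[category],
            "prior": prior[category],
            "change": current[category] - prior[category],
        }
        for category in sorted(set(current) | set(prior))
    ]


table = compare_weeks(incidents, date(2025, 3, 3), date(2024, 3, 4))
for row in table:
    print(row)
```

The table is what gets handed to the model. The model is asked to describe these rows, never to produce them.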

Geographic and temporal pattern surfacing. Identifying that residential burglaries cluster in a particular quarter-mile corridor on weekday afternoons is descriptive statistics, and a model can surface it faster than a human paging through spreadsheets. The honest framing is “here is where reported incidents concentrated last quarter,” not “here is where crime will happen next week.” The same hotspot math can be done well or misleadingly, and the choice of kernel and bin size matters more than the model sitting on top.
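To see how much the bin size alone can move a "hotspot," here is a small sketch using simple grid counts instead of a kernel. It assumes incident coordinates are already projected to a planar grid in meters, and the points are invented for illustration:

```python
from collections import Counter

# Invented points: four incidents straddling x = 100, three tightly grouped near (410, 410).
points = [
    (95, 50), (98, 50), (102, 50), (105, 50),
    (400, 400), (410, 410), (420, 420),
]


def grid_counts(pts, cell_size):
    """Count points per square grid cell of the given size."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in pts)


def top_cell(pts, cell_size):
    """Return (cell, count) for the busiest grid cell."""
    return grid_counts(pts, cell_size).most_common(1)[0]


for size in (50, 200):
    cell, count = top_cell(points, size)
    print(f"cell size {size} m: busiest cell {cell} with {count} incidents")
```

With 50-meter cells the four-incident group is split across a cell boundary and the three-incident group looks like the hotspot; with 200-meter cells the ranking flips. Same data, different map.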

Records-request triage and search. Routing, de-duplicating, and summarizing public records requests, or letting a resident ask “what was reported on my block last month” in natural language, is a retrieval problem with a ground-truth dataset behind it. The risk is bounded because the model is fetching records that exist rather than generating claims about events that might.

Where it gets dangerous: prediction and alerting

The trouble starts the moment an AI system is asked to move from “what was reported” to “what is happening now” or “what will happen next.” Two failure modes deserve specific attention.

Real-time alerting optimizes for the wrong thing. On April 13, 2026, the AI-driven alert product CrimeRadar pushed an active-shooter notification to parents in Mount Vernon, Missouri, after misinterpreting a routine police radio transmission. No shooting occurred; a school went into a false lockdown. As the write-up at oldmantrench.com put it: “Systems like this are built to be fast. Safety requires being right. Those two things are not the same.” We unpacked the asymmetric cost of false positives in a dedicated post. The short version: for a small set of incident types, the cost of a false alarm is not symmetric with the cost of a miss, and a system tuned for latency will generate the expensive kind of error.

LLM confidence scores are not the probabilities they look like. A subtle but consequential problem is that when a language model attaches a confidence to a classification — “87% likely a robbery” — that number is usually not calibrated. In a careful empirical study using NEISS injury narratives, Gio Circo found that LLM-derived confidence scores and raw token probabilities are systematically overconfident: a model claiming 90% certainty is correct considerably less than 90% of the time. For a retrospective summary this is a nuisance. For a deployment or alerting decision — where a confidence threshold gates an action with real-world consequences — it is a structural hazard. You cannot threshold safely on a number that does not mean what it appears to mean.

The calibration test every deployment should pass: take the events your model labeled “80% confident” over a historical window and check whether roughly 80% of them were actually true. If they were not — and they usually are not, out of the box — every downstream threshold built on that score is mis-set. Calibration is measurable. Measure it before you ship, and re-measure it on a schedule, because it drifts.
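That check can be sketched in a few lines. The version below bins predictions by stated confidence and compares each bin's mean confidence with its observed accuracy; the sample predictions are made up for illustration, and the equal-width binning is one common choice rather than the only one:

```python
def calibration_table(predictions, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in predictions:
        idx = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((confidence, correct))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "n": len(members),
            "mean_confidence": mean_conf,
            "accuracy": accuracy,
            "gap": mean_conf - accuracy,  # positive gap means overconfident
        })
    return rows


def expected_calibration_error(rows):
    """Size-weighted average of the absolute confidence/accuracy gap."""
    total = sum(r["n"] for r in rows)
    return sum(r["n"] * abs(r["gap"]) for r in rows) / total


# Made-up historical window: ten labels at 0.9 confidence (six correct),
# ten labels at 0.5 confidence (five correct).
history = [(0.9, True)] * 6 + [(0.9, False)] * 4 + [(0.5, True)] * 5 + [(0.5, False)] * 5
rows = calibration_table(history)
ece = expected_calibration_error(rows)
for r in rows:
    print(r["bin"], r["n"], round(r["mean_confidence"], 2), round(r["accuracy"], 2))
print("expected calibration error:", round(ece, 3))
```

In this invented window the "90% confident" labels were right 60% of the time, so any action gated at 0.9 would fire far more often than the number implies it should.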

Resource deployment: a harder case than it looks

“Predictive deployment,” meaning the use of historical incident data to suggest where patrol resources should go, sits in an uncomfortable middle. The math is legitimate: places with more reported incidents last month tend to have more next month, because both reflect stable underlying conditions. But the feedback loop is the problem. Sending more officers to an area generates more stops, more reports, and more recorded incidents, which the model then reads as confirmation, which justifies more deployment. The data stops being a measurement of crime and becomes a record of where police were sent. A descriptive API cannot, by itself, tell you whether an uptick reflects more crime or more enforcement against the same amount of crime, and any AI layer that ignores that distinction will launder a policing decision into the appearance of an objective forecast.
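A toy simulation makes the loop concrete. This is a deliberately simplified, deterministic sketch, not a model of any real deployment system: two areas with identical underlying incident counts, one extra patrol sent each period to whichever area has more recorded incidents so far, and an assumed bump in the recording rate wherever the patrol goes. Every constant is an assumption chosen for illustration.

```python
def simulate_feedback(periods=10):
    """Return (patrolled_area, area A's share of recorded incidents) per period."""
    true_incidents = {"A": 100, "B": 100}  # identical underlying conditions (assumed)
    recorded_total = {"A": 51, "B": 49}    # small initial difference in the record (assumed)
    base_recording = 0.5                   # share recorded without extra patrol (assumed)
    patrol_boost = 0.3                     # extra share recorded where patrol is sent (assumed)

    history = []
    for _ in range(periods):
        # The "model": send the patrol where the record shows more incidents.
        target = max(recorded_total, key=recorded_total.get)
        for area, count in true_incidents.items():
            rate = base_recording + (patrol_boost if area == target else 0.0)
            recorded_total[area] += count * rate
        share_a = recorded_total["A"] / sum(recorded_total.values())
        history.append((target, round(share_a, 3)))
    return history


for period, (target, share_a) in enumerate(simulate_feedback(), start=1):
    print(f"period {period}: patrol sent to {target}, A's share of records = {share_a}")
```

Area A starts with 51% of the recorded incidents and ends with about 61%, while the underlying incident counts never differ. The record is measuring the patrol schedule.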

This is not an argument that deployment analytics should never exist. It is an argument that the honest version requires holding the reported-versus-occurred gap explicitly in the model, treating the output as a hypothesis to be checked against independent indicators, and resisting the temptation to describe a deployment recommendation as a crime prediction. The two are not the same claim.

A practical division of labor for developers

If you are building an AI feature on top of a crime data API, whether for an agency, a corporate security team, or a consumer product, the durable rule is to let deterministic code own the numbers and let the model own the language. Concretely:

Compute counts, rates, comparisons, and geographic aggregations with explicit queries against the API, not with the model. A SQL GROUP BY does not hallucinate; a model summarizing a table sometimes does.
Use the model for narrative, retrieval, summarization, and natural-language interfaces — tasks where it is recombining text it was given rather than asserting a quantity.
Treat any model-emitted probability as uncalibrated until you have checked it against held-out historical outcomes, and re-check on a cadence.
For anything that triggers a real-world action — an alert, a dispatch, a lockdown — require a human decision-maker between the model and the action, and design for the false-positive cost, not just the average case.
Always carry the metadata: what jurisdiction, what reporting period, what known coverage gaps. The Real-Time Crime Index and the FBI UCR measure different things on different clocks; a model that blends them without tracking provenance produces confident nonsense.
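One way to make those rules hard to skip is to encode them in the types. The sketch below is hypothetical throughout: the class names, fields, sample figures, and prompt wording are illustrative assumptions, not the interface of any particular API or model provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    source: str
    jurisdiction: str
    period: str
    known_gaps: str


@dataclass(frozen=True)
class Stat:
    name: str
    value: float
    provenance: Provenance


def combine(stats):
    """Sum figures only when they share one provenance; otherwise refuse."""
    if len({s.provenance for s in stats}) != 1:
        raise ValueError("refusing to blend figures with different provenance")
    return Stat("+".join(s.name for s in stats),
                sum(s.value for s in stats),
                stats[0].provenance)


def build_prompt(stats):
    """Hand the model finished numbers with provenance; ask for prose only."""
    lines = [
        f"- {s.name}: {s.value} ({s.provenance.source}, {s.provenance.jurisdiction}, "
        f"{s.provenance.period}; known gaps: {s.provenance.known_gaps})"
        for s in stats
    ]
    return ("Summarize using only these figures. Do not compute new numbers.\n"
            + "\n".join(lines))


def release_action(proposed_action: str, approved_by: Optional[str]) -> Optional[str]:
    """A model proposal becomes an action only with a named human approver."""
    return proposed_action if approved_by else None


# Invented example figures.
city_feed = Provenance("city open data", "Example City", "2025-03", "two precincts late")
national = Provenance("national annual release", "Example City", "2024", "agency coverage varies")
burglary = Stat("burglary", 42, city_feed)
theft = Stat("theft", 118, city_feed)

print(build_prompt([burglary, theft, combine([burglary, theft])]))
print(release_action("send neighborhood alert", approved_by=None))
```

Passing a `Stat` tagged with `national` alongside one tagged with `city_feed` into `combine` raises an error instead of quietly producing a blended figure, which is the point: the failure is loud and happens before the model ever sees the number.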

The boundary, stated plainly

AI built on descriptive crime data is good at compressing and communicating what was reported. It is bad — and dangerous in proportion to the stakes — at asserting what is happening or what will happen, because the underlying data does not contain that information and the models tend to overstate their own certainty about it. The most useful systems we have seen draw that line explicitly: deterministic analytics for the facts, language models for the framing, a human for the decision, and an honest accounting of what the data can and cannot say. That is a less thrilling product than the one in the pitch deck. It is the one that does not put a school into a false lockdown.

The quality of any of this is capped by the quality of the underlying feed. A model reasoning over inconsistent categories, stale timestamps, or ZIP-centroid geocoding will produce polished, confident, wrong output. Clean, normalized, address-level data with documented provenance is not a luxury for AI applications — it is the precondition for them being safe to deploy at all.