How SpotScore™ Is Calculated: A Methodological Walkthrough of Block-Level Safety Ratings

A safety score is a compressed claim about a place. Compress badly and the number is worse than no number at all — it produces confident decisions on a thin signal. This is a walkthrough of what goes into a SpotScore™, what does not, and the methodological choices that determine whether a single digit between 1 and 10 carries any information at all.

Why a single number, and why be skeptical of it

Single-number safety scores exist because the consumer of a real estate page, a family-safety app, or a duty-of-care dashboard does not want to read a 90-day incident log. They want a comparison: is this block safer than the one three streets over? A score serves that comparison only if the underlying methodology is defensible at the block level and stable across geographies.

The 2024 national numbers set the context. According to USAFacts, the US violent crime rate was 359 per 100,000 residents and the property crime rate was 1,760 per 100,000 — year-over-year declines of 5.4% and 9% respectively (usafacts.org). The Real-Time Crime Index has tracked further declines into early 2026. National averages, however, obscure the block-level variance that determines whether a score is useful: Alaska’s violent crime rate (724 per 100,000) is more than seven times Maine’s (100 per 100,000), and the within-city variance is larger still. A score that does not respect that variance is a score that flattens information.

The input layer: what an incident actually is

A SpotScore starts with verified incident records sourced directly from the originating police department, sheriff’s office, or state agency — not scraped from aggregator feeds. Each record carries, at minimum, an incident type, an event date, and a geocoded location. Most jurisdictions also expose a report date, a status field (reported, founded, cleared), and a textual narrative or supplemental classifications.

The first methodological choice is which fields to trust. The event date is the date the incident is alleged to have occurred; the report date is when the agency entered it. The gap between the two can range from minutes to months, and the distribution of that gap differs by crime type: property crime is reported later than assault, and sexual offenses later still. SpotScore uses event date wherever available, falling back to report date with a flag, because temporal decay (discussed below) is the dominant time-axis effect. Applying decay to a lagged report date makes an old incident look recent, which would systematically overstate current risk in places with administrative lag.
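A minimal sketch of the event-date-first rule, assuming a simplified record shape. The class and field names here are illustrative and are not taken from any agency feed or from the SpotScore API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class IncidentRecord:
    # Illustrative fields only; real agency records carry more.
    incident_type: str
    report_date: date
    event_date: Optional[date] = None


def effective_date(record: IncidentRecord) -> Tuple[date, bool]:
    """Return (date to use for decay, used_fallback flag)."""
    if record.event_date is not None:
        return record.event_date, False
    # No event date: fall back to the report date and flag it, so
    # downstream steps know the timestamp may lag the actual event.
    return record.report_date, True
```

Carrying the flag alongside the date lets later stages treat fallback timestamps with less confidence rather than discarding the record.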

The taxonomy problem

US jurisdictions do not classify crime the same way. Some still report under SRS (Summary Reporting System). Most have transitioned to NIBRS (National Incident-Based Reporting System), which captures more detail per incident but reorganizes Part I categories. City-level offense codes vary further — the LAPD’s offense code list is not the Chicago Police Department’s, which is not the Houston Police Department’s.

Before any score can be calculated, every incident has to be mapped to a normalized internal taxonomy. SpotCrime maintains a mapping layer that resolves agency-specific codes to a stable set of categories — assault, robbery, burglary, theft, motor vehicle theft, weapon offense, vandalism, drug offense, public order, and others. The mapping is reviewed when an agency changes its code list, which happens more often than the public-facing documentation acknowledges.

The taxonomy choice matters because severity weighting (next) operates on the normalized category, not on the raw agency code. If a jurisdiction classifies “aggravated assault with a weapon” under a code another jurisdiction uses for “simple assault,” an un-normalized score will rank them differently for the same underlying event. The mapping layer is the unglamorous infrastructure that determines whether two scores in two cities mean the same thing.
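In outline, the mapping layer behaves like a lookup keyed on agency and raw code. The sketch below assumes that shape; the agency names and offense codes are invented for illustration and do not correspond to any real department's code list.

```python
from typing import Dict, Tuple

# Hypothetical (agency, raw code) pairs, invented for this example.
CODE_MAP: Dict[Tuple[str, str], str] = {
    ("agency_a", "230"): "assault",
    ("agency_b", "04A"): "assault",
    ("agency_a", "310"): "burglary",
    ("agency_b", "05"): "burglary",
}


def normalize(agency: str, raw_code: str) -> str:
    """Map an agency-specific offense code to a normalized category.

    Unknown codes return "unmapped" so they can be routed to review
    instead of being silently dropped or guessed.
    """
    key = (agency, raw_code.strip().upper())
    return CODE_MAP.get(key, "unmapped")
```

Returning an explicit "unmapped" value is one way to notice when an agency has changed its code list.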

Severity weighting

Not every incident contributes equally to a safety score, and the question of how much weight to assign each category is the methodological decision most exposed to legitimate criticism. There are three common approaches.

Equal weighting treats one car break-in and one aggravated assault as equivalent. This is indefensible at the conceptual level but appears in some commercial scores because it is computationally cheap. It produces a score dominated by property crime in volume-dense neighborhoods, even when those neighborhoods have low violent-crime rates.

Sentencing-derived weighting uses the median or maximum sentence for each offense category as a proxy for severity. This is principled but inherits the distortions of sentencing policy: a category with historically heavy sentencing (drug offenses in some eras) ends up overweighted relative to its actual impact on neighborhood safety.

Public-risk weighting assigns weights based on the risk an incident type poses to a member of the public who lives near the event. Violent offenses against strangers receive the highest weights. Violent offenses where the victim and offender were known to each other receive intermediate weights, because the risk to a neighbor is lower, though not absent, a distinction the criminology literature recognizes. Property offenses are weighted by the likelihood of confrontation: a burglary of an unoccupied home at 2 p.m. carries less neighborhood-safety weight than a residential burglary at 2 a.m. with occupants home, even though both are classified as burglary.
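To make the public-risk idea concrete, here is a small sketch. Every number in it is a placeholder chosen for illustration, not a published coefficient, and the occupants_present field is a hypothetical stand-in for whatever confrontation signal a record actually carries.

```python
from dataclasses import dataclass

# Placeholder weights for illustration only; not actual coefficients.
BASE_WEIGHT = {
    "violent_stranger": 10.0,
    "violent_known": 6.0,
    "burglary": 3.0,
    "theft": 1.5,
    "vandalism": 0.5,
}


@dataclass
class Incident:
    category: str
    occupants_present: bool = False  # hypothetical confrontation indicator


def public_risk_weight(incident: Incident) -> float:
    """Weight an incident by the risk it poses to a nearby resident."""
    weight = BASE_WEIGHT.get(incident.category, 1.0)
    # Assumed adjustment: a burglary with occupants home has a higher
    # chance of confrontation, so it counts for more.
    if incident.category == "burglary" and incident.occupants_present:
        weight *= 2.0
    return weight
```

Under equal weighting every call would return the same value; here two incidents with the same category label can still contribute differently.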

SpotScore uses a public-risk-weighted scheme. The exact coefficients are tuned against rank-order stability checks: a score that ranks Block A above Block B today should not flip the ranking next week purely because of a coefficient choice. The tuning is conservative — categories with the most ambiguous public-risk interpretation (vandalism, drug offenses without a violent component) carry low weights so that they cannot single-handedly move a score.
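One way such a rank-order stability check could look is sketched below. It assumes the block total is a plain weighted sum of category counts and tests every combination of plus or minus 10% on each weight; the function names, the tolerance, and the perturbation scheme are assumptions for illustration, not the production tuning procedure.

```python
from itertools import product
from typing import Dict


def weighted_total(counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum of category counts times category weights."""
    return sum(weights[cat] * n for cat, n in counts.items())


def ranking_is_stable(
    block_a: Dict[str, int],
    block_b: Dict[str, int],
    weights: Dict[str, float],
    tolerance: float = 0.10,
) -> bool:
    """True if the A-vs-B ordering survives every +/- tolerance
    perturbation of the weights."""
    cats = sorted(weights)
    baseline = weighted_total(block_a, weights) > weighted_total(block_b, weights)
    # The totals are linear in the weights, so checking the corners of
    # the perturbation box covers the worst case.
    for signs in product((-1, 1), repeat=len(cats)):
        perturbed = {c: weights[c] * (1 + s * tolerance) for c, s in zip(cats, signs)}
        flipped = weighted_total(block_a, perturbed) > weighted_total(block_b, perturbed)
        if flipped != baseline:
            return False
    return True
```

The corner enumeration grows as 2 to the number of categories, which is fine for a dozen categories but would need a smarter bound for a much larger taxonomy.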

Temporal decay

A burglary that happened yesterday is a more relevant signal than a burglary that happened 24 months ago. Both belong in the historical record. The question is how to combine them.

SpotScore applies an exponential decay function with a half-life chosen per category. Violent offenses decay slowest (the half-life is approximately 18 months), reflecting both the longer behavioral signal and the lower volume per block. Property offenses decay faster (approximately 9 months). The decay is per-incident, not per-block, so a block with one recent serious incident and one old one is not treated as identical to a block with two old incidents.

The 36-month window is a hard cap. Incidents older than 36 months are excluded entirely. The cap is not a statement about whether older incidents are relevant — in some block-level contexts, they are. It is a statement about confidence: agency reporting standards, geocoding accuracy, and taxonomy mapping drift over time, and the further back you go, the more the data is comparing today’s definitions to yesterday’s recordkeeping.
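The decay and the cap can be written in a few lines. This sketch uses the approximate half-lives quoted above (18 months for violent offenses, 9 for property) and groups categories into just those two kinds, which is a simplification; the function names are illustrative.

```python
from typing import Iterable

# Approximate half-lives from the text, in months.
HALF_LIFE_MONTHS = {"violent": 18.0, "property": 9.0}
MAX_AGE_MONTHS = 36.0  # hard cap: older incidents are excluded


def decay_weight(age_months: float, kind: str) -> float:
    """Exponential decay: weight halves every half-life."""
    if age_months < 0:
        raise ValueError("age_months must be non-negative")
    if age_months > MAX_AGE_MONTHS:
        return 0.0
    return 0.5 ** (age_months / HALF_LIFE_MONTHS[kind])


def block_signal(ages_months: Iterable[float], kind: str) -> float:
    """Per-incident decay, summed over a block's incidents."""
    return sum(decay_weight(age, kind) for age in ages_months)
```

Because decay is applied per incident, block_signal([1, 30], "violent") comes out larger than block_signal([30, 30], "violent"): one recent incident plus one old one is not treated like two old ones.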

Geographic aggregation: block vs. ZIP vs. census tract

This is where most safety scores fail silently. A ZIP code is, on average, a few square miles. A census tract is smaller but designed for residential population stability, not for crime measurement. A block group is smaller still. A street-level address is smaller again. The choice of unit determines whether two adjacent properties get the same score or different ones, and the answer affects every downstream consumer.

SpotScore is computed at the block level (typically a face block or its equivalent in dense or rural geographies), with smoothing toward an immediate-neighborhood aggregate to handle sparsity. A block with two incidents in the past year is sparse data; treating its rate naively produces wild noise. The smoothing borrows strength from the surrounding eight to twenty-four blocks (depending on density), weighted by distance and population.
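The exact smoothing scheme is not spelled out here, so the following is only a sketch of one common approach: shrink the block's own rate toward a distance- and population-weighted neighborhood rate, with the pull strongest when the block has few residents. The inverse-distance weights and the prior_strength value are assumptions made for illustration.

```python
from typing import List, Tuple

# Each neighbor is (incidents, population, distance_in_blocks).
Neighbor = Tuple[float, float, float]


def neighborhood_rate(neighbors: List[Neighbor]) -> float:
    """Pooled incident rate, with nearer blocks counting for more."""
    num = sum(inc / (1.0 + dist) for inc, _pop, dist in neighbors)
    den = sum(pop / (1.0 + dist) for _inc, pop, dist in neighbors)
    return num / den if den > 0 else 0.0


def smoothed_rate(
    incidents: float,
    population: float,
    neighbors: List[Neighbor],
    prior_strength: float = 200.0,  # assumed; acts like pseudo-residents
) -> float:
    """Shrink the block's raw rate toward the neighborhood rate."""
    prior = neighborhood_rate(neighbors)
    return (incidents + prior_strength * prior) / (population + prior_strength)
```

With these assumptions a 20-resident block with two incidents moves most of the way toward its neighbors' rate, while a 10,000-resident block barely moves.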

The block-vs-ZIP distinction matters in dense urban geographies in particular. Within a single ZIP in any large US city, block-level scores can span the full range. Insurance underwriting that uses a ZIP-level rate is averaging across that range. So is most residential real estate search. The information loss is not theoretical — it is the difference between a buyer who sees an honest block-level score and a buyer who sees a ZIP-averaged number that does not match what they will experience.

Normalization: per capita, per address, per business

Raw incident counts favor low-population blocks, while naive per-capita rates punish them. A block with five residents and one annual incident has one-tenth the raw count of a block with five hundred residents and ten annual incidents, yet its per-capita rate is ten times higher (0.2 versus 0.02 incidents per resident). Neither figure by itself says which block is less safe per night spent there. The choice of denominator is consequential.

For residential-context scores, the denominator is the residential population (from census ACS estimates, the most recent vintage). For mixed-use or commercial contexts, the denominator is residents plus a fractional adjustment for daytime population — commuters, customers, students. The fractional adjustment is imperfect (LEHD data is the standard source, and it has known limitations), but a pure-residential denominator in a downtown commercial district produces obviously wrong scores. Commercial districts are not unsafe; they are populous in ways the residential census does not capture.
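A short sketch of the denominator choice, assuming the daytime adjustment is a single fraction applied to a daytime-visitor estimate. The parameter names and the 0.3 fraction used in the example are hypothetical.

```python
def exposure_population(
    residents: float,
    daytime_visitors: float = 0.0,
    daytime_fraction: float = 0.0,  # assumed share of visitors to count
) -> float:
    """Residents plus a fractional credit for daytime population."""
    return residents + daytime_fraction * daytime_visitors


def rate_per_1000(
    incidents: float,
    residents: float,
    daytime_visitors: float = 0.0,
    daytime_fraction: float = 0.0,
) -> float:
    """Incident rate per 1,000 people of exposure population."""
    population = exposure_population(residents, daytime_visitors, daytime_fraction)
    if population <= 0:
        raise ValueError("exposure population must be positive")
    return 1000.0 * incidents / population
```

For a downtown block with 50 incidents, 200 residents, and 5,000 daytime visitors, a residents-only denominator gives 250 per 1,000, while counting 30% of visitors gives roughly 29 per 1,000.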

The calibration problem

Any score that compresses into a 1–10 range has to confront the question of what the digits mean. A 5 could mean “average for this metro,” or “average for the national distribution,” or “the underlying score falls in this percentile band,” and these are not the same thing.

Calibration here means the same problem identified by Gio Circo in the AI-classifier context: a model can be accurate on average and still be wrong in its confidence on individual cases (gmcirco.github.io). For a safety score, the analogous failure mode is that scores cluster too tightly around the metro mean, so the difference between a 6 and a 7 carries no real information, or scores are too spread, so the difference between a 6 and a 7 implies a larger risk gap than the underlying data supports.

SpotScore is calibrated against a metro-relative distribution and then anchored to a national reference scale, so that a 5 in Cleveland and a 5 in Tucson reflect comparable percentile positions within each metro, while a 1 anywhere is reliably worse than a 9 anywhere else. The calibration is checked quarterly against fresh data, and the recalibration log is available on request for enterprise consumers. The score is not stable to the third decimal — it should not be. It is stable to the digit, which is the resolution at which it is consumed.
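The metro-relative step can be sketched as a percentile lookup that is then cut into ten bands. This example assumes equal-width decile bands and a higher-is-safer digit, and it leaves out the national anchoring step entirely; the function name and inputs are illustrative.

```python
from bisect import bisect_left
from typing import Sequence


def score_digit(risk: float, metro_risks: Sequence[float]) -> int:
    """Map a block's risk value to a 1-10 digit within its metro.

    10 means the lowest-risk decile of the metro, 1 the highest.
    """
    if not metro_risks:
        raise ValueError("metro_risks must not be empty")
    ordered = sorted(metro_risks)
    # Number of metro blocks with strictly lower risk than this one.
    rank = bisect_left(ordered, risk)
    # Integer arithmetic avoids floating-point edge cases at band edges.
    decile = min(9, rank * 10 // len(ordered))
    return 10 - decile
```

Equal-width bands are the simplest choice; a calibration that wants a 6 and a 7 to reflect comparable risk gaps might instead place the band edges on the risk scale rather than the rank scale.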

What developers should display, what they should not

A score is not a forecast. It summarizes the recent past, weighted and decayed and normalized, and presents a comparative position. Developers consuming the API should display it with that framing — not as a prediction of what will happen, but as a summary of what has been reported.

Practical display guidance:

Show the score next to the time window it summarizes (the API returns this explicitly). A 7 over 36 months and a 7 over 90 days are different facts.
Show the underlying incident count or rate when space allows. A 5 backed by 200 incidents per 1,000 residents over 36 months is a stronger claim than a 5 backed by 6 incidents on a sparse block.
Do not synthesize scores across blocks by averaging the digits. Re-query the API with the appropriate radius. Averaging discrete scores destroys the calibration.
Do not display the score with false precision (a 7.34 is a false claim about the data’s resolution; a 7 is honest).
Surface the “data freshness” field. A jurisdiction with a 48-hour reporting lag is producing a different underlying signal than one with a real-time feed, and a transparent consumer surface should reflect that.
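The guidance above might translate into a display helper along these lines. The response keys (score, window_days, incident_count, data_freshness_hours) are hypothetical names chosen for this sketch and should be replaced with whatever the API actually returns.

```python
from typing import Any, Dict


def format_score_label(resp: Dict[str, Any]) -> str:
    """Build a display label that keeps the score next to its context."""
    # Show a whole digit only; decimals overstate the data's resolution.
    digit = int(round(float(resp["score"])))
    parts = [f"SpotScore {digit}/10", f"last {resp['window_days']} days"]
    if resp.get("incident_count") is not None:
        parts.append(f"{resp['incident_count']} reported incidents")
    if resp.get("data_freshness_hours") is not None:
        parts.append(f"data lag ~{resp['data_freshness_hours']}h")
    return ", ".join(parts)
```

The helper formats one response at a time by design: combining several blocks should be a new query at a wider radius, not arithmetic on digits.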

What the score cannot tell you

A SpotScore is a summary of recorded incidents. Recorded incidents are not all incidents. They reflect the events that someone reported, that an agency wrote down, that a records system captured, and that the agency released. Each step is a filter. Domestic violence is systematically underreported. Bicycle theft is often not reported at all in dense urban areas. White-collar offenses against neighborhood businesses are reported inconsistently. The score is therefore a measure of reported crime, not a measure of total crime, and the gap between the two varies by jurisdiction.

The score also cannot tell you about the future on its own. Crime rates are mean-reverting in some contexts and persistent in others; the post-2020 national decline was not predicted from 2019 data, and the 2024 year-over-year drop was not visible until the FBI release. A score is, at most, the most recent data point you have. Treating it as a forecast is a methodological error the API does not make and the integrating product should not introduce.

Finally, a score is not an excuse to skip the underlying incident list. The list is what gives the score interpretive content. A block scored 4 because of a series of vandalism complaints means something different from a block scored 4 because of one serious incident, even though the digit is the same. Surface the incidents. The score is a navigation aid, not a substitute.

Where the methodology is still imperfect

Three open issues are worth naming, because no published safety score solves them and pretending otherwise would be a fair criticism of any methodology document.

Jurisdictional reporting gaps. The 2021–2024 NIBRS transition knocked out federal coverage for a significant share of US agencies, and city-level publication has tightened in some metros (the LAPD being the most-cited recent example). SpotCrime’s direct-from-agency pipeline mitigates this but does not eliminate it: if an agency does not publish, no aggregator can reconstruct what it is not releasing.

Geocoding precision. Some agencies geocode to the address; some to the block face; some to the hundred-block; some to the intersection. The score handles this through smoothing, but at the limit of a sparse block, a single mis-geocoded incident can move a digit. The API exposes a geocoding-confidence flag; developers building high-stakes displays should respect it.

Definitional drift. Agencies change their offense codes and reporting standards. The mapping layer adjusts, but adjustments are not instantaneous, and the historical record cannot be retroactively reclassified without a level of confidence that does not exist for most agency archives. The 36-month window is partly a hedge against this.

The honest version of a one-number score

A safety score is the smallest possible summary of a large, messy, agency-mediated dataset. Done well, it replaces a guess with a number that holds up under inspection. Done badly, it replaces a guess with a different guess that looks authoritative because it has a decimal point. The difference between the two is entirely in the methodology — in which fields are trusted, which categories are mapped, which weights are assigned, which decay is applied, which geography is the unit of analysis, and which calibration anchors the digits.

None of those decisions is a closed question. The methodology document for any honest score is a living artifact, revised when the data changes and when better techniques become available. SpotScore is no exception. The aim is not to publish the right number once. It is to publish a number that is right enough, defensible in its construction, and surfaced with enough context that the consumer can decide how much weight to put on it. That is the most an API of this kind can offer. It should also be the floor.