At Truss, we want to show our customers rich, up-to-date information about every commercial office space on the market.
In theory, this should be easy. When you search for houses for sale on Zillow, Redfin, or your local real estate agent’s website, each site has almost exactly the same results. That’s because residential real estate sites all use the same source: a central, cooperatively maintained database of all residential properties called the MLS.
In commercial real estate, there’s no direct equivalent. Complete and accurate information for all listings isn’t collected in one place. Instead, we have many different sources that vary in completeness, accuracy, format, and coverage. When sources do overlap, they often disagree, and sometimes one brings in information that the others lack.
We set out to make sense of this mess and give the operations team a clean and comprehensive view, so that they can ensure our clients see the most accurate and robust listings.
Think of an office space, and what a potential tenant might want to know about it:
- Building information gives tenants a general sense of a space’s building. Is it small/large? Traditional or creative space? Does the building have a bike room, conference center, gym? Class A/B/C? Where is it located?
- Space availability for a building changes frequently as new spaces come on and off the market. It’s important for us to know that a given space is available so that we can build and expose the content around it.
- Space details enrich a space’s listing, and help tenants determine if a given space is right for them. The more information we have about things like layout, pricing, amenities, style and build-out, the better a tenant is able to determine if a space will meet their needs.
It’s worth noting that these categories are hierarchical. If we don’t know a building exists, we won’t know if it has available spaces. If we don’t know a space is available, then we won’t expose details about that space.
Combining and Reconciling Sources
There’s no single source that gives us all of the information that we’d like to present about each of the categories above. Instead, we often need to build out a record with information from several places.
For example, to build accurate building information, we may need to combine the following sources:
- Public business and deed records
- A high-resolution street photos service
- A highly precise geocoding engine to turn addresses into lat/longs
- A data feed provided by the building owner with lobby details and photos
Space availability gets even trickier. Since space details and availability change frequently, our sources may present conflicting information. For example, for the building at 123 Main Street:
- A listing agent gives us an API feed that says suites 400, 500, and 700 are available
- A regional MLS says only suites 400 and 500 are available
- The building website says that suites 400, 500 and 800 are available
To further complicate matters, most sources don’t give an update when a space comes off the market. Instead, they merely show which spaces are on the market at a given time. Since sources can be out-of-date or stale, it’s not always obvious which source is the truest reflection of the real availability.
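Because a source only reports what is listed right now, a removal has to be inferred by comparing consecutive snapshots from that source. The following is a minimal sketch of that comparison, not a description of our production code; the function name `diff_snapshots` and the snapshot values are hypothetical and reuse the suite numbers from the example above.

```python
from __future__ import annotations


def diff_snapshots(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Compare two availability snapshots taken from the same source."""
    return {
        "added": current - previous,      # suites that newly appear in the feed
        "removed": previous - current,    # suites that dropped out of the feed
        "unchanged": previous & current,  # suites present in both snapshots
    }


# Hypothetical snapshots from one source on two consecutive days.
yesterday = {"400", "500", "700"}
today = {"400", "500"}

changes = diff_snapshots(yesterday, today)
print(changes["removed"])
```

A suite that disappears from a feed is only a hint that it came off the market. The feed itself may simply be stale, which is why the result still has to be weighed against other sources.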
In order to assemble the data and choose the most correct source, we devised a four-step pipeline: incorporation, matching, truth discovery, and integration. Once we’ve run a source through the pipeline, we can be confident that the data has been mapped properly, associated with the right entity, scored for “truthiness” and properly loaded into our master data model.
The following diagram describes how our inventory system transforms raw data into final updates:
The incorporation stage accomplishes two things: retrieving data from the source and transforming that data into a common format in our system. The retrieval process is relatively straightforward. Depending on the source, we either connect directly to an API, listen for batch files pushed to us or listen for messages indicating a manual update.
Once the data is ingested from the source, we run it through transformation logic to marshal it into our data format. Depending on the source, this can be anything from an easy mapping to a more complicated normalization. For each building or space in a source, the data goes into a claim set that holds the source, external ID, update time, and entity attributes. Claim sets are the intermediate data object that we’ll later use as we assemble the information about an entity and apply truth discovery.
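To make the claim set concrete, here is a minimal sketch of what such an object and a simple transformation might look like. The four fields follow the description above; the class name `ClaimSet`, the function `transform_listing`, and the raw feed keys (`id`, `modified`, `suite`, `sqft`) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class ClaimSet:
    """Everything one source claims about one building or space."""
    source: str
    external_id: str
    updated_at: datetime
    attributes: dict[str, Any] = field(default_factory=dict)


def transform_listing(source: str, raw: dict[str, Any]) -> ClaimSet:
    """Map one raw record from a hypothetical feed into the common format."""
    return ClaimSet(
        source=source,
        external_id=str(raw["id"]),
        updated_at=datetime.fromisoformat(raw["modified"]),
        attributes={
            "suite": raw["suite"].strip(),     # normalize stray whitespace
            "square_feet": int(raw["sqft"]),   # normalize numeric strings
        },
    )


raw = {"id": 9812, "modified": "2018-03-01T12:00:00+00:00", "suite": " 400 ", "sqft": "2500"}
claim_set = transform_listing("agent_feed", raw)
print(claim_set)
```

Keeping the source and external ID on every claim set means a later stage can always trace a value back to the record it came from.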
After incorporation, we’re left with many claim sets from each source. Before we can evaluate these claims against each other and pick the most “truthful,” we have to match them against existing entities on the platform.
The matching phase associates each unmatched claim set with an entity in the Truss data model or, if no existing entity matches, creates a new one. To do this, we run each unmatched claim set through a matching algorithm that compares it against each entity and returns the probability that the claim set matches that entity. If a matching probability exceeds the threshold that we set, we consider the claim set and the entity to be a match and create an association between them.
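The match-or-create flow can be sketched as follows. This is an illustration under stated assumptions rather than our matching algorithm: the score here is plain string similarity on a normalized address (via the standard library's `difflib.SequenceMatcher`), which is not a calibrated probability, and the threshold value, the ID scheme, and the choice to take the best-scoring entity when several qualify are all hypothetical.

```python
from __future__ import annotations

from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9  # hypothetical value, chosen only for this example


def match_score(claim_address: str, entity_address: str) -> float:
    """Stand-in score in [0, 1]: similarity of the normalized addresses."""
    a = " ".join(claim_address.lower().split())
    b = " ".join(entity_address.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def match_or_create(claim_address: str, entities: dict[str, str]) -> str:
    """Return the ID of the best-matching entity, creating one if none qualifies."""
    best_id, best_score = None, 0.0
    for entity_id, address in entities.items():
        score = match_score(claim_address, address)
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_id is not None and best_score > MATCH_THRESHOLD:
        return best_id
    new_id = f"building-{len(entities) + 1}"  # hypothetical ID scheme
    entities[new_id] = claim_address
    return new_id


entities = {"building-1": "123 Main Street", "building-2": "500 Oak Avenue"}
matched = match_or_create("123  main street", entities)
created = match_or_create("77 Harbor Way", entities)
print(matched, created)
```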
3. Truth Discovery
Following the matching stage, each claim set is now linked to an entity and we can start analyzing competing claim sets together as a group.
Since these claim sets may have conflicting information, we must now “summarize” these claims with a probability score that we can use to determine which source is authoritative for a given attribute. Truth discovery is a topic of emerging literature (Li et al., 2015), and a full description of our algorithm is a topic for a future post. In general, our system uses the following data on the claim sets, among other signals, to determine which source to pick:
- Last update: The more recent a source is, the more likely it is to be correct
- Authority of this source throughout our database: A source that is heavily used and rarely overridden by our ops team is more likely to be correct than a source that rarely shows up
- Manual source weighting: Some sources are “guaranteed” to be correct immediately after an update — for example, a manually entered value following a conversation with the building owner. These are weighted highly at the time of update and then decayed as time passes
- Agreement of this source with other sources: When we have multiple sources for the same field value, an outlier value is less likely (though not impossible) to be correct than the consensus
The truth discovery model outputs a tuple of (value, probability) for each combination of field and source for an entity. The enumeration of these tuples across every field and source is called the claims summarization. This object is then passed to the integration stage for final business logic and the entity update.
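As a rough illustration of how signals like these could combine into a (value, probability) tuple per source, here is a toy scoring sketch for a single field. It is not our algorithm, which is left for a future post. The multiplicative formula, the `HALF_LIFE_DAYS` and `MANUAL_BOOST` constants, and the `Claim` fields are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0  # hypothetical: a claim's weight halves every 30 days
MANUAL_BOOST = 3.0     # hypothetical: extra weight for manually entered values


@dataclass
class Claim:
    source: str
    value: str
    age_days: float     # time since the source last updated this value
    authority: float    # 0..1, how reliable the source has been overall
    manual: bool = False


def summarize(claims: list[Claim]) -> dict[str, tuple[str, float]]:
    """Return source -> (value, probability) for one field of one entity."""
    raw: dict[str, tuple[str, float]] = {}
    for claim in claims:
        recency = 0.5 ** (claim.age_days / HALF_LIFE_DAYS)
        agreement = sum(1 for other in claims if other.value == claim.value) / len(claims)
        # The manual boost is multiplied by recency, so it decays as time passes.
        boost = MANUAL_BOOST if claim.manual else 1.0
        raw[claim.source] = (claim.value, recency * claim.authority * agreement * boost)
    total = sum(score for _, score in raw.values())
    if total == 0:
        # No usable signal: fall back to equal probabilities.
        return {source: (value, 1.0 / len(raw)) for source, (value, _) in raw.items()}
    return {source: (value, score / total) for source, (value, score) in raw.items()}


claims = [
    Claim("agent_feed", "available", age_days=1, authority=0.9),
    Claim("regional_mls", "unavailable", age_days=20, authority=0.6),
    Claim("building_website", "unavailable", age_days=60, authority=0.5),
]
summary = summarize(claims)
for source, (value, probability) in summary.items():
    print(source, value, round(probability, 3))
```

In this toy data the fresh, high-authority agent feed ends up with the highest probability even though the two older sources agree with each other, which shows how recency and authority can outweigh consensus.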
In the integration stage, we update the data from the claims summarization into the entity. In principle, this stage can be trivial — we could simply select the highest probability claim from the claims summarization for a given entity and update that entity.
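The trivial version of this stage fits in a few lines. The sketch below assumes the claims summarization is shaped as a nested dictionary of field, then source, then (value, probability); that shape, the function name `integrate`, and the sample numbers are hypothetical.

```python
from __future__ import annotations

from typing import Any


def integrate(
    entity: dict[str, Any],
    summarization: dict[str, dict[str, tuple[Any, float]]],
) -> dict[str, Any]:
    """Return a copy of the entity updated with the most probable value per field."""
    updated = dict(entity)
    for field_name, by_source in summarization.items():
        if not by_source:
            continue  # no claims for this field, keep the existing value
        value, _probability = max(by_source.values(), key=lambda pair: pair[1])
        updated[field_name] = value
    return updated


entity = {"suite": "400", "square_feet": 2400}
summarization = {
    "square_feet": {
        "agent_feed": (2500, 0.7),
        "building_website": (2400, 0.3),
    },
}
updated = integrate(entity, summarization)
print(updated)
```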
In practice, however, we’ve allowed for some business logic to ensure that we can manage the system effectively, maintain control over edge cases, and analyze source performance. Therefore the integration layer handles a few other nice-to-haves that allow our operations team to gain transparency into the system and manage special circumstances.
In addition to selecting the highest probability value, the integration stage also handles the following:
- Provides a master toggle at varying levels of granularity: source-by-source, market-by-market, space-by-space. This greatly simplifies the rollout process for new sources
- Publishes the full claims summarization to our analytics tools so that our operations team can easily audit which value the model picked along with which values were unused
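One way to picture these two responsibilities together is the sketch below. It assumes toggles are stored as off switches keyed by level and identifier, with anything unlisted treated as enabled, and that the audit output is a list of per-source records. Those choices, along with the names `is_enabled` and `integrate_field`, are hypothetical.

```python
from __future__ import annotations

from typing import Any

Toggles = dict[tuple[str, str], bool]


def is_enabled(toggles: Toggles, source: str, market: str, space_id: str) -> bool:
    """A claim is usable only if no toggle that applies to it is switched off."""
    keys = [("source", source), ("market", market), ("space", space_id)]
    return all(toggles.get(key, True) for key in keys)


def integrate_field(
    claims: dict[str, tuple[Any, float]],
    toggles: Toggles,
    market: str,
    space_id: str,
) -> tuple[Any, list[dict[str, Any]]]:
    """Pick the most probable enabled claim and build an audit trail of all claims."""
    eligible = {
        source: claim
        for source, claim in claims.items()
        if is_enabled(toggles, source, market, space_id)
    }
    chosen = max(eligible, key=lambda s: eligible[s][1]) if eligible else None
    audit = [
        {"source": source, "value": value, "probability": probability, "used": source == chosen}
        for source, (value, probability) in claims.items()
    ]
    value = eligible[chosen][0] if chosen is not None else None
    return value, audit


# A new source is still being rolled out, so it is switched off for now.
toggles = {("source", "new_feed"): False}
claims = {"new_feed": (2600, 0.6), "agent_feed": (2500, 0.4)}
value, audit = integrate_field(claims, toggles, market="chicago", space_id="space-42")
print(value, audit)
```

Even though the disabled source has the highest probability, its value is not applied, yet it still appears in the audit records so the operations team can see what the model would have picked.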
As we evolve the system, we may use the integration stage as a testing ground for new manual logic before it’s incorporated into the model; however, the end goal is to keep the integration stage as lightweight as possible.
Keeping up with a variety of sources that each have different strengths and weaknesses is complicated, but it is essential to creating a scalable, rich, and robust set of inventory data. By designing a system from scratch to accommodate the steps of incorporation, matching, truth discovery, and integration, we’ve greatly simplified the process and ensured that we’re ready to add new sources and better methodologies well into the future.