Enigma Blog

Can Business Data Predict March Madness?

Enigma — Thu, 09 Apr 2026 00:00:00 GMT

Enigma has a database of 31 million U.S. business locations. The database includes card transaction revenue, industry classifications, business registrations, and a lot of other things that have absolutely nothing to do with basketball. So naturally we tested all of it against the results of NCAA basketball tournaments, past and present, to see if any variety of business data somehow correlates with winners (and then worry about causation later).

So, we ran 590 business metrics against every March Madness game over the last ten tournaments: pizza shop revenue, dispensaries per church, the ratio of Pilates studios to BBQ joints, the number of people who list themselves as CEO on their business registration. For each game, we checked which team’s home territory (city and state) had more of a given metric, and whether that team won.
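The per-game check described above can be sketched in a few lines. This is a minimal illustration, not Enigma's actual pipeline: the `Game` class, the `pick_accuracy` function, the team labels, and the metric values are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Game:
    team_a: str
    team_b: str
    winner: str


def pick_accuracy(games, metric_by_team):
    """Share of games won by the team whose home area has the higher metric value."""
    decided = 0
    correct = 0
    for game in games:
        a = metric_by_team.get(game.team_a)
        b = metric_by_team.get(game.team_b)
        if a is None or b is None or a == b:
            continue  # the metric cannot pick a side for this game
        pick = game.team_a if a > b else game.team_b
        decided += 1
        correct += pick == game.winner
    return correct / decided if decided else float("nan")


# Toy data with invented team labels and metric values.
games = [
    Game("Team A", "Team B", "Team A"),
    Game("Team C", "Team D", "Team C"),
    Game("Team A", "Team C", "Team A"),
    Game("Team B", "Team D", "Team B"),
]
delis_per_capita = {"Team A": 1.0, "Team B": 3.0, "Team C": 2.0, "Team D": 0.5}
accuracy = pick_accuracy(games, delis_per_capita)
```

Games where the metric is missing or tied are skipped rather than counted as misses, which is one of several reasonable conventions.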

Some of the results were genuinely strange.

The ratio of divorce lawyers to wedding venues, for instance, happened to pick the winning team in 62% of games this year. It also picked winners at a 61% rate across the previous five tournaments. Of all the strange metrics we tested, this was the one that kept showing up, year after year, for reasons we cannot begin to explain. Delis per capita went 66%. Average pizza shop revenue, measured from actual card transactions, went 64%. We have no idea why.

Test 590 things and some will look good by chance. Different metrics spike and collapse from year to year, and nothing here constitutes a prediction system. But as a way to fill out a bracket when you don't know anything about basketball, or to file a side bracket next to your real one just for the absurdity of it, we think it's pretty hard to beat. So we also made a website for fans and anti-fans alike to explore spurious connections between business data and the world of sports: https://bracketology.enigmacorp.dev/
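To put a rough number on the "some will look good by chance" caveat, here is a small sketch that treats every metric as a fair coin flip and asks how many of 590 would reach a 62% hit rate in a single tournament. The 67-game tournament size and the independence of the metrics are assumptions made for illustration, and `tail_probability` is a hypothetical helper.

```python
from math import ceil, comb


def tail_probability(n_games, hit_rate, p=0.5):
    """Exact binomial probability of at least ceil(hit_rate * n_games) correct picks."""
    k = ceil(hit_rate * n_games)
    return sum(
        comb(n_games, i) * p**i * (1 - p) ** (n_games - i)
        for i in range(k, n_games + 1)
    )


N_METRICS = 590  # from the post
N_GAMES = 67     # assumption: one tournament, First Four included

p_lucky = tail_probability(N_GAMES, 0.62)
expected_lucky = N_METRICS * p_lucky
print(f"P(one coin-flip metric hits 62%): {p_lucky:.3f}")
print(f"Expected coin-flip metrics at 62% or better: {expected_lucky:.1f}")
```

Under these assumptions, more than a handful of purely random metrics would be expected to clear 62% in any given year, which is why a single-year hit rate says little on its own.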

Business Data in College Towns

The best part of this project wasn’t the bracket-building itself, but searching through the textured landscape of data about each place where a March Madness university resides. Four cities made the Final Four, and each one has a very specific commercial character.

Ann Arbor, Michigan (Champion)

Coffee shops earning $401,000 a year in card revenue, more than double the rate in Storrs.
96 dispensaries within 20 miles.
3.8 sushi restaurants per BBQ joint, 95th percentile nationally.
Only one Chick-fil-A (located in the Michigan Union, where John F. Kennedy gave the 1960 campaign speech that proposed what became the Peace Corps).

Storrs, Connecticut (Runner-up)

One Starbucks for every 17 Dunkin' locations.
Pizza shops averaging $424,000 a year in card revenue.
The highest LLC formation rate of any tournament area at 64.5%.

Tucson, Arizona (Final Four)

283 tattoo parlors within 20 miles, more than the other three Final Four cities combined.
69 businesses with “Wildcat” in the name, suggesting real hometown spirit.
21 florists per funeral home.

Champaign, Illinois (Final Four)

One Chick-fil-A within 20 miles, bottom 2% nationally.
Four laundromats per dry cleaner.
The highest casino-to-financial-advisor ratio of any Final Four city.
Tattoo parlors charging an average of $340 per transaction, three times the rate in Tucson.

What Did Winning Cities Have in Common?

Compared to the teams they eliminated in the Elite Eight, this year's Final Four cities had:

93% fewer people calling themselves CEO on their business registrations. Champaign had 70. Durham, home of the Duke team they beat, had 11,624. Ann Arbor had 172. West Lafayette, Purdue's home, had 657.

The Final Four cities also had far more dispensaries per religious organization, and far more Thai restaurants per Mexican restaurant. Again: we are not claiming that humility, cannabis access, or Thai food wins basketball games. But we did run the numbers.

About the Championship Game

Going into the final, nine of the top ten metrics sided with UConn. Connecticut had more delis, higher pizza shop revenue, more divorce lawyers per wedding venue, and fewer haunted houses (4 versus Michigan's 30).

Michigan won anyway.

The one top-ten metric that sided with Michigan: average business revenue growth. Michigan's businesses were growing faster than Connecticut's. As for the rest of this analysis, make of it what you will.

One More Thing

91 of our 590 metrics happened to pick the exact Final Four. Among them: psychics per capita. We'll leave it there, but you can learn much more in last year’s newsletter spotlight on the mystical-industrial complex.

And as for retrospective bracketology, you can explore the full dataset at bracketology.enigmacorp.dev. Next March, when someone invites you to a bracket pool and you've never watched a game in your life, you'll know where to go.

Enigma Government Archive

Hicham Oudghiri — Thu, 26 Mar 2026 00:00:00 GMT

Government agencies produce the most detailed records about how businesses operate in America. Registrations, inspections, compliance filings, licenses, liens, permits. Thousands of agencies at every level of government, and no two of them organize it the same way.

None of it is connected. A restaurant's health inspection in New York looks nothing like its food safety record in Chicago, which looks nothing like its liquor license in Maryland. All three describe the same business.

We've spent over a decade building our business identity graph: entity resolution, verified identities, card revenue signals. But thousands of government databases hold information that structured integration can't reach. There are too many formats to map one by one. We needed a different approach.

That's why we built Gov Archive.

Every Gov Archive search is anchored in verified business identity: the same graph that powers our KYB and compliance products. The system knows which business it's looking for: name variations, legal entities, DBAs, all of it. A text search returns mentions. Gov Archive returns matched records.

Inside we have billions of records across thousands of datasets, and the collection keeps growing. OSHA citations, EPA compliance, cannabis licensing, UCC filings, building permits, government contracts, and thousands more. Each source refreshes on its own cadence, and the system tracks how current each one is.

See it for yourself: at enigma.com/govarchive you can profile datasets from FDIC bank failures to EPA toxic releases to state cannabis licenses. Browse by agency, state, or risk domain.

The system resolves a business to a verified entity, expands to every known name variation, then searches across billions of government records. The results return as a single profile, deduplicated across jurisdictions. Every data point traces to the specific agency, dataset, and filing it came from.
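The resolve, expand, search, and deduplicate flow described above can be pictured with a toy in-memory version. Everything here is hypothetical and is not Gov Archive's API or implementation: the `Record` class, the `NAME_VARIANTS` table standing in for an identity graph, and the sample filings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    business_name: str
    agency: str
    dataset: str
    filing_id: str


# Hypothetical stand-in for an identity graph: entity id -> known name variations.
NAME_VARIANTS = {
    "entity-1": {"Example Pizza LLC", "Example Pizza", "EP Holdings LLC"},
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().split())


def build_profile(entity_id, records):
    """Return matched records for one entity, deduplicated by source citation."""
    variants = {normalize(v) for v in NAME_VARIANTS.get(entity_id, set())}
    seen = set()
    profile = []
    for rec in records:
        if normalize(rec.business_name) not in variants:
            continue  # not a known name for this entity
        citation = (rec.agency, rec.dataset, rec.filing_id)
        if citation in seen:
            continue  # same filing surfaced twice
        seen.add(citation)
        profile.append(rec)
    return profile


records = [
    Record("example pizza", "City Health Dept", "inspections", "H-100"),
    Record("Example  Pizza", "City Health Dept", "inspections", "H-100"),
    Record("EP Holdings LLC", "OSHA", "citations", "O-7"),
    Record("Unrelated Bakery", "OSHA", "citations", "O-8"),
]
profile = build_profile("entity-1", records)
```

Each surviving record keeps its agency, dataset, and filing id, which is the toy equivalent of tracing every data point back to its source.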

A single query can return health inspections across multiple cities, EPA enforcement records, OSHA citations, cannabis licenses linking LLCs to trade names, building permits, UCC filings, and government contract spending.

Research that once meant navigating dozens of agency portals, covering only the sources you knew to check, now returns a single deduplicated profile in seconds, with every data point linked to the original filing.

An EIN discovered in a state incentive filing or an address confirmed through a permit record flows back into the identity graph as verified data.

The system cuts deep into the bedrock datasets that matter for enhanced due diligence: OSHA violations, EPA compliance, active litigation, licensing status across jurisdictions. These are signals standard commercial sources don't carry, returned fully cited.

For compliance teams: when automated KYB checks run out of data to act on, Gov Archive surfaces the evidence that closes cases faster. One example: a major payroll platform needed to verify cannabis dispensaries across state-by-state licensing regimes. The records didn't exist in any commercial database, so we sourced licenses from every state and made them queryable in weeks.

Gov Archive is also available as an MCP server. That means AI agents can query it directly, resolving the entity, searching the archive, and returning a sourced profile as a native step in any MCP-compatible compliance or due diligence workflow.

The collection keeps growing as our system gets better at handling unfamiliar formats. New sources, new jurisdictions, new record types, all interconnected when they enter the archive.

Government records are the ground truth for what businesses actually do. Not what they report. Not what they claim. For decades, that ground truth was scattered across incompatible systems with no unified way to search it. The archive is our answer to that. It grows every day, and so does the landscape of records it can surface.

Enigma Newsletter: Wyoming's Corporate Boom

Enigma — Thu, 12 Mar 2026 00:00:00 GMT

Welcome back to the Enigma newsletter. This week, we're looking at Wyoming's quiet rise as America's fastest-growing corporate jurisdiction — and what it means for anyone who has to verify the businesses they work with.

But first, your weekly data points.

Small business uncertainty declined in February for the first time in months. NFIB's optimism index suggests Main Street is cautious but stabilizing heading into Q1. (via NFIB)
Credit bureaus are resolving fewer consumer disputes since CFPB enforcement pulled back. Experian and TransUnion are leaving more errors on reports, a reminder that data accuracy degrades without accountability. (via ProPublica)
A congressional investigation estimates data broker breaches have cost consumers $20 billion in identity theft losses. Major brokers are now promising easier opt-outs under pressure. (via The Markup)
A New Jersey Girl Scout troop set up outside a cannabis dispensary and sold their whole load of cookies in one afternoon. Most GTM teams spend a quarter figuring out what these kids knew in five minutes. (via Marginal Revolution)

Wyoming's Corporate Boom

Another tiny state is quietly becoming America's corporate jurisdiction of choice

Every April, business owners, accountants, and formation agents file millions of new LLCs and corporations across the United States. Most choose their home state. A growing number choose Wyoming.

Our analysis of 13.3 million registered business entities indicates that Wyoming's annual corporate formation volume has grown nearly 7.9x — from 2,774 new entities in 2015 to 21,921 in 2024. Wyoming now processes the equivalent of roughly 59% of Delaware's annual volume, up from just 12% a decade ago.

Delaware Peaked. Wyoming Didn't.

Delaware remains the preeminent corporate jurisdiction for Fortune 500 companies and VC-backed startups. Its Court of Chancery, established case law, and corporate-friendly statutes have made it the default home for institutional capital for over a century.

But Delaware's formation numbers have stalled. Annual formations in our dataset peaked at 42,940 in 2021, a surge driven in part by the pandemic-era business formation boom. Delaware’s formation numbers subsequently declined to 36,949 in 2024. Wyoming's formations, meanwhile, have continued rising, growing by roughly 50% between 2022 and 2024 alone.
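The headline ratios can be reproduced directly from the formation counts quoted above. This is just a worked arithmetic check; the variable names are ours.

```python
# Formation counts quoted in the text.
wy_2015, wy_2024 = 2_774, 21_921
de_2021, de_2024 = 42_940, 36_949

growth_multiple = wy_2024 / wy_2015         # Wyoming growth, 2015 to 2024
wy_share_of_de = wy_2024 / de_2024          # Wyoming volume relative to Delaware, 2024
de_decline = (de_2021 - de_2024) / de_2021  # Delaware's drop from its 2021 peak

print(f"Wyoming growth multiple: {growth_multiple:.1f}x")
print(f"Wyoming as share of Delaware (2024): {wy_share_of_de:.0%}")
print(f"Delaware decline since 2021 peak: {de_decline:.0%}")
```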

The composition of each state's corporate roster tells part of the story. Wyoming skews heavily toward flexibility: 82% of its entities are LLCs. Delaware has a more mixed profile — 59% LLCs and 33% traditional corporations — reflecting its continued role as the C-corp home for venture-backed companies.

The Privacy Angle

What draws entrepreneurs and small businesses to Wyoming isn't Delaware-style legal prestige. It's something more practical: privacy and low costs.

Wyoming offers features that Delaware doesn't match for small operators:

No state income tax on LLCs
$52 minimum annual fee (vs. Delaware's minimum $300 franchise tax)
Anonymous ownership: Wyoming LLCs are not required to disclose member names in most public filings
Strong charging order protections: creditors face higher barriers to piercing the entity to reach owners' personal assets

The privacy orientation shows up in the registration data. Our analysis indicates that 93% of Wyoming-incorporated entities have never registered in another state. They appear only in Wyoming's records.

Compare that to Delaware, where 32% of incorporated entities have expanded to at least one additional state — and among those that do, the average footprint spans 2.9 states.

Registration patterns suggest Wyoming entities are less likely to be operating businesses expanding their footprint and more likely to be purpose-built holding vehicles, investment entities, or structures where anonymity is a feature — though single-state registration is also consistent with businesses that simply operate locally.

The Compliance Angle

For compliance and KYB teams, Wyoming's rise presents a practical challenge.

Delaware's public records, while imperfect, are searchable. Wyoming offers less. When a Wyoming LLC appears on a customer application or in a beneficial ownership disclosure, the public trail is often thin: no officer names in state filings, limited registered agent details, no operating location data.

As Wyoming formations keep growing — our data suggests no slowdown, with 2024 matching 2023's record pace — the share of business entities with limited public registration footprints increases accordingly.

The Bottom Line

Wyoming's formation surge reflects a rational market response: when privacy is a selling point and costs are low, formation agents route clients accordingly.

In 2015, Wyoming processed one new corporate entity for every eight formed in Delaware. In 2024, the ratio is closer to one for every two.

For anyone who needs to verify the businesses they work with, the state on an LLC filing matters more than it used to.

Methodology

Data source: Enigma registered entities dataset, as of Q1 2026

Dataset: 13.3 million registered entity records with formation dates, home states, and cross-state registration patterns

Formation trend: Annual counts of entities with home_state matching Wyoming or Delaware, filtered to formation years 2015-2024.

Entity type breakdown: Categorized by entity type for all entities with home_state in Wyoming or Delaware.

Single-state vs. multi-state: Entities with zero foreign registrations are classified as single-state. Entities with one or more foreign registrations have expanded to additional states.

Limitations: Formation counts reflect registrations in our dataset and may not match state-published totals exactly. Relative trends and growth rates are directionally reliable even where absolute totals differ from official state counts.
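As a rough sketch of the formation-trend and single-state steps described in this methodology: the record layout below is an assumption. Only `home_state` is named in the text; `formation_year` and `foreign_registrations` are hypothetical field names. Counting the home state as part of an entity's footprint is also our assumption.

```python
from collections import Counter


def annual_formations(records, state, first_year=2015, last_year=2024):
    """Count entities per formation year for one home state, within the year window."""
    counts = Counter(
        r["formation_year"]
        for r in records
        if r["home_state"] == state and first_year <= r["formation_year"] <= last_year
    )
    return {year: counts.get(year, 0) for year in range(first_year, last_year + 1)}


def footprint_summary(records, state):
    """Single-state share, plus average footprint among multi-state entities."""
    in_state = [r for r in records if r["home_state"] == state]
    if not in_state:
        raise ValueError(f"no records for {state}")
    multi = [r for r in in_state if r["foreign_registrations"] > 0]
    # Footprint counts the home state plus each foreign registration (assumption).
    avg = sum(1 + r["foreign_registrations"] for r in multi) / len(multi) if multi else None
    return {
        "single_state_share": 1 - len(multi) / len(in_state),
        "avg_states_among_multi": avg,
    }


# Toy records, invented for illustration.
toy_records = [
    {"home_state": "WY", "formation_year": 2015, "foreign_registrations": 0},
    {"home_state": "WY", "formation_year": 2024, "foreign_registrations": 0},
    {"home_state": "WY", "formation_year": 2024, "foreign_registrations": 1},
    {"home_state": "WY", "formation_year": 2012, "foreign_registrations": 0},
    {"home_state": "DE", "formation_year": 2021, "foreign_registrations": 3},
    {"home_state": "DE", "formation_year": 2024, "foreign_registrations": 0},
]
wy_trend = annual_formations(toy_records, "WY")
wy_footprint = footprint_summary(toy_records, "WY")
de_footprint = footprint_summary(toy_records, "DE")
```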

Enigma Newsletter: No Branches, No Verification, No Problem

Enigma — Thu, 26 Feb 2026 00:00:00 GMT

No Branches, No Verification, No Problem

Inside Colorado's Money Services Boom

At 1312 17th Street in Denver, across the street from a sandwich shop and a cannabis dispensary, there's a PostNet store that serves as a commercial mailbox and shipping franchise. It's also one of the busiest money services addresses in America. 941 entities have registered with the federal government as Money Services Businesses (MSBs) using this single storefront, accounting for about 21% of all MSB registrations in Colorado.

Nearly all report zero branch locations. Nearly all claim to operate in every state and territory. And FinCEN — the Treasury Department bureau that maintains the registry — does not verify the information submitted.
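One simple way to surface an address like this is to normalize each registrant's address and count repeats. The sketch below assumes hypothetical `street`, `city`, and `state` fields and uses placeholder rows. It does not handle suite or mailbox numbers, which a real analysis would need to strip or standardize.

```python
import re
from collections import Counter


def normalize_address(street, city, state):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", "", f"{street} {city} {state}".lower())
    return " ".join(text.split())


def shared_addresses(registrations, min_count=2):
    """Addresses used by at least min_count registrants, most common first."""
    counts = Counter(
        normalize_address(r["street"], r["city"], r["state"]) for r in registrations
    )
    return [(addr, n) for addr, n in counts.most_common() if n >= min_count]


# Placeholder rows, not real registrants.
rows = [
    {"street": "100 Example St.", "city": "Denver", "state": "CO"},
    {"street": "100 example st", "city": "denver", "state": "co"},
    {"street": "100 Example St", "city": "Denver", "state": "CO"},
    {"street": "5 Other Ave", "city": "Boulder", "state": "CO"},
]
clusters = shared_addresses(rows)
```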

What the data indicates

The federal Bank Secrecy Act requires money services businesses (check cashers, money transmitters, currency exchanges, and similar operations) to register with FinCEN. The registry is a cornerstone of anti-money-laundering enforcement. It's how the government knows who's moving money, where, and through what channels. As of February 2026, it contained roughly 35,800 active registrations.

Our analysis of FinCEN's publicly available data (cross-checked with Enigma’s business identity graph) indicates a striking geographic concentration. Colorado has 78.7 MSB registrations per 100,000 residents — 5 times California's rate (14.7) and 11 times New York's (7.1).
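The per-capita comparison is simple arithmetic and can be checked from the quoted rates. The sketch also backs out an implied Colorado registration count using the state's 2020 Census population (5,773,714). That count is our estimate, not a figure from the analysis, but it is consistent with the roughly 21% share attributed to the Denver address.

```python
def per_100k(count, population):
    """Registrations per 100,000 residents."""
    return count / population * 100_000


# Rates quoted in the text (registrations per 100,000 residents).
co_rate, ca_rate, ny_rate = 78.7, 14.7, 7.1
co_vs_ca = co_rate / ca_rate
co_vs_ny = co_rate / ny_rate

# Implied count: an estimate derived from the quoted rate, not a sourced figure.
CO_POPULATION_2020 = 5_773_714
implied_co_registrations = co_rate * CO_POPULATION_2020 / 100_000
single_address_share = 941 / implied_co_registrations

print(f"Colorado vs. California: {co_vs_ca:.1f}x")
print(f"Colorado vs. New York: {co_vs_ny:.1f}x")
print(f"Implied Colorado registrations: {implied_co_registrations:.0f}")
print(f"Share at one address: {single_address_share:.0%}")
```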

But the per-capita figure only tells part of the story. Colorado's MSB registrants also look different from the rest of the country on nearly every dimension:

96.3% report zero branch locations, compared to 76.6% in other states.
89.6% claim U.S. Postal Service activity, compared to 10.1% in other states.
44.8% check all eight activity categories on the registration form (issuer of traveler's checks, seller of money orders, check casher, money transmitter, prepaid access seller, currency dealer, and more) compared to just 3.6% in other states.
46.5% of all registrants who claim to operate in every U.S. state and territory list a Colorado address.
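Comparisons like the ones in this list come down to computing the share of registrants matching a condition, once for Colorado and once for everyone else. A minimal sketch follows, with a hypothetical `Registrant` layout and invented toy rows. The only value taken from the text is the count of eight activity categories.

```python
from dataclasses import dataclass

ALL_ACTIVITY_CATEGORIES = 8  # number of activity categories on the form, per the text


@dataclass
class Registrant:
    state: str
    branch_count: int
    activity_count: int
    claims_all_states: bool


def share(registrants, predicate):
    """Fraction of registrants satisfying predicate; NaN for an empty group."""
    if not registrants:
        return float("nan")
    return sum(1 for r in registrants if predicate(r)) / len(registrants)


def profile(registrants):
    return {
        "zero_branches": share(registrants, lambda r: r.branch_count == 0),
        "all_activities": share(
            registrants, lambda r: r.activity_count == ALL_ACTIVITY_CATEGORIES
        ),
        "all_states": share(registrants, lambda r: r.claims_all_states),
    }


# Toy rows, invented for illustration.
toy = [
    Registrant("CO", 0, 8, True),
    Registrant("CO", 0, 3, True),
    Registrant("TX", 2, 1, False),
    Registrant("NY", 0, 1, False),
]
co_profile = profile([r for r in toy if r.state == "CO"])
other_profile = profile([r for r in toy if r.state != "CO"])  # baseline excludes Colorado
```

Excluding Colorado from the baseline mirrors the comparison design stated in the methodology, so Colorado's own registrants do not dilute the contrast.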

What's driving this surge? The registrants' names offer a clue: of the 570 MSBs nationwide with cryptocurrency-related names, 326 (57%) are registered in Colorado.

Why Colorado?

This concentration of MSBs traces to a specific policy decision. In 2018, Colorado's Division of Banking issued administrative guidance interpreting the state's Money Transmitters Act as not applying to businesses that transmit only cryptocurrency. Since crypto is not legal tender, the Division reasoned that transmitting it is not "money transmission" under state law.

The state legislature had actually rejected an explicit exemption bill that same year (SB18-277 failed 15-20 in the Senate). But the administrative guidance achieved much the same effect without a vote. And when Colorado overhauled its money transmission law in 2025 (HB25-1201), it excluded the optional virtual currency provisions from the model act — maintaining the favorable treatment.

The result is a regulatory asymmetry: crypto businesses operating from Colorado don't need a state money transmitter license, but they still need to register with FinCEN at the federal level. That makes Colorado the path of least resistance for many businesses. A virtual mailbox in Denver starts at roughly $10 per month. FinCEN's Form 107 is self-reported and, as a 2016 Treasury Inspector General audit found, the information submitted is not verified.

Other states with crypto-friendly frameworks also show elevated registration rates. Wyoming (60 per 100K) and Delaware (62 per 100K) rank near the top as well, but neither approaches Colorado's combination of high volume, maxed-out activity codes, and USPS claims.

What this suggests

This pattern does not, on its own, indicate fraud. Zero reported branches is normal for online businesses. Claiming all states is permitted on the form. And checking multiple activity codes may simply reflect the breadth of crypto-related services they offer.

But when WIRED reported in May 2025 that Xinbi Guarantee — a Telegram-based marketplace that facilitated $8.4 billion in illicit cryptocurrency transactions — had been incorporated in Aurora, Colorado, the same regulatory environment was at work: a state that does not require crypto businesses to obtain a money transmitter license.

The MSB registry exists to help law enforcement track money services activity. Registrations expire after two years, but the monthly volume of new filings has increased sharply since late 2024.

But when 941 of those registrations share a single mailbox address and claim to do everything everywhere, the registry may be generating noise rather than the signal it was designed to provide.

Methodology

Data source: FinCEN MSB Registrant data (msb.fincen.gov), downloaded February 2026; 35,821 active registrations after deduplication. Population figures from 2020 U.S. Census. State-level comparisons use registrants with a valid U.S. state address; Colorado-versus-other-states comparisons exclude Colorado from the baseline. Registration data is self-reported — FinCEN does not verify information submitted on Form 107. Analysis methodology and code available on request. All thanks and complaints will be gratefully received at Enigma.

Enigma Newsletter: The Drug Testing Lab Industry

Enigma — Thu, 12 Feb 2026 00:00:00 GMT

The Drug Testing Divide

In December 2024, ProPublica reported on widespread accuracy problems at Averhealth, a drug testing company used in Michigan child welfare cases. Faulty tests led to false positives that separated parents from their children — sometimes for months — before these errors were discovered.

The investigation raised questions about quality control across an industry that affects millions of workers, job applicants, and families involved in the criminal justice system. How is this industry structured? And who's doing the testing?

Enigma’s analysis of 1,289 drug testing labs reveals a fragmented industry with striking geographic patterns. The South has more than twice as many labs per capita as the Northeast, while states where marijuana remains illegal have about 50% higher lab density than legal states. And the distribution doesn’t follow simple explanations.

The Geographic Pattern

The South leads the nation with 4.0 drug testing labs per million residents — 2.5 times more than the Northeast’s 1.6 per million. But the map reveals a pattern that cuts across regional lines: states with large oil, gas, and mining industries cluster near the top regardless of region. Wyoming — a state with a small population where energy extraction dominates the economy — leads the nation by a wide margin. Other states focused on energy industries like Colorado, Louisiana, North Dakota, and Montana round out the top five. Federal DOT and pipeline safety regulations require drug testing for many roles in these industries, which may help explain the concentration.

Colorado’s position is particularly striking: it was one of the first two states to legalize recreational marijuana, in 2012, yet it has 8.2 drug testing labs per million, more than five times New York’s 1.6. The explanation likely lies in Colorado’s Western Slope, where labs in cities like Rifle and Grand Junction serve the oil and gas workforce.

Texas has the most labs in absolute terms (157), but its per-capita rate of 5.1 per million is only middle of the pack. Meanwhile, Illinois — one of the nation’s largest economies — has just 9 labs statewide (0.7 per million), one of the lowest rates in the country. Maine has zero drug testing labs due to a unique policy requiring samples to be sent to certified facilities in other states.

The Cannabis Correlation

States where recreational marijuana is not legal have 3.8 labs per million — 1.5 times more than legal states’ 2.5 per million. You might expect the opposite: more testing in states where marijuana is legal as employers try to enforce workplace drug policies. Instead, legalization appears to be associated with reduced demand for testing services.

But the pattern isn’t uniform. Some states with legal marijuana (like Colorado) still have high lab density, while some prohibition states have relatively few labs.

Explaining Regional Patterns

Several factors could contribute to the South’s higher testing rates:

Industry composition: States with large transportation, energy, or manufacturing sectors may have more testing because federal safety regulations (DOT, PHMSA) require it for certain roles—though we cannot verify which industries our identified labs primarily serve.

State regulatory environment: States with permissive drug testing laws have nearly twice the lab density (4.1 per million) as restrictive states (2.3 per million). Permissive states allow broader pre-employment screening and random testing, while restrictive states limit testing to safety-sensitive roles or reasonable suspicion.

Workers' compensation incentives: Many states offer insurance premium discounts (5-10%) to employers with drug-free workplace programs, though these vary by state.

Court and probation testing: Drug testing isn't just for employment — labs also serve criminal justice systems. States with more drug courts or probation programs may need more testing infrastructure.

Data limitations: Our name-matching approach may identify labs more easily in some states than others, especially if naming conventions differ.

A Fragmented Industry

Unlike many healthcare sectors, the drug testing labs we identified appear highly fragmented. The median lab generates only $29,741 in annual revenue through credit card transactions, suggesting small, independent operations.

This fragmentation raises questions about standardization. The Averhealth case that ProPublica reported involved a larger operator, but the prevalence of small independent labs suggests that much of the industry may lack the resources for rigorous quality-control systems.

The Bottom Line

Drug testing lab density varies dramatically across American states, driven by some combination of:

Federal safety testing requirements in specific industries
State laws that make testing easier or harder for employers
Workers' compensation insurance incentives
Criminal justice system demand
Regional workplace culture
Marijuana legalization status

For businesses operating across states, this means navigating wildly different testing norms. For workers, whether you're tested may depend as much on where you work as what you do.

What’s clear is that marijuana legalization alone doesn’t predict testing rates — the story is more complex.

Methodology

Data source: Enigma business intelligence data (brands and operating locations)

Sample size: 1,289 drug testing labs identified via name pattern matching

Identification approach:

Included: Businesses matching drug testing keywords in name (toxicology, specimen, DOT testing, drug screen, etc.)
Excluded: Substance abuse treatment centers, sober living facilities, rehab programs
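A name-matching rule of this kind might look like the sketch below. The keyword lists contain only the terms named above, so they are incomplete by construction. The regular expressions and the `is_drug_testing_lab` helper are hypothetical, not the patterns used in the analysis.

```python
import re

# Keywords taken from the examples above; the real lists are longer ("etc.").
INCLUDE = re.compile(
    r"\b(?:toxicology|specimen|dot testing|drug screen\w*)\b", re.IGNORECASE
)
EXCLUDE = re.compile(
    r"\b(?:treatment|sober living|rehab\w*)\b", re.IGNORECASE
)


def is_drug_testing_lab(name):
    """True when a business name hits an include keyword and no exclude keyword."""
    return bool(INCLUDE.search(name)) and not EXCLUDE.search(name)


# Invented business names for illustration.
names = [
    "Acme Toxicology Lab",
    "Valley DOT Testing",
    "Sunrise Rehab and Drug Screening",
    "Main Street Bakery",
]
matches = [n for n in names if is_drug_testing_lab(n)]
```

Checking the exclusion list after the inclusion list keeps treatment and rehab businesses out even when their names mention screening, at the cost of dropping any genuine lab that happens to share a name with one.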

Key definitions:

Drug testing labs: Facilities advertising drug testing, toxicology, workplace screening, or specimen collection services
Per capita rates: Labs per million residents using 2023 Census population estimates
Marijuana status: Based on recreational legalization status as of 2024
Regulatory categories: PERMISSIVE/MODERATE/RESTRICTIVE based on state employment law classifications (sources: SHRM, Littler, state labor departments)

Limitations:

Cannot verify which industries or employers these labs serve - Pattern matching on business names identifies facilities but not their client mix
May include non-workplace testing - Labs may primarily serve court-ordered, medical, or sports testing rather than employment screening
Incomplete coverage - Name-matching likely misses hospital labs, occupational health clinics, and testing services without relevant keywords
No testing volume data - Facility count doesn't measure actual testing activity or market concentration
Industry employment data not analyzed - Correlation between lab density and industrial composition is inferred, not proven
Simplified regulatory categorization - State laws are complex and don't fit neatly into three categories
Revenue data available for ~40% of labs
Data reflects business registrations as of the snapshot date and may not capture recent closures or openings

External sources:

State marijuana legalization timeline: NCSL/NORML
State drug testing laws: SHRM, Littler, state labor departments
Population estimates: U.S. Census Bureau 2023

When Compliance is Skin Deep

Enigma — Thu, 29 Jan 2026 00:00:00 GMT

How Med Spas Navigate — and Sometimes Evade — Medical Practice Regulations

When New York State investigators inspected 223 medical spas in late 2024, they found violations at 87 facilities (paywall), including expired products, suspected counterfeit injectables, and at one location, controlled substances including fentanyl. According to the Department of State's warning to consumers, 100% of the facilities they inspected were offering medical procedures without proper licensure, and 73% lacked medical oversight during procedures.

Enigma’s analysis of 12,646 med spa businesses nationwide reveals a structural pattern that helps explain these findings: 85.9% of med spas nationwide classify themselves under beauty industry codes rather than medical facility codes, potentially allowing them to avoid the stricter licensing, inspection, and medical director requirements that traditional medical practices face.

The Human Cost

Before examining the data, it's important to understand what’s at stake. Beyond this month’s crackdown in New York, recent incidents documented by major outlets highlight the consequences of inadequate oversight:

Texas (2023): A patient died after receiving IV therapy at a med spa where no licensed medical professionals were present. The unlicensed owner personally administered the treatment.
California (2021): A patient contracted a drug-resistant infection after receiving fat-dissolving and vitamin injections at a Los Angeles med spa, resulting in $2 million in medical debt and permanent scarring after two years of treatment.
Pennsylvania (2023): A court awarded $1.25 million after a nurse with a suspended license performed facial injections at a med spa.

New York’s investigation documented similar risks: unsanitary conditions with used needles in overflowing sharps containers, medications stored in refrigerators alongside staff lunches, and unlicensed practitioners performing procedures that can cause burns, infections, and allergic reactions.

A Rapidly Growing Industry

The med spa industry has grown explosively in recent years, with annual business registrations nearly tripling from 91 in 2020 to 252 in 2024. This growth reflects changing consumer preferences toward aesthetic medicine and wellness services that combine medical procedures with a spa-like experience, in an industry with an estimated $3.5 billion in annual revenue nationwide.

But as the industry has boomed, its regulatory structure has remained ambiguous. Med spas offer medical procedures — injections, laser treatments, chemical peels — that typically require physician oversight. Yet many operate under business structures designed for hair salons and day spas.

The Classification Gap

Our analysis reveals that 85.9% of med spas nationwide — some 10,175 businesses — classify themselves under NAICS code 812 (Personal Care Services: nail salons, spas, and beauty shops) rather than NAICS code 621 (Ambulatory Health Care Services: medical offices and clinics).
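Grouping businesses by the first three digits of their NAICS code is enough to separate the two sectors named above. A minimal sketch, with a hypothetical `naics_bucket` helper and a handful of example codes chosen for illustration:

```python
def naics_bucket(code):
    """Map a NAICS code to the two sectors discussed in the text, or 'other'."""
    code = str(code).strip()
    if code.startswith("812"):
        return "personal_care"      # Personal Care Services
    if code.startswith("621"):
        return "ambulatory_health"  # Ambulatory Health Care Services
    return "other"


def bucket_shares(codes):
    """Share of businesses in each bucket."""
    buckets = [naics_bucket(c) for c in codes]
    return {b: buckets.count(b) / len(buckets) for b in set(buckets)}


# Example codes for illustration, not records from the dataset.
example_codes = ["812199", "812112", "621111", "713940"]
shares = bucket_shares(example_codes)
```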

This matters because business classification often determines:

Which state agency conducts inspections (cosmetology boards vs. medical boards)
What licensing requirements apply to the business and its practitioners
Whether a medical director is required to be present during procedures
What liability insurance minimums are mandated
How frequently facilities undergo safety inspections

Important caveat: Many med spas may legitimately classify as personal care services if they primarily offer non-medical spa treatments. NAICS codes are self-reported, and the classification system wasn't designed for businesses that blur the line between medical practice and wellness services. However, the pattern raises questions about whether current regulatory frameworks adequately protect consumers receiving medical procedures at these facilities.

The New York Pattern: LLCs vs. Professional Corporations

New York provides a clear example of how med spas structure differently from traditional medical practices. In New York and many other states, the Corporate Practice of Medicine doctrine prohibits non-physicians from owning medical practices, requiring doctors to use Professional Corporations (PCs) or Professional Limited Liability Companies (PLLCs).

We analyzed 1,205 traditional medical practices in New York and found that 87.9% use PC or PLLC structures, while only 2.8% use standard LLCs.

Among New York's 618 med spa businesses, the pattern reverses: approximately 59.2% are organized as standard LLCs — a 56.4 percentage point difference from traditional medical practices.
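To make the arithmetic explicit: the 56.4 figure is a difference in percentage points (59.2 minus 2.8), not a relative change. The sketch below shows one way such shares could be derived from entity names. The suffix-based `classify_entity` helper and the sample names are hypothetical illustrations, not Enigma's actual classification method, which draws on entity type data from corporate filings.

```python
from collections import Counter


def classify_entity(name: str) -> str:
    """Classify a business by the legal suffix at the end of its name (assumed rule)."""
    tokens = name.split()
    if not tokens:
        return "other"
    # Normalize the last token: "P.C." -> "PC", "L.L.C." -> "LLC"
    suffix = tokens[-1].replace(".", "").replace(",", "").upper()
    if suffix in {"PC", "PLLC"}:
        return "PC/PLLC"
    if suffix == "LLC":
        return "LLC"
    return "other"


def share(names: list[str], label: str) -> float:
    """Percentage of names that fall under the given label."""
    counts = Counter(classify_entity(n) for n in names)
    return 100.0 * counts[label] / len(names)


def percentage_point_gap(share_a: float, share_b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(share_a - share_b, 1)


# Hypothetical sample names, for illustration only
sample = ["Glow Med Spa LLC", "Lux Aesthetics LLC", "Park Dermatology, P.C.", "Derm Care PLLC"]
llc_share = share(sample, "LLC")

# Shares reported above for New York med spas vs. traditional practices
ny_gap = percentage_point_gap(59.2, 2.8)
```

A name suffix is only a rough proxy. As the next paragraph notes, an LLC may sit alongside a physician-owned PC in an MSO arrangement, so the label alone says nothing about compliance.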

Context matters: Not all LLC structures violate the law. Many med spas use Management Services Organizations (MSOs) — legitimate arrangements where an LLC handles business operations while a separate PC owned by a physician provides medical services. Some national chains structure each location this way to comply with state regulations.

However, a joint investigation by the NYC Council and state agencies found that many med spas fail to maintain proper separation between business and medical entities, lack physician supervision despite offering medical procedures, and operate without the required medical director oversight.

From California to Florida: A Nationwide Pattern

The structural gap persists nationwide, even in states with different regulatory approaches:

Texas (1,575 med spas): Strict Corporate Practice of Medicine laws, yet high LLC usage
Florida (1,469 med spas): No Corporate Practice of Medicine restrictions, allowing direct LLC ownership
California (1,278 med spas): Strict regulations, ongoing enforcement challenges
New York (618 med spas): Strong laws, limited inspection resources

LLC usage among new med spa registrations has consistently ranged from 68.7% to 77.4% annually — far higher than traditional medical practices — regardless of whether the state prohibits or permits non-physician ownership.

This suggests the pattern reflects something broader than state-by-state regulatory variation: an industry-wide approach to business structuring that may prioritize operational flexibility and investor access over compliance with medical practice regulations.

Why This Matters

The structural patterns we observe — beauty industry classification, high LLC usage in states that restrict it, and inconsistent entity naming — suggest systematic challenges in how med spas are regulated:

Regulatory gaps enable risky practices: When businesses offering medical procedures classify as beauty services, they may evade medical board oversight and physician supervision requirements.
Enforcement is inconsistent: State medical boards typically focus on individual practitioners, not business structures, leaving compliance largely to chance.
Consumers lack transparency: Patients may not realize their provider operates outside traditional medical oversight or that their injector lacks appropriate credentials.
Investment incentives conflict with medical regulations: The med spa model attracts private equity and franchise investment, but many states prohibit non-physician ownership of medical practices for patient protection reasons.

As New York Secretary of State Walter Mosley stated in the warning to consumers: “Unlicensed or unqualified staff, dirty needles, expired or counterfeit drugs … can lead to serious injury or even death.”

The Path Forward

The med spa industry’s rapid growth has outpaced regulatory frameworks designed for traditional medical practices or beauty salons. While many med spas undoubtedly operate responsibly through compliant MSO structures or in states that permit their business models, the enforcement actions in New York and elsewhere reveal systematic problems.

Several states have begun addressing these gaps:

New York: Expanded inspection authority and clearer medical director requirements
Texas: Increased penalties for unlicensed practice of medicine
California: Enhanced enforcement against corporate practice of medicine violations

But comprehensive reform will require clearer federal guidance on when aesthetic procedures require physician oversight, standardized definitions of what constitutes a “medical spa” versus a “beauty spa,” and adequate state resources to inspect facilities offering medical procedures regardless of how they classify themselves.

Until then, consumers should verify that their med spa employs appropriately licensed practitioners, maintains a qualified medical director, and operates under the same oversight as any facility offering medical procedures — because the structural classification may not reflect the medical reality of the services provided.

Methodology

This analysis is based on Enigma's business intelligence data, examining 12,646 businesses with “med spa,” “medical spa,” or “medspa” in their registered business names.
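As a rough illustration of the name-based selection described above, the following sketch matches the three spellings with a single regular expression. The pattern, the `is_med_spa` helper, and the tolerance for arbitrary whitespace between "med" and "spa" are assumptions made for this example, not the query Enigma actually ran.

```python
import re

# Matches "med spa", "medical spa", and "medspa", case-insensitively.
# The leading \b keeps the match from starting in the middle of a longer word.
MED_SPA_PATTERN = re.compile(r"\bmed(?:ical)?\s*spa", re.IGNORECASE)


def is_med_spa(business_name: str) -> bool:
    """Return True when the registered name contains one of the three spellings."""
    return MED_SPA_PATTERN.search(business_name) is not None


# Hypothetical business names, for illustration only
examples = [
    "Glow Med Spa LLC",
    "Radiance MedSpa",
    "Uptown Medical Spa & Wellness",
    "Main Street Day Spa",
]
matches = [name for name in examples if is_med_spa(name)]
```

A name match like this misses med spas that do not use the phrase in their registered name, so the selection likely undercounts rather than overcounts.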

Data sources: Business registration records from state databases, self-reported NAICS classifications, transaction-based revenue estimates, and entity type analysis from corporate filings.

Validation approach: Our national count of 12,646 med spas aligns with industry estimates from the American Med Spa Association (AmSpa).

Important limitations:

NAICS codes are self-reported and may not reflect actual regulatory oversight
Entity structure alone does not indicate legal compliance or non-compliance
Many legitimate business arrangements (MSOs, franchise models) may appear as LLCs
State-level aggregations combine businesses operating under different local regulations
Revenue estimates based on card transaction data may not capture full business activity
Analysis does not examine individual compliance with state-specific requirements

Filed Under Fiction

Enigma — Thu, 15 Jan 2026 00:00:00 GMT

Inside the Unvalidated World of State Business Registrations

In August 2023, a professional corporation called Lemonade Lagoon was registered in Utah. Its leadership roster reads like a fever dream of American celebrity:

Donald J. Trump, President
Elon Musk, Vice President
Melania Trump, Officer
Joseph Rogan, Secretary
Edward Snowden, Officer
Jaron Lanier, Officer

The company lists addresses spanning Palm Beach, Beverly Hills, Los Angeles, and Austin. It remains active today.

Sure, it's possible that this wide-ranging group of prominent individuals — including the current president, his wife, a tech billionaire, a leading podcaster, a whistleblower living in Russia, and a pioneer of virtual reality — have come together for a lemonade-focused professional endeavor. Or perhaps the real officers of this corporation were having a bit of fun with their paperwork.

Utah accepted the filing without question. The state issued a file number, recorded the officers, and made it official.

Welcome to the strange world of state business registrations, where legitimate filings may contain highly suspicious names. Where Mickey Mouse can serve as CFO, where Bill Gates and Elon Musk can co-found a school for "estoric sciences" in rural Missouri, and where someone can register Slappy Sammies Sticks for Kicks LLC with John Doe as a member.

These aren't data errors or clerical mistakes. They're real filings in official state corporate registries — documents formatted like authoritative records but containing information that may be false, unproven, and even ridiculous.

The Hall of Corporate Absurdity

State corporate registries are the foundation of business identity in America. When a company incorporates, its officers go on the record — names that banks will check, regulators will reference, and compliance teams will verify. The assumption is that these names mean something. But sometimes, they don’t.

We analyzed over 103 million business registrations across all 50 states and found companies that listed celebrity names, fictional characters, or obvious placeholders as corporate officers. Here are some of the most remarkable:

The All-Star Lemonade Venture

Lemonade Lagoon A Professional Corporation, registered in Utah in August 2023, assembles perhaps the most implausible executive team ever filed with a state registry. Donald Trump serves as President, Elon Musk as Vice President, Joe Rogan as Secretary, with Melania Trump, Edward Snowden, and virtual reality pioneer Jaron Lanier filling out the officer roster. The filing lists seven different addresses across five states, from Palm Beach to Beverly Hills. File number 13525371-0144 remains active and in good standing.

The Tech Billionaire Partnership That Never Was

The School of AI and Estoric Sciences (note the creative misspelling of "Esoteric") listed Bill Gates and Elon Musk as incorporators for a company in Greenfield, Missouri. The two tech billionaires are competitors who have never been business partners, yet according to this record they co-founded a school teaching mystical sciences in a town of 1,400 people. Missouri accepted the 2020 filing, but the company was later dissolved, having never operated.

Donald Trump's Occult Venture

The Occult Defense Department Co. was registered in Texas in September 2025, listing Donald Trump as a director. Whatever supernatural threats this company was meant to defend against, it never got around to it. No operating locations. No revenue. No actual business. But it has an official Texas file number and remains in good standing.

Slappy Sammies and John Does

In April 2022, someone registered Slappy Sammies Sticks for Kicks LLC in Arizona. The absurd name sailed through without question. So did the officers: John Doe and John Doe Elliott. Arizona accepted the paperwork, issued a file number, and made it official. The company remains active.

A Royal Tech Venture

In December 2020, someone registered Mother Queen Elizabeth I Preventive Tech-Magicare, APP in California. The name alone — combining deceased British royalty, preventive care, magical healing, and "APP" — suggests this wasn't a serious business venture. The registered agent? Bill Gates. California accepted the filing without comment.

The Disney Executive Suite

A company called Auga Properties, LLC operated in Georgia with Mickey Mouse listed as Chief Financial Officer and Minnie Mouse as Corporate Secretary. The CEO appears to be a real person. The Disney characters are not. Georgia accepted the filing in 2006 and let it stand for years.

The Placeholder Executives

The Little White Plastic Spoon Ministries, Inc., registered in Georgia, appointed John Doe as CEO, CFO, and Secretary simultaneously. One placeholder name holding every executive position. The state approved it.

A Nationwide Phenomenon

This isn't isolated to a few permissive states. We found celebrity officer filings across the country. In our search for examples of the most ridiculous filings, Florida appeared most frequently, followed by California and Texas, although admittedly this reflects where we found the best stories, not a comprehensive fraud rate.

The categories of fake officers break down across several types: obvious placeholders like "John Doe" or "Jane Doe," controversial figures (Jeffrey Epstein, Ghislaine Maxwell), politicians (Donald Trump, Joe Biden), business and tech leaders (Elon Musk, Jeff Bezos, Mark Zuckerberg), entertainment celebrities (Kanye West, Kylie Jenner), and fictional characters (Mickey Mouse, Harry Potter, Darth Vader).

What's remarkable isn't just that these filings exist; it's that states continue to accept them. Slappy Sammies is still active in Arizona. Lemonade Lagoon remains in good standing in Utah. Occult Defense Department Co. was registered just months ago in Texas.

The Pattern Behind the Absurdity

Nearly all of these companies share a common trait: they never became active businesses.

When we cross-referenced the 455 companies we found with suspicious officers against Enigma's database of 32 million+ operating businesses in the United States, we noticed that 449 out of 455 (98.7%) have zero operating signals. No physical locations. No revenue. No employees. No customer reviews. No web presence. No evidence they ever conducted business.

They exist only as paperwork — legal entities on state registries with no commercial reality behind them.

The few exceptions are telling. A small number of companies in our dataset do have business activity, but closer inspection suggests these may be legitimate businesses run by people who happen to share names with celebrities — a regular person named Michael Jordan or Will Smith, not the celebrity. The distinctive names — Elon Musk, Beyonce Knowles, Mickey Mouse — overwhelmingly appear in paper-only entities that never operated.

This pattern suggests these filings serve various purposes, none of them involving actual business:

Placeholder registrations - Formation services or lawyers creating entities that will later be transferred to real clients, using "John Doe" as a temporary name.

Abandoned ventures - Someone started the registration process as a joke or test and never finished.

Name parking - Reserving business names or entity structures for future use (or preventing others from using them).

Identity testing - Seeing whether states will accept obviously fake information (they often will).

Whatever the intent, the result is the same: state registries are filled with formally recognized businesses that list officers who don't exist or never consented to the role.

Why It Matters for Business Verification

For compliance professionals, these phantom executives aren't just amusing curiosities. They're red flags about data quality.

If a business lists Mickey Mouse as CFO, what else on that registration might be fabricated? If states accept "John Doe" without verification, what about the business address? The revenue projections? The stated business purpose?

State corporate registries were designed as official records, but they function more like bulletin boards where anyone can post anything. The information is formatted like authoritative data — file numbers, issue dates, officer titles, registered agents — but in many cases it's fundamentally unverified self-reporting.

This is precisely why Know Your Business (KYB) verification exists. The official record is a starting point, not a conclusion. Companies need to cross-reference registration data against operating signals:

Does this business have revenue?
Does it have physical locations with employees?
Does it have a web presence?
Do customers review it?
Does it process payments?

In the vast majority of cases we identified, companies listing phantom executives fail every test. This gap between registration and reality reveals something important about business data: official doesn't mean verified.
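The checklist above can be expressed as a small data structure. This is a minimal sketch under stated assumptions: the `OperatingSignals` class and its field names are hypothetical, one flag per question, and they do not correspond to any actual Enigma schema or API.

```python
from dataclasses import dataclass


@dataclass
class OperatingSignals:
    """One flag per question in the checklist above (field names are hypothetical)."""
    has_revenue: bool = False
    has_staffed_location: bool = False
    has_web_presence: bool = False
    has_reviews: bool = False
    processes_payments: bool = False

    def signal_count(self) -> int:
        # Booleans sum as 0/1, so this counts the checks that passed
        return sum(vars(self).values())


def is_paper_only(signals: OperatingSignals) -> bool:
    """A registration with no operating signals at all."""
    return signals.signal_count() == 0


def paper_only_share(records: list[OperatingSignals]) -> float:
    """Percentage of records with zero operating signals, to one decimal place."""
    paper_only = sum(is_paper_only(r) for r in records)
    return round(100.0 * paper_only / len(records), 1)


# Hypothetical records: one phantom filing, one operating business
phantom = OperatingSignals()
operating = OperatingSignals(has_revenue=True, has_web_presence=True)
```

Applied to a set of 455 records in which 449 have no signals, `paper_only_share` returns 98.7, the figure reported above.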

The real problems arise when downstream systems treat unverified registration data as verified fact. When a bank, vendor, or partner looks up a business in the state registry and assumes the listed officers are real. When compliance software auto-populates fields with officer names that were never validated. When "official state records" becomes synonymous with "true information."

And many of these suspicious filings remain active today. The records are real. The file numbers are real. But the officers? The business itself? That requires a closer look.

Methodology

Data source: Enigma business intelligence data as of January 2026

Dataset: Analysis of 103+ million registered business entities across all U.S. states

Sample: 455 companies with celebrity names, fictional characters, or obvious placeholders listed as corporate officers

Verification approach:

Cross-referenced company names with state corporate registry records
Confirmed officer names and filing details through official state databases
Checked for operating signals (revenue, locations, employees) in Enigma's database of 32+ million active U.S. businesses
Excluded ambiguous common-name matches (e.g., "Michael Jordan" could be someone other than the famous athlete)
Focused on distinctive names unlikely to be coincidental (Elon Musk, Mickey Mouse, Queen Elizabeth I)
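The last two steps can be sketched as a simple lookup. The reference sets below contain only names mentioned in this article, and the `screen_officer` function and its three outcome labels are hypothetical. The actual screening presumably used much larger lists and some manual review.

```python
# Hypothetical reference lists built from names mentioned in this article
SUSPICIOUS_NAMES = {"elon musk", "mickey mouse", "john doe", "jane doe"}
AMBIGUOUS_NAMES = {"michael jordan", "will smith"}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so 'Mickey  MOUSE' matches 'mickey mouse'."""
    return " ".join(name.lower().split())


def screen_officer(name: str) -> str:
    """Return 'flag', 'exclude', or 'pass' for a listed officer name."""
    key = normalize(name)
    if key in SUSPICIOUS_NAMES:
        return "flag"      # distinctive or placeholder name, unlikely to be coincidental
    if key in AMBIGUOUS_NAMES:
        return "exclude"   # common name that a real person could plausibly share
    return "pass"
```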

Key finding: 449 of 455 companies surfaced using this method (98.7%) showed zero operating signals—no revenue, no physical locations, no employees, no web presence

Limitations:

Sample selected for clear examples of absurd filings, not statistical representation
Cannot definitively prove intent behind false officer names
State distribution reflects where we found the best examples, not comprehensive fraud rates
Some legitimate businesses may employ people who coincidentally share celebrity names

Geographic distribution: Companies identified across multiple states, with Florida, California, and Texas appearing most frequently in our curated examples

All company file numbers and formation dates are available upon request.

2025: The Year Enigma Went Back to Our Future

Enigma — Tue, 23 Dec 2025 00:00:00 GMT

When we first showed Enigma to the world in 2013, we had seen what was possible when small details got stitched together from data already out in the world. Search for a company and discover datasets you didn't know existed. Stumble onto something unexpected that revealed how the business actually operated. See the detailed, multifaceted activities of a business or the behaviors of the whole economy, all from data that already existed but needed to be connected to be fully understood.

That was the original promise. Take the scattered filings and registrations and contracts that governments publish, connect them, and let people discover things they didn't know to look for.

Over the years, Enigma tuned our data and our architecture to focus on specific problems. We built KYB verification that financial institutions relied on. We assembled sales and marketing data that helped companies find undiscovered prospects. These were real products that solved real problems. But somewhere along the way, we lost something in our core products. The ability to see the connected picture. The unexpected discovery and the capability to see exactly what a business was doing. The moment where you click through three datasets and suddenly understand something about how a business actually operates.

By the end of 2024, we knew we had to make a choice. Keep building on top of an architecture that made each new thing harder than the last, or go back to the foundation and rebuild.

We rebuilt.

graph-model-1 debuted in March. It's a new data architecture that represents how businesses actually exist: a brand identity that customers see, a legal entity that files paperwork, physical locations where things happen, and the connections between all of them. Your corner coffee shop has a name on the awning, an LLC on file with the state, and maybe a second location across town under the same ownership. These are all the same business. Our data model now knows that.

When we publicly launched graph-model-1 in July for all users, the new Enigma Console launched alongside it. You can build lists using semantic search, the way you'd actually describe what you're looking for rather than hunting through filters. You can upload a file and get it enriched with revenue signals, industry classifications, ownership data. And when you pull up a business, you see the connected picture again. The legal entity, the DBAs, the locations, the economic activity. What we could show in a demo twelve years ago, but now at scale, and extraordinarily accurate.

The KYB API got rebuilt on top of this new foundation. More verifications resolve automatically. Fewer get kicked to manual review. The data is fresher because the underlying system was designed for freshness from the start, not patched to support it.

Those were the products we already had, made better. But the rebuild also made new things possible.

We built the Enigma AI Connector, a way for AI models to query Enigma directly using MCP: from graph-model-1 with its rich identity and payment transaction data, through to the complete records of the Government Archive. This matters because language models are bad at business identity. They'll tell you confidently that a business is at an address it left two years ago, or mix up two companies with similar names. When they can query current, verified data instead of guessing from training data, they stop guessing and start knowing. The improvement isn't marginal.

And of course, the Government Archive is now live. It's the government data corpus we've been assembling for years, and it goes far beyond corporate registrations: professional licenses, enforcement actions, permits, contracts, shipping records, registered plans, and more that businesses generate as they operate and governments record as they work with these entities. It's all connected to the right business, so it's just a matter of querying for what you need.

The first thing coming in 2026: On-Demand Attributes. You tell us what you need to know about a business. We find it in the government records. Custom structured research, from authoritative and trusted data, delivered programmatically.

Twelve years ago, we showed what was possible when public data gets connected. This year, we rebuilt Enigma to bring those capabilities to AI-backed systems and experts augmented with AI tools. We’re going to keep doing it, because creating trusted business data is who we are. Thanks for building with us.

The Holiday Spending Cycle

Enigma — Tue, 23 Dec 2025 00:00:00 GMT

Everyone knows December is the peak holiday shopping season—and for most retailers, that's true. Electronics stores see revenues surge +165%, jewelry retailers jump +84%, and gift shops boom +67%.

But while retail shops overflow with shoppers, a different story unfolds across town: hotels sit half-empty (-25%), movie theaters go dark (-75%), and casinos struggle (-69%). America’s holiday spending patterns aren't just seasonal — they’re bifurcated, with some industries soaring while others crash.

And the real surprise? The January-February post-holiday reckoning hits both groups: gift retailers see their gains evaporate in a brutal revenue crash, while hospitality and entertainment struggle to recover from their December slump.

Enigma’s data reveals a nuanced picture of how American spending patterns shift across the calendar year — and it challenges some conventional assumptions about when and where consumers actually spend the most.

Using transaction data covering 410,001 holiday-sensitive businesses across the United States during the five years spanning 2020 through 2024, we analyzed monthly spending patterns to understand the true economics of seasonal retail. What we found offers a revealing picture of America's annual spending cycles.

December’s Surge — In Context

The December spending spike is real. From 2020 to 2024, December revenue in holiday-sensitive industries averaged 8.9% above the monthly average, with transaction counts up 3.2%. That's $349.8 billion in December revenue in 2024 alone.

Then comes the post-holiday slump: spending follows a predictable pattern when consumers reset their budgets in January and February. Revenue in holiday-sensitive industries runs 13.2% below the annual monthly average. This isn’t a collapse; it's a recalibration after December’s elevated spending. Consumers who splurged on gifts and celebrations in December naturally scale back in the following months.

The full picture:

December boost: +8.9% for one month
January-February adjustment: -13.2% for two months
Annual pattern: Spending concentrates in December and summer, moderates in winter and spring

Retailers who understand this full cycle — rather than focusing solely on December — can better plan inventory, staffing, and promotions throughout the year. The January-February slowdown is a regular pattern to anticipate.

Where the Holidays Hit Hardest

The December surge and January slowdown vary dramatically by geography, revealing fundamental differences in local economies.

Top states by December revenue lift (2020-2024 average):

Tennessee: +115.6%
Alaska: +107.9%
West Virginia: +68.0%

Meanwhile, states like California and Texas show December lifts of just 15-20%, suggesting more stable year-round spending patterns.

What drives these differences? The pattern defies conventional wisdom. Suburban retail hubs like Plano, TX and Arlington, VA see massive December lifts of +155% and +128% respectively, as residents flock to shopping centers for holiday purchases. Meanwhile, major tourism destinations like Las Vegas and Orlando actually see December declines of 14% and 9%, suggesting holiday travelers stay home with family rather than visiting entertainment capitals.

Industries With the Strongest Holiday Patterns

Some industries ride the holiday wave. Others maintain steady rhythms regardless of the calendar.

Biggest December winners (industries with highest December lift):

Electronics and Appliance Retailers: +165.1%
Sporting Goods, Hobby, Musical Instrument, and Book Retailers: +92.9%
Jewelry Retailers: +83.7%
Gift, Novelty, and Souvenir Retailers: +67.1%

Jewelry retailers, gift shops, and specialty stores see predictable December surges tied to gift-giving traditions. But the data also reveals surprises: personal care and beauty services see a +64% December lift — salons and spas benefit from holiday party preparations and gift certificate purchases. Meanwhile, pharmacies see a +40% boost, likely driven by seasonal cold and flu medication needs.

Industries with minimal seasonal variation:

Limited-Service Restaurants: 0.0%
Convenience Stores: -1.5%
Supermarkets and Grocery Stores: +11.7%

Some sectors — particularly essential services like fast food, convenience stores, and grocery — show minimal December fluctuation, demonstrating stable baseline demand that resists seasonal shifts. These industries benefit from predictable revenue streams year-round.

The industries with the strongest December peaks also tend to see the sharpest January declines. Electronics retailers (+165% in December) and jewelry stores (+84%) face the steepest post-holiday crashes as gift-buying ends abruptly. In contrast, accommodation and entertainment industries that decline in December (-90% and -69% respectively) see January recoveries as travelers return and normal leisure spending resumes. This creates planning challenges but also opportunities for retailers who can anticipate and prepare for these swings.

The +64% December lift in personal care and beauty services comes close to that of gift, novelty, and souvenir retailers (+67%). Holiday party season drives demand for salon appointments, spa treatments, and gift certificates. Beauty salons (+14%) and nail salons (+12%) see smaller but steady December gains as consumers prepare for celebrations and purchase pampering experiences as gifts.

The Counter-Cyclical Exception

Not every industry rides the December wave. Some sectors see dramatic declines as consumer behavior shifts:

December losers (industries that drop in December):

Accommodation: -90%
Motion Picture and Sound Recording: -75%
Amusement, Gambling, and Recreation: -69%
Hotels and Motels: -25%

Why the crash? During the holidays, Americans stay home with family rather than traveling for leisure or entertainment. Las Vegas casinos, movie theaters, and hotels see revenues plummet while gift shops boom. These industries face the opposite challenge: surviving a December slump before January recoveries.

For accommodation and entertainment businesses, the “holiday season” isn’t a peak — it’s a trough to weather.

The Full Picture

Yes, December is the holiday shopping season. Revenue surges, driven by gift-giving traditions and year-end celebrations. But zoom out and you'll see summer months consistently outperform December in absolute dollars, and January-February’s decline is a predictable recalibration, not a crisis. That’s the real story of American retail: not a single make-or-break month, but a predictable cycle of seasonal spending patterns.

Methodology

This analysis uses Enigma’s proprietary transaction data covering 410,001 operating locations in 105 holiday-sensitive industries from January 2020 through December 2024. Industries include retail (clothing, electronics, jewelry, gifts, general merchandise), food and beverage (restaurants, bars), accommodation (hotels), entertainment, and personal services. We focused on businesses where holiday shopping behavior drives significant seasonal variation.

Monthly revenue lift is calculated as the percentage deviation from the annual monthly average. State and city analyses are filtered to major markets with sufficient data coverage. All dollar figures are inflation-adjusted to 2024 levels.
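The lift calculation can be illustrated with a short sketch. It assumes the "annual monthly average" is the plain mean of the twelve monthly totals in the same year. The `monthly_lift` function and the revenue figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def monthly_lift(monthly_revenue: dict[int, float]) -> dict[int, float]:
    """Percentage deviation of each month's revenue from the annual monthly average.

    monthly_revenue maps month number (1-12) to revenue for that month.
    """
    average = sum(monthly_revenue.values()) / len(monthly_revenue)
    return {
        month: 100.0 * (revenue - average) / average
        for month, revenue in monthly_revenue.items()
    }


# Hypothetical index values: a slow January-February, a steady
# March-November, and a strong December. The average works out to 100.
revenue = {1: 80, 2: 80, **{m: 100 for m in range(3, 12)}, 12: 140}
lift = monthly_lift(revenue)
```

With these made-up numbers, `lift[12]` is 40.0 and `lift[1]` is -20.0: December sits 40% above the average month and January 20% below it.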

Match Points: A Field Guide to Business Identity

Enigma — Thu, 18 Dec 2025 00:00:00 GMT

Closing the gap between paper presence and operating signals

On paper, most risky businesses look ordinary. A storefront has a familiar brand on the awning, a tidy LLC on file in Delaware, and a mailbox that never misses a certified letter. An aggregator submits a neat spreadsheet of sub-merchants. A healthcare supplier shows a string of active registrations. Then the money starts moving — and only later do investigators learn the brand was a franchisee with no real operations, the sub-merchants were shells, and the supplier's “office” was a commercial mail receiving agency.

These failures result from treating registration artifacts as proof of a living business.

Modern KYB work lives in the gap between what a company says it is, what the filings imply, and what the operating signals can actually support. That gap is widest in familiar patterns: sole proprietors trading under DBAs, clusters of entities at the same registered-agent or virtual address, franchise locations whose legal entities don’t match the brand on the door, and networks that reuse phones, domains, and maildrops to obfuscate control.

This article looks at where KYB breaks down in the real world — and how to close those breaks with signals that reflect how businesses actually operate.

The Paper-Only Problem

239,000 brands in Enigma’s business graph have valid registration artifacts — an LLC filing, a registered agent address, an “active” status — but show zero operating signals. No card revenue in the past 12 months. No customer reviews. No phone number or website. No open locations.

Some of these “paper-only” businesses are legitimately dormant — side projects, holding companies, structures awaiting activation. But many are the raw material of fraud: shells used to launder transactions, nominees that exist only to receive certified mail, or application-mill artifacts designed to pass automated onboarding checks.

The problem compounds when onboarding teams have seconds, not hours, to decide. A Delaware LLC with a registered agent and an “active” status looks identical whether it’s a real operating business or a front. Standard KYB checks — registry lookup, OFAC screening, address validation — are often not enough to separate paper presence from operating reality.

Where Risk Hides: Address Clusters

Registration addresses can lie — not always, but often enough to matter.

Enigma’s data reveals that 934 likely registered-agent or commercial mail receiving agency (RA/CMRA) addresses host extreme concentrations of brands, with one location hosting 1,913 brands. These locations host hundreds of entities at the same mailbox, yet only 21% of those brands show card revenue in the past year.

These addresses are friction reducers. They let a new business incorporate quickly, receive legal mail reliably, and maintain a clean paper trail. They also generate opacity. When 300 LLCs share the same suite in Wilmington, Delaware, distinguishing the legitimate businesses from the shells requires more than a registry check.

Enigma’s address intelligence flags these patterns automatically: registration vs. operating address splits, virtual/CMRA indicators, residential vs. commercial classification, and deliverability context. When a merchant’s operating address is a real storefront in Brooklyn but its registration address is a mail drop in Dover, that mismatch is a signal — not proof of fraud, but enough to escalate the review.

But here’s the highest-risk pattern: entities that cluster at RA/CMRA addresses and show no operating presence anywhere else. These are candidates for the “shell merchant” pattern that enables transaction laundering and aggregator blind spots.

Industry Risk: Where Paper Entities Concentrate

Not all industries carry the same KYB risk. Some sectors — religious institutions, cemeteries, certain holding structures — legitimately operate without generating card revenue or online reviews. Others show unusually high paper-only rates for less obvious reasons.

Among industries with 5,000+ registered entities:

Religious institutions have a 24.8% paper-only rate
Cemeteries have a 20.3% paper-only rate
Housing developments have an 11.5% paper-only rate

High paper-only rates don’t always signal fraud — they often reflect legitimate business models where standard operating signals are sparse. But they do signal onboarding risk: these sectors require manual review because automated checks can’t distinguish dormant-but-legitimate from never-was-real.

Healthcare is a particularly complex case. 3,412 addresses host 5+ healthcare-related brands each, with a 33% revenue rate. Some of these are legitimate medical office buildings or hospital campuses. Others match the profile of nominee billing operations — entities spun up rapidly, clustering at shared addresses, showing minimal revenue or patient-facing activity.

Enigma’s edge: Cross-referencing address clusters with operating signals (revenue, reviews, multi-location presence, licensing context) helps distinguish real medical facilities from paper entities designed to submit claims.

The PPP-Era Surge

The 2020-2021 PPP era offers a natural experiment in business formation patterns. During those two years, 974,000 new brands were formed — 21% more than in the prior two-year period.

Counterintuitively, the paper-only rate declined during the PPP era, from 2.1% in 2018-2019 to 1.4% in 2020-2021. This likely reflects:

Many legitimate businesses forming to access relief funds
Fraudulent entities generating some operating signals to pass verification
Older cohorts having more time for legitimate businesses to go dormant

The fraud signal is new formation + virtual address + no operating signals + sudden transaction activity — a multi-factor pattern that requires layered screening.

Zombie Entities: When Old Businesses Come Back to Life

Perhaps the most insidious pattern is the zombie entity: a business registered years ago, dormant for its entire existence, then suddenly showing transaction activity.

Among entities 10+ years old, 2.5% match the dormant profile: valid registration, zero operating signals, and no historical revenue or reviews. That’s 140,000 potential zombie entities in the 10+ age cohort alone.

This pattern is consistent with:

Purchased shelf corporations used for PPP fraud or bank account fraud
Business email compromise schemes impersonating dormant entities
Money laundering operations using aged entities with clean histories

Enigma’s edge: Operating signal timelines reveal when an old entity suddenly “comes to life” — a pattern invisible to static registry checks but obvious when you track revenue, reviews, and transaction history over time.

Rapid Proliferation: The Check-Cashing Shell Pattern

Twenty-seven US addresses show rapid entity proliferation — 10+ brands formed within a 3-year window, clustering at the same location. On average, these addresses have only 35% of brands showing card revenue.

This temporal clustering pattern is consistent with:

Check-cashing and funnel account schemes that cycle through shell entities
Application mill operations generating entities at scale
Nominee billing networks in healthcare and other sectors

Enigma’s edge: Temporal clustering + operating signal correlation identifies suspicious networks that look legitimate when viewed individually but reveal their structure when analyzed as a graph.

Contact Graph Signals: Shared Phones, Domains, and Registered Agents

Multiple brands sharing the same phone number or website can indicate:

Legitimate: Franchise systems, management companies, shared services
Suspicious: Nominee structures, shell networks, white-label fraud operations

When combined with other risk signals — RA/CMRA addresses, paper-only status, rapid formation — contact reuse becomes a key indicator of coordinated networks designed to evade KYB controls.
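Contact reuse can be surfaced with a simple grouping step. The sketch below is illustrative only: the `(brand_id, phone)` input shape, the digit-only normalization, and the `min_brands=3` threshold are assumptions, not Enigma's actual schema or cutoffs.

```python
from collections import defaultdict


def shared_contacts(brands, min_brands=3):
    """Group brands by normalized phone number.

    brands: iterable of (brand_id, phone) pairs (hypothetical input shape).
    Returns {normalized_phone: sorted brand ids} for numbers used by
    at least `min_brands` distinct brands.
    """
    by_phone = defaultdict(set)
    for brand_id, phone in brands:
        # Keep digits only so "(302) 555-0100" and "302-555-0100" compare equal.
        digits = "".join(ch for ch in phone if ch.isdigit())
        if digits:
            by_phone[digits].add(brand_id)
    return {p: sorted(ids) for p, ids in by_phone.items() if len(ids) >= min_brands}
```

A shared number is not suspicious by itself (franchises and management companies share phones legitimately), so the output is best treated as an input to the combined review described above.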

Franchise Confusion: The Right Brand, The Wrong Legal Entity

26,000 brand names appear with 3+ different legal entities, suggesting franchise structures, DBAs, or multi-entity operations.

The onboarding risk: Payment processors may onboard “McDonald's LLC #472” thinking it’s the corporate entity, when it’s actually an independent franchisee with different risk characteristics. Merchant acquirers may verify a storefront showing “Subway” without realizing the legal entity is “John’s Sandwich Holdings LLC” — a structure that obscures beneficial ownership and complicates chargebacks.

Enigma’s Brand Search Model resolves storefront → brand → legal entity, ensuring the correct entity is being onboarded. DBA mapping shows when “John's Pizza” is actually registered as “ABC Holdings LLC” — critical for beneficial ownership verification and UCC filings.

The V2 Playbook: Signals That Close the Gaps

Enigma’s KYB v2 is designed for these edge cases. The core capabilities:

1. Person + TIN Verification

Person verification (including SSN where permitted) to close the sole-proprietor gap
TIN context (EIN vs SSN) so reviewers know whether they’re looking at an entity or an individual

2. Address Intelligence

Registration vs operating address split detection
Virtual/CMRA and residential/commercial flags
Deliverability and suite normalization for accurate matching

3. Brand-from-Location Resolution

Brand Search Model links storefronts to the right brand and legal entity
DBA mapping reveals when operating names differ from legal registrations
Franchise indicators flag when a local LLC operates under a national brand

4. Confidence Tiers + Reason Codes

Match confidence scoring for automated review routing
Reason codes explain escalations (name mismatch, address risk, missing operating signals)
Audit trails make decisions explainable and defensible

5. OFAC Integration

Entity + person screening against OFAC watchlists
In-flow sanctions checks keep compliance in the same workflow as identity verification
Cross-alias matching for sanctions-evasion detection

A Practical Appendix: KYB Failures and the Signals Enigma Uses to Catch Them


Failure mode: Shell merchants (transaction laundering)
Enigma V2 signals: Brand-from-location; virtual/CMRA; registration vs operating mismatch; web presence thin; phone/email reuse; high-risk MCC + volume patterns; rapid merchant churn; address density
Auto-review when: Operating address missing or virtual-only; MCC + category mismatch; no credible website; contact reuse across many merchants; abrupt volume spike; high refund/chargeback proxy signals

Failure mode: Aggregator sub-merchants (hidden high-risk)
Enigma V2 signals: Location→brand/legal step-up; residential/commercial mismatch; web footprint vs claimed product; contact networks (shared phone/email); sub-merchant clustering at same RA/CMRA; rapid formation
Auto-review when: No operating signals; recycled contacts across many entities; multiple brands tied to one location/person; thin web presence; high-risk vertical indicators; repeated CMRA

Failure mode: PPP-style mills (mass fraudulent applications)
Enigma V2 signals: Person verification (SSN); formation date + operating signals gap; business age; industry mismatch; multi-entity linkage; address density; virtual address; bank account reuse signals (if available)
Auto-review when: New entity + mailbox + no operating site + thin web; multiple similar entities sharing contact; anomalous payroll/volume claims vs operating evidence; repeated addresses

Failure mode: Healthcare roll-ups (nominee DME/labs)
Enigma V2 signals: Person↔business linkage; operating vs registered addresses; licensure/certification (if integrated); high-risk location clusters; shared phones/emails; rapid NPI-like proliferation proxies; corporate structure complexity
Auto-review when: Many entities share RA/CMRA and phones but lack operating evidence; newly formed healthcare entities with high billing volume indicators; address density; inconsistent branding

Failure mode: Sanctions-evasion fronts (trading/trans-shipment)
Enigma V2 signals: OFAC screening (entity/person); cross-alias matching; jurisdiction risk; shipping/industry codes vs operating footprint; import/export proxies; complex ownership; rapid address changes
Auto-review when: High-risk jurisdictions + thin operating evidence; recent formation; opaque ownership; mismatched industry/location; sanctions hits/near-matches; frequent name/brand changes

Failure mode: Synthetic identity businesses (fabricated owners)
Enigma V2 signals: Person ID resolution; SSN/identity validation; name/address/phone coherence; phone age; email domain quality; multi-entity reuse of owner; lack of credit/identity footprint (if integrated)
Auto-review when: Owner fails verification; inconsistent person attributes across sources; owner linked to many newly formed entities; virtual address + no operating evidence; disposable email/phone patterns

Failure mode: Nominee directors & straw owners
Enigma V2 signals: Graph linkages: person controls many entities; shared addresses/phones; rapid filings; role overlap (registered agent/officer); ownership opacity; same nominee across states
Auto-review when: Officer/owner appears across many unrelated entities; RA/CMRA heavy; repeated contact info; mismatch between claimed ops and footprints; complex layers without operations

Failure mode: Cash-intensive businesses misrepresented
Enigma V2 signals: MCC/NAICS mapping vs description; location type (residential) vs cash-heavy claims; web presence; reviews/POI presence; hours; signage proxies; high-risk vertical patterns
Auto-review when: Cash-heavy vertical claimed but residential address; no POI signals (maps/reviews); MCC mismatch; no operating hours/phone listing; sudden volume spikes

Failure mode: E-comm drop-ship / counterfeit
Enigma V2 signals: Website age, content quality; product category risk; brand/trademark mismatch proxies; fulfillment address mismatch; contact networks; customer service signals; jurisdiction risk
Auto-review when: Thin/templated website; recently registered domain; high-risk product categories; inconsistent addresses; contact reuse; no returns/support footprint

Failure mode: Charity/NGO diversion
Enigma V2 signals: Nonprofit status validation (if integrated); board/officer linkages; donation platform mismatch; address type; web presence; rapid formation; related-party networks
Auto-review when: New nonprofit w/ virtual address and minimal online presence; officers tied to many entities; unclear programs; high-risk jurisdiction; mismatch between stated mission and ops footprint

Failure mode: Real estate / escrow misuse
Enigma V2 signals: Industry + licensing (if integrated); operating footprint; address type; ownership complexity; multiple entities per address; high-value transaction patterns (if available)
Auto-review when: Escrow/real estate claimed with no licensing/operating signals; virtual address; opaque ownership; multiple related entities; high-risk jurisdiction ties

From “Looks Fine on Paper” to “Verified, Explainable, Defensible”

The gap between paper presence and operating reality is where KYB fails. Registration artifacts — an LLC filing, a registered agent, an active status — are easy to manufacture. Operating signals — revenue, reviews, deliverable addresses, phone validation — are not.

Enigma’s KYB v2 closes that gap with multi-signal verification that weights operating presence over paper artifacts. Confidence tiers escalate low-signal entities for manual review. Address intelligence flags CMRA/RA concentration. Person verification handles sole proprietors. OFAC integration keeps sanctions checks in-flow.

The result: fewer false positives, faster true-risk detection, and decisions that are auditable from day one.

Our Methodology

Data source: Enigma’s full “Brands” database, a pre-flattened SQL delivery table combining brand profiles with embedded corporate registrations, addresses, contacts, and card transaction signals. Data as of December 2025. Universe: ~32.5 million brand entities.

Scoring

Paper presence (0–3 points): Has registration (+1), registration active (+1), has registered agent (+1).

Operating presence (0–5 points): Card revenue in past 12 months (+1), customer reviews (+1), phone on file (+1), website on file (+1), open location (+1).

Confidence tiers: High (operating ≥4), Medium (2–3), Low (1), Paper-Only (operating=0, paper>0), Unknown (neither).
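The scoring rules above translate almost directly into code. The following is a minimal sketch: the `BrandSignals` class and its field names are hypothetical stand-ins, since the actual column names of the delivery table are not given here.

```python
from dataclasses import dataclass


@dataclass
class BrandSignals:
    # Hypothetical field names; the real table's columns are not specified.
    has_registration: bool
    registration_active: bool
    has_registered_agent: bool
    card_revenue_12m: bool
    has_reviews: bool
    has_phone: bool
    has_website: bool
    has_open_location: bool


def paper_score(b: BrandSignals) -> int:
    """Paper presence, 0-3 points (one point per registration artifact)."""
    return sum([b.has_registration, b.registration_active, b.has_registered_agent])


def operating_score(b: BrandSignals) -> int:
    """Operating presence, 0-5 points (one point per operating signal)."""
    return sum([b.card_revenue_12m, b.has_reviews, b.has_phone,
                b.has_website, b.has_open_location])


def confidence_tier(b: BrandSignals) -> str:
    """Map the two scores onto the tiers listed above."""
    op, paper = operating_score(b), paper_score(b)
    if op >= 4:
        return "High"
    if op >= 2:
        return "Medium"
    if op == 1:
        return "Low"
    # op == 0: paper-only if any registration artifact exists, else unknown.
    return "Paper-Only" if paper > 0 else "Unknown"
```

For example, an entity with an active registration and a registered agent but no revenue, reviews, phone, website, or open location scores paper = 3, operating = 0, and lands in the Paper-Only tier.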

Classifications

Address clusters (20+ co-located brands):

Likely RA/CMRA: 100+ brands, <30% with revenue
Possible RA/CMRA: 50+ brands, <40% with revenue
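The cluster thresholds can be read as a small decision function. This is a sketch under assumptions: the function name and the two fallback labels ("cluster", "not a cluster") are illustrative, and `revenue_share` is taken to be the fraction of co-located brands with card revenue.

```python
def classify_address_cluster(brand_count: int, revenue_share: float) -> str:
    """Label a shared address.

    brand_count: number of brands registered at the address.
    revenue_share: fraction (0.0-1.0) of those brands showing card revenue.
    """
    if brand_count < 20:
        return "not a cluster"          # below the 20-brand cluster floor
    if brand_count >= 100 and revenue_share < 0.30:
        return "likely RA/CMRA"
    if brand_count >= 50 and revenue_share < 0.40:
        return "possible RA/CMRA"
    return "cluster"                    # dense address, but revenue looks normal
```

Under these rules, the 1,913-brand location mentioned earlier would be labeled "likely RA/CMRA" if its revenue share were 21%.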

Temporal patterns:

PPP era: Incorporated 2020–2021
Zombie entities: 10+ years old, paper score ≥2, operating score = 0
Rapid proliferation: 10+ brands at same address within 3-year window
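One way to express the temporal rules is shown below. This is a sketch, not the production logic: the "3-year window" is approximated as 1,095 days, entity age uses a 365.25-day year, and all function names are hypothetical.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=3 * 365)  # assumption: 3-year window as 1,095 days
MIN_BRANDS = 10


def is_ppp_era(formed: date) -> bool:
    """Incorporated in 2020 or 2021."""
    return formed.year in (2020, 2021)


def is_zombie(formed: date, paper: int, operating: int, as_of: date) -> bool:
    """10+ years old, paper score >= 2, operating score == 0."""
    age_years = (as_of - formed).days / 365.25
    return age_years >= 10 and paper >= 2 and operating == 0


def is_rapid_proliferation(formation_dates: list[date]) -> bool:
    """True if any 10 formations at one address fall inside the window."""
    ds = sorted(formation_dates)
    # Slide over the sorted dates: compare each date with the one 9 places later.
    for i in range(len(ds) - MIN_BRANDS + 1):
        if ds[i + MIN_BRANDS - 1] - ds[i] <= WINDOW:
            return True
    return False
```

Sorting first means the check only has to compare the first and last date of every run of ten, rather than testing every combination of brands at the address.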

Franchise detection: Same brand name with 3+ distinct legal entities across 5+ records.
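The franchise rule amounts to a group-by over brand names. The sketch below assumes records arrive as `(brand_name, legal_entity_name)` pairs and that names are compared after trimming and lowercasing; both choices are illustrative rather than documented.

```python
from collections import defaultdict


def franchise_candidates(records):
    """Return brand names with 3+ distinct legal entities across 5+ records.

    records: iterable of (brand_name, legal_entity_name) pairs (hypothetical shape).
    """
    entities = defaultdict(set)   # brand -> distinct legal entity names
    counts = defaultdict(int)     # brand -> number of records
    for brand, legal in records:
        key = brand.strip().lower()
        entities[key].add(legal.strip().lower())
        counts[key] += 1
    return {b for b in entities if len(entities[b]) >= 3 and counts[b] >= 5}
```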

Limitations

Revenue signals are based on card transactions; cash-only businesses may appear as paper-only despite genuine operations. Some industries (religious institutions, cemeteries) legitimately lack card revenue or online reviews. Contact graph analysis (shared phones/domains) was limited by field availability.

Peak Revenue Month: Why Payments Companies Are Seeing 2x Response Rates With Temporal Targeting

Enigma — Mon, 15 Dec 2025 00:00:00 GMT

Businesses don't make purchasing decisions at random times. A tax prep firm generating 70% of revenue between February and April makes virtually all strategic purchases between September and November. Ice cream shops peaking in July plan in January. Ski resorts hitting peak in January plan in July.

This pattern, validated across two years of controlled A/B testing with a major direct mail consultancy, has now been operationalized as a data attribute in Enigma Enterprise. For payments companies specifically, the results are striking: response rates for merchant services campaigns nearly double when timed to a business's planning window.

The Evidence: Two Years of Controlled Testing

Our partner has been testing Enigma data against other providers since 2023. Two findings stand out.

Finding 1: The Six-Month Window

The consultancy retroactively appended peak month information to Enigma records used in demand generation campaigns for business banking. The goal was to identify whether timing relative to a business's peak month affected response rates.

It did. Dramatically.

For merchant account campaigns, businesses contacted 6-8 months before their peak showed indexed response rates of 174-187 against a baseline of 100. Businesses contacted just after their peak (month 12) showed indexed rates of 57. That's a 3x difference in response rates based purely on timing.

The pattern held for total business response as well: 153 indexed at 6 months versus 72 at 12 months. The mechanism is intuitive. Six months out, the upcoming season feels real enough to drive action but far enough to allow implementation. At 12 months, you're pitching to someone who just finished their busy season and isn't thinking about the next one yet.
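The ratios quoted above follow directly from the indexed rates. As a quick arithmetic check (the function name is illustrative):

```python
def timing_ratio(index_in_window: float, index_after_peak: float) -> float:
    """Ratio of two indexed response rates measured against the same baseline of 100."""
    return index_in_window / index_after_peak


# Merchant account campaigns: 174-187 at 6-8 months out vs 57 at month 12.
merchant_low = timing_ratio(174, 57)
merchant_high = timing_ratio(187, 57)

# Total business response: 153 at 6 months vs 72 at 12 months.
total_business = timing_ratio(153, 72)

print(f"merchant accounts: {merchant_low:.1f}x to {merchant_high:.1f}x")
print(f"total business response: {total_business:.1f}x")
```

The merchant account range works out to roughly 3.1x to 3.3x, and total business response to about 2.1x.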

Finding 2: Enigma Data Quality

In a recent performance review, our partner presented results across five regional banks comparing Enigma data against all other providers. The merchant services category showed a 297% lift for Enigma's transaction-based dataset. Lending showed 346% lift.

But here's what caught our attention: even Enigma's marketable records, businesses flagged as active without full transaction history, outperformed legacy providers. The combination of transaction signals we use to determine marketability creates a baseline that beats what others consider their best data.

Why This Matters for Payments

If you're selling merchant services, payment processing, or business banking products, you're competing for attention with every other vendor trying to reach the same SMBs. Most are sending campaigns continuously, hoping to catch prospects at the right moment by chance.

The data suggests a different approach. Instead of constant nurture campaigns with mediocre engagement, concentrate your outreach on businesses entering their planning window. A landscaping company making equipment decisions in December for their May-September peak. A restaurant evaluating POS systems in January for their summer rush.

The lift compounds. Enigma's core dataset already outperforms competitors by 297% for merchant services. Add temporal targeting on top of that, and you're reaching better prospects at better times.

The Product: Peak Revenue Month as an Enrichable Attribute

Peak revenue month is now available through Enigma Enterprise, calculated from transaction data covering 40% of U.S. consumer credit and debit card volume. For each business identified as seasonal (approximately 214,000 entities at 87% precision), we provide:

Primary Peak Month: The calendar month showing highest transaction volume over the trailing 12 months.

Seasonality Score: How pronounced the seasonal pattern is. Strong peaks mean predictable planning windows.

Planning Window Indicators: Two derived fields, peakmonth-6month for strategic outreach and peakmonth-1month for urgency campaigns.

Unlike weekly triggers that catch immediate signals (new business formed this week), these are longer-range strategic triggers. The -6 month trigger identifies businesses entering their planning window; the -1 month trigger catches urgency as peak approaches.
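To make the trigger arithmetic concrete, here is a minimal sketch of deriving a peak month and its two planning windows from twelve monthly totals. It assumes months are numbered 1 to 12 and that the derived fields hold a calendar month; the actual field encoding in Enigma Enterprise is not specified here, and the function names are hypothetical.

```python
def peak_month(monthly_volume: list[float]) -> int:
    """Return the month (1-12) with the highest volume; index 0 is January."""
    if len(monthly_volume) != 12:
        raise ValueError("expected 12 monthly totals")
    return max(range(12), key=lambda i: monthly_volume[i]) + 1


def months_before(month: int, offset: int) -> int:
    """Calendar month that falls `offset` months before `month`, wrapping the year."""
    return (month - 1 - offset) % 12 + 1


def planning_windows(month: int) -> dict[str, int]:
    """Derive the two trigger months named in the text from a peak month."""
    return {
        "peakmonth-6month": months_before(month, 6),
        "peakmonth-1month": months_before(month, 1),
    }
```

With a July peak, the -6 month trigger falls in January and the -1 month trigger in June, matching the ice cream shop example from the introduction.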

Access these attributes via our GraphQL API for real-time campaign triggering, through bulk file enrichment to transform your prospect database, or via integration partners like Clay for automated workflows.

Getting Started

Peak revenue month data is available exclusively through Enigma Enterprise. Our team can help you identify which seasonal segments in your market show the strongest patterns and design an implementation approach that fits your campaign operations.

For payments companies, the combination of Enigma's core data quality (297% lift for merchant services) plus strategic timing (response rates 2-3x higher when contacting businesses in their planning window versus just after peak) represents a fundamental advantage over competitors still sending batch-and-blast campaigns.

Contact our enterprise data team at enigma.com/contact-us or speak with your account representative about adding peak revenue month to your existing data package.

Peak revenue month calculations based on 214,000 U.S. businesses. Seasonality identification at 87% precision. Validated in controlled testing from 2023 to 2025 across multiple regional banks.

Lifecycles of Vice Merchants

Enigma — Thu, 11 Dec 2025 00:00:00 GMT

Common wisdom says vice businesses are recession-proof. People drink, smoke, and gamble regardless of economic conditions. But business registration data tells a more complex story about which vice businesses survive and which ones fail.

Our analysis of 853,797 vice industry registrations reveals dramatic differences in business longevity across sub-sectors. Some vice industries show remarkable staying power. Others experience catastrophic failure rates.

What Counts as “Vice”

This analysis focuses on traditional vice industries with substantial federal and state regulation: alcohol production and distribution (breweries, wineries, distilleries, wholesalers, bars, liquor stores), tobacco (manufacturing and retail), and gambling (casinos, racetracks, gaming establishments).

What's NOT included: We excluded adult entertainment (strip clubs, adult stores) and cannabis businesses. Adult entertainment has different regulatory frameworks that make cross-industry comparisons problematic (many strip clubs are classified as bars, for instance). Likewise, cannabis remains federally illegal despite state legalization, which creates registration patterns that don't compare meaningfully to other vice categories.

The Survival Landscape

Survival rates for US vice businesses have been trending upwards for years now. Looking at cohorts of newly established businesses and tracking their three-year survival rates, we see that vice businesses established in 2021 had a three-year survival rate of roughly 72% versus just under 60% for those established in 2010.

Of 853,797 total vice industry registrations from 2010-2024, the status counts below sum to 458,096, and the percentages are shares of that sum:

311,382 registrations (68.0%) are currently active
137,378 registrations (30.0%) are inactive, dissolved, or expired
9,336 registrations (2.0%) have unknown status

But this overall 68% survival rate masks considerable variation by industry type and business model.

The 2015 Cohort: A Natural Experiment

Businesses registered in 2015 are now 9+ years old, providing a natural experiment in long-term viability. These businesses faced normal economic conditions through 2019, then COVID-19 disruption from 2020-2022, then inflation in 2023-2024.

2015 cohort survival rates (9+ years later):

Gambling Industries: 84.1%
Beer and Ale Merchant Wholesalers: 81.0%
Wineries: 78.0%
Breweries: ~75%
Racetracks: 73.0%
Tobacco Manufacturing: 71.4%
Tobacco Stores: ~55%

The breweries and wineries that started in 2015 weathered all three economic shocks better than tobacco retail or small gambling operations, demonstrating structural resilience beyond just “vice always sells.”

Longevity by Vice Category

Survival rates by industry (2010-2024 cohorts):

Top Performers:

Gambling Industries: 77.7% survival rate
Distilleries: 73.5%
Breweries: 73.2%
Wineries: 72.0%
Beer and Ale Merchant Wholesalers: 71.4%
Beer, Wine, and Liquor Retailers: 70.5%
Wine and Distilled Beverage Wholesalers: 70.5%

Lower Performers:

Other Gambling Industries (non-casino): 47.2%
Beverage Manufacturing (general): 63.7%
Tobacco Stores: 64.3%

The pattern is clear: B2B businesses (wholesalers, distributors) outlive B2C businesses (retail stores, bars), while capital-intensive operations (breweries, wineries, distilleries) outlive low-barrier entries (tobacco shops, small gambling operations).

Average Vice Business Lifespan

For businesses that have closed, we can calculate actual lifespan from registration issue date to expiration date. This tells us how long failed businesses survived before closing.

Average lifespan for closed businesses:

Racetracks: 12.9 years
Beer, Wine, and Liquor Retailers: 12.2 years
Beer and Ale Merchant Wholesalers: 12.1 years
Gambling Industries: 11.4 years
Wine and Distilled Beverage Wholesalers: 9.9 years

Even businesses that eventually failed in wholesale and distribution lasted over a decade on average. This reflects the sustainability of business models built on regulatory protection and capital intensity. To hazard a guess, racetracks likely closed due to secular industry decline (competition from casinos) rather than operational failure.

Age of Currently Active Vice Businesses

Looking at businesses that are still operating, some industries show surprisingly high median ages:

Tobacco Manufacturing: 17 years
Beer and Ale Merchant Wholesalers: 16 years
Wine and Distilled Beverage Wholesalers: 16 years
Wineries: 13 years
Racetracks: 12 years

This suggests an interesting paradox: tobacco manufacturing has the oldest surviving businesses, but that doesn't mean high survival rates. It means few new entrants (high regulatory barriers) and old survivors. Contrast this with breweries and wineries, which show both old median ages and high survival rates.

Why Some Vice Businesses Survive

Factors correlating with longevity:

1. Capital intensity: High startup costs create barriers preventing over-saturation. A brewery requires significant equipment, real estate, and inventory. A tobacco store just needs a retail lease and inventory. The difference in upfront capital also creates differences in competitive dynamics.

2. Distribution relationships: Established wholesale networks are hard to replicate. Most states require three-tier alcohol distribution (producer → wholesaler → retailer), legally protecting wholesalers from disruption. Breweries can’t easily switch distributors, and new distributors can’t easily steal accounts.

3. Regulatory moats: License scarcity (gambling, alcohol distribution) protects incumbents. In many jurisdictions, liquor licenses are limited by population ratios or require existing license transfers. This caps market entry.

4. Brand equity: Consumer loyalty for breweries and wineries builds over time. Craft beer and wine consumers show strong preferences for specific brands and local producers, creating sustainable competitive advantages.

5. Real estate ownership: Owning vs. renting determines survival during downturns. Businesses that own their production facilities or retail locations survive economic shocks better than those paying market-rate leases.

When Do Vice Businesses Die?

Survival curves reveal different failure patterns by industry.

Most vice businesses that are going to fail tend to do so within the first five years. But timing varies significantly:

Tobacco retail: Steep early decline, losing 20-25% of businesses in years 1-3. This suggests intense competition and thin margins from day one.
Drinking establishments: Gradual, steady attrition, losing 10-15% every few years. Bars face consistent challenges (competition, rent, changing consumer preferences) rather than acute early failure.
Breweries and wineries: Strong early survival, with failures concentrated after years 5-7. Initial capital investment and brand building provide runway, but businesses that haven't achieved sustainable scale by year 5 face increasing pressure.

This timing matters for investors and entrepreneurs: retail tobacco requires immediate profitability, while breweries can survive initial losses if they achieve scale by year 5.

The Takeaway: Structure Beats Product

Our analysis of vice businesses by sub-industry challenges the myth of uniformly “recession-proof” vice. Some vice businesses thrive due to structural advantages. Others struggle despite selling addictive products.

The difference isn’t what they sell. It’s how they sell it. Distribution beats retail. Capital intensity beats low barriers. Regulation creates moats.

So if you’re betting on vice businesses surviving the next recession, bet on the wholesalers and capital-intensive producers, not the retail storefronts. Demand for alcohol and tobacco may be recession-proof, but business models still determine which specific companies capture that demand profitably.

Methodology

Data Source: Business registration data for vice industries (NAICS codes covering breweries, wineries, distilleries, tobacco manufacturing and retail, gambling establishments, drinking places, beer/wine/liquor retailers, and alcohol wholesalers). Data spans registrations from 2010-2024.

Sample Size: 853,797 registrations. All records include industry classifications.

Industries Excluded: Cannabis businesses (federally illegal, inconsistent state registration patterns) and adult entertainment (different regulatory frameworks).

Survival Rate Calculation: Percentage of registrations with “active” status. Registrations marked inactive, dissolved, expired, or unknown are treated as non-surviving. Unknown status represents approximately 2% of records.

Cohort Analysis: Businesses grouped by registration issue year to control for age effects. Comparing businesses registered in the same year ensures fair comparison. A 2010 business has had 14 years to fail, while a 2020 business has had only 4 years.
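The survival rate and cohort grouping described above can be sketched in a few lines. The `(issue_year, status)` record shape and the literal status string "active" are assumptions about how the registration data is laid out.

```python
from collections import defaultdict


def survival_rate(statuses) -> float:
    """Percentage of registrations whose status is 'active'.

    Any other status (inactive, dissolved, expired, unknown) counts as non-surviving.
    """
    statuses = list(statuses)
    if not statuses:
        raise ValueError("no registrations to score")
    active = sum(1 for s in statuses if s == "active")
    return 100.0 * active / len(statuses)


def cohort_survival(records) -> dict[int, float]:
    """Survival rate per registration issue year.

    records: iterable of (issue_year, status) pairs (hypothetical shape).
    """
    by_year = defaultdict(list)
    for year, status in records:
        by_year[year].append(status)
    return {year: survival_rate(s) for year, s in sorted(by_year.items())}
```

Applied to the status counts reported earlier, 311,382 active out of 311,382 + 137,378 + 9,336 registrations gives approximately 68.0%.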

Lifespan Calculation: For closed businesses with both registration issue dates and expiration dates. Calculated as days between issue and expiration, converted to years. Limited to lifespans between 0 and 50 years to exclude data errors. Only industries with 100+ closed business records shown for statistical reliability.
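A sketch of the lifespan calculation follows. It assumes a 365.25-day year for the days-to-years conversion, treats the 0 to 50 year bounds as inclusive, and uses a hypothetical `(industry, issue_date, expiration_date)` record shape; none of these details are stated in the methodology.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def lifespan_years(issued: date, expired: date) -> float:
    """Days between issue and expiration, converted to years."""
    return (expired - issued).days / 365.25


def average_lifespans(closed, min_records: int = 100) -> dict[str, float]:
    """Mean lifespan per industry for closed businesses.

    closed: iterable of (industry, issue_date, expiration_date) tuples.
    Lifespans outside 0-50 years are dropped as likely data errors, and
    industries with fewer than `min_records` usable records are omitted.
    """
    by_industry = defaultdict(list)
    for industry, issued, expired in closed:
        years = lifespan_years(issued, expired)
        if 0 <= years <= 50:
            by_industry[industry].append(years)
    return {k: mean(v) for k, v in by_industry.items() if len(v) >= min_records}
```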

Statistical Thresholds: Industries and cohorts with fewer than 50-100 registrations excluded from specific analyses to ensure meaningful results. Exact thresholds noted in each analysis section.

Age Calculations: For currently active businesses, age calculated as years from registration issue date to present (December 2024). Only businesses registered 1990 or later included to avoid extremely old registrations that may represent data artifacts.

Geographic Note: State-level geographic analysis proved unreliable due to businesses registering in one state (often Delaware for tax purposes) while operating in another. Geographic findings excluded from this analysis.

Date of Analysis: December 2024. Survival status reflects business status as of data extraction.

Limitations:

Registration status may lag actual business operations (businesses may have ceased operations before formal registration expiration)
Expiration dates not available for all inactive businesses, limiting lifespan analysis to subset with complete date records
Industry classifications based on NAICS codes, which may not capture all business activities (e.g., a bar that’s also a restaurant)

The New Enigma KYB: More Automatic Approvals, Fewer Manual Reviews

Enigma — Tue, 09 Dec 2025 00:00:00 GMT


A landscaping company applies for a business checking account. Three days later, they open one with your competitor.

What happened? They submitted "Green Thumb Landscaping" but they're registered as "GTL Services LLC." Your KYB provider saw a name mismatch and routed to manual review. By the time an analyst confirmed they're the same business, the customer was gone.

Your manual review queue is larger than it needs to be. Not because of fraud. Not because of risk. Because traditional KYB providers compare lines of text instead of understanding businesses.

Enigma KYB API changes that. Entity resolution that matches operating names to legal entities. SSN verification for sole proprietors. Principal identity checks. Address intelligence that catches fraud. We auto-verify submissions that other providers can't touch.

The Numbers: What Entity Resolution on Business Identity Actually Delivers

In customer evaluations comparing V1 to V2 of the Enigma KYB API:

+13 percentage-point improvement in name + address auto-verification rates
99% precision on legal entity matches with high-quality input data
92% website fill rate when brand entities are matched (vs. 11% with legacy approaches)
~1 second median response time

Other KYB providers benchmark auto-approval at 50%. They check registered businesses against Secretary of State records. That's it.

Enigma delivers ~70% auto-verification because we go beyond SOS. We verify sole proprietors who file under SSNs. We match DBAs to legal entities. We connect operating names to state registrations even when the strings don't match.

It's not a marginal improvement. It comes from a fundamentally different approach predicated on understanding businesses.

Beyond Secretary of State: Why Entity Resolution Changes KYB

Traditional KYB providers compare lines of text. Enigma understands businesses. Our entity resolution technology automatically connects:

Operating names (DBAs, brand names, trade names) to legal entities (state registrations)
Sole proprietor identities (name + SSN) to business licenses and tax records
Officer names to Secretary of State filings, matching principals against the authoritative record
Registrations from all 52 US jurisdictions, updated continuously

The result: matches that others miss, at scale.
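To illustrate why text comparison alone sends the landscaping example to manual review, the sketch below scores the two names with Python's standard `difflib` and then resolves them through an alias table. The alias table is a hypothetical stand-in for illustration only; it is not a description of how Enigma's entity resolution works.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two business names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical alias table mapping operating names (DBAs) to legal entities.
DBA_TO_LEGAL = {
    "green thumb landscaping": "GTL Services LLC",
}


def resolve_legal_entity(submitted_name: str):
    """Return the legal entity linked to an operating name, or None if unknown."""
    return DBA_TO_LEGAL.get(submitted_name.strip().lower())


score = name_similarity("Green Thumb Landscaping", "GTL Services LLC")
legal = resolve_legal_entity("Green Thumb Landscaping")
```

The string score stays well below what any sensible match threshold would accept, while the alias lookup returns the registered entity directly. The point is that the link between the two names lives in data about the business, not in the characters of the names.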

For the 30% That Need Review: Context, Not Data Gathering

Auto-verification handles most submissions. But what about the rest?

Competing providers return pass/fail on basic datapoints (name, address, EIN, ownership) and leave analysts to chase down context from disparate sources. Enigma returns the complete business identity package:

Secretary of State filings with registration status and officer names
Website verification for legitimacy signals
Industry classification for risk-appropriate decisioning
Alternative addresses when the primary doesn't match
Transaction signals showing actual business activity

Analysts become decision-makers, not data gatherers. The submissions requiring review get resolved faster because the context is already assembled.

What's New: Verifying Sole Proprietors, Principals, and Risky Addresses

Sole Proprietors: SSN Verification

According to the SBA, 82% of US businesses operate without employees. The vast majority are sole proprietorships. These businesses don't have EINs. They file taxes under their Social Security Number.

Until now, this created a verification dead end. No EIN meant no automated path. Rejecting all unregistered sole proprietorships means turning away the majority of American businesses.

The SSN verification task validates submitted SSNs against IRS records. For contractors, freelancers, gig workers, and consultants, this unlocks automated verification where manual review (at $20-50 per case) was previously required.

Principals: Person Verification

Business legitimacy isn't just about the entity. It's about who's behind it. The person verification task matches submitted names against officer records on Secretary of State filings, connecting business identity to human identity automatically.

For beneficial ownership requirements and fraud prevention, this adds a verification layer without manual investigation.

Risky Addresses: Intelligence That Catches Fraud

New address attributes surface risk signals that would otherwise require manual investigation:

Virtual address detection: A business claims headquarters at a UPS Store or Regus location. The virtual flag automatically identifies Commercial Mail Receiving Agencies. No analyst lookup required.

Deliverability validation: An applicant submits an address that doesn't exist or can't receive mail. USPS delivery point validation catches it before you send welcome materials to nowhere.

Residential vs. commercial: A "commercial trucking company" lists a residential apartment as headquarters. The rdi classification surfaces the mismatch for risk review.

These signals feed directly into decisioning logic. Fraud patterns that took analysts 15 minutes to investigate now resolve in milliseconds.
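
As a rough illustration of how address signals like these might feed a decisioning rule, here is a minimal Python sketch. The field names (`virtual`, `deliverable`, `rdi`), the flag labels, and the routing outcomes are assumptions made for this example, not the actual Enigma KYB API response schema.

```python
from dataclasses import dataclass


@dataclass
class AddressSignals:
    # Hypothetical field names; the real response schema may differ.
    virtual: bool      # address belongs to a Commercial Mail Receiving Agency
    deliverable: bool  # address passes USPS delivery point validation
    rdi: str           # "residential" or "commercial"


def address_risk_flags(signals: AddressSignals, expects_commercial: bool) -> list[str]:
    """Collect the address-level reasons an application might need review."""
    flags = []
    if not signals.deliverable:
        flags.append("undeliverable_address")
    if signals.virtual:
        flags.append("virtual_address")
    if expects_commercial and signals.rdi == "residential":
        flags.append("residential_mismatch")
    return flags


def route(signals: AddressSignals, expects_commercial: bool) -> str:
    """Send an application to review if any address flag fires."""
    return "review" if address_risk_flags(signals, expects_commercial) else "auto_verify"


# A trucking company that lists a residential apartment as headquarters.
trucking = AddressSignals(virtual=False, deliverable=True, rdi="residential")
print(address_risk_flags(trucking, expects_commercial=True))
```

Returning the list of flags rather than a bare pass/fail keeps the reason for the routing decision attached to it, which is useful to whoever picks up the review.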

Two tiers to fit your compliance needs

Identify: Business entity matching, address verification, data enrichment. For marketplace onboarding, SaaS customer verification, pre-fill workflows.

"Instacart" is registered as "Maplebear Inc." "Brooklyn Pizza" operates as "JPR Holdings LLC." A freelance consultant does business as "Sarah Chen Consulting" but files taxes under her SSN.

Verify: Everything in Identify + registration status, person verification, full audit trails. For compliance work in payment processing, lending, and business banking.

Both packages support add-ons: TIN verification, SSN verification, OFAC watchlist screening.

Check out the documentation to learn more.

Getting Started

Enigma KYB API integrates natively with Alloy, Taktile, Oscilar, and Hummingbird. Or integrate directly via REST API.

Contact our team to start the conversation.

Enigma KYB API: ~80% coverage, ~70% auto-verification rates, ~1 second response times. Entity resolution that matches operating names to legal entities across all 52 US jurisdictions.

The Geography of Corporate America

Enigma — Tue, 25 Nov 2025 00:00:00 GMT

Observing the Landscape of Registered Business Addresses

Look up a company’s “registered address” and you might expect a headquarters or office. In reality, a tiny slice of locations do outsized work as the supposed homes of American corporations. Inside Enigma graph-model-1, just 128,124 addresses (0.21% of all unique addresses) are linked to about 106.6 million registrations — and those high-throughput addresses account for about 26.6% of the 400.4 million registrations we observe. That footprint belongs to the registered-agent (RA) infrastructure: the legal mailboxes corporate America relies on.
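
The two shares quoted above can be reproduced from the rounded totals. The snippet below is a quick arithmetic check, not Enigma's pipeline; the total of roughly 61 million unique addresses comes from the Methodology section, and the variable names are ours.

```python
# Rounded figures as quoted in this article.
high_throughput_addresses = 128_124
high_throughput_registrations = 106.6e6
total_registrations = 400.4e6
total_addresses = 61e6  # approximate, from the Methodology section

address_share = high_throughput_addresses / total_addresses
registration_share = high_throughput_registrations / total_registrations

print(f"{address_share:.2%} of addresses")           # 0.21% of addresses
print(f"{registration_share:.1%} of registrations")  # 26.6% of registrations
```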

The concentration of registered addresses

Two Delaware locations host by far the most registrations. If you drive by these offices, though, you won’t see millions of employees. These are industrial-scale RA operations handling service of process and official notices for companies incorporated nationwide. Nor is this just a Delaware story: the RA pattern repeats across states. Let’s look at the locations hosting the largest number of corporate registrations in America.

Top 10 most concentrated addresses (all states, by linked registrations):

251 Little Falls Dr, Wilmington, DE — 2,106,681 registrations

1209 N Orange St, Wilmington, DE — 2,079,340 registrations

28 Liberty St, New York, NY — 936,362 registrations

1025 Capital Center Dr, Frankfort, KY — 860,412 registrations

2710 Gateway Oaks Dr, Sacramento, CA — 845,562 registrations

1200 S Pine Island Rd, Fort Lauderdale, FL — 836,218 registrations

2 N Jackson St, Montgomery, AL — 730,143 registrations

1999 Bryan St, Dallas, TX — 703,857 registrations

1201 Hays St, Tallahassee, FL — 673,762 registrations

600 W Main St, Jefferson City, MO — 660,060 registrations

Half of the top ten addresses sit in state capitals (Sacramento, Tallahassee, Montgomery, Jefferson City, and Frankfort). That pattern is consistent with legal logistics: RA offices gain from being near agencies and courts, and in some cases interact with state service-of-process channels. It’s a reminder that registration is about legal routing, not customer foot traffic.

Where RA markets are most consolidated

We estimate consolidation by registrations per high-throughput address (states with ≥100k total registrations in our lens):

Delaware — 91,987 per address (81 addresses, ~7.5M registrations)
Wisconsin — 7,415 per address (104 addresses, ~771k)
Michigan — 6,478 per address (267 addresses, ~1.7M)
South Carolina — 2,739 per address (540 addresses, ~1.5M)
California — 1,539 per address (3,705 addresses, ~5.7M)

Delaware’s density is an order of magnitude higher than any other large state—unsurprising given its long-standing corporate law ecosystem and the network effects around it.
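
The consolidation metric is a ratio: registrations at high-throughput addresses divided by the number of such addresses, per state. A minimal sketch on made-up rows, assuming the data arrives as (state, address, registration count) tuples and using the ≥45 threshold from the Methodology section; the function name is ours, and the ≥100k per-state filter is left out for brevity:

```python
from collections import defaultdict

HIGH_THROUGHPUT_MIN = 45  # threshold described in the Methodology section


def registrations_per_address(rows, min_registrations=HIGH_THROUGHPUT_MIN):
    """rows: iterable of (state, address, registration_count) tuples."""
    totals = defaultdict(lambda: [0, 0])  # state -> [registrations, addresses]
    for state, _address, count in rows:
        if count >= min_registrations:  # keep only high-throughput addresses
            totals[state][0] += count
            totals[state][1] += 1
    return {state: regs / addrs for state, (regs, addrs) in totals.items()}


# Made-up rows for illustration only.
rows = [
    ("DE", "addr-1", 900),
    ("DE", "addr-2", 100),
    ("DE", "addr-3", 3),   # below the threshold, ignored
    ("WI", "addr-4", 60),
]
print(registrations_per_address(rows))  # {'DE': 500.0, 'WI': 60.0}
```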

Why this matters (and what it doesn’t tell you)

Registered address ≠ operating location.

A San Francisco startup, an Ohio restaurant group, and a New York hedge fund can all share a Wilmington mailbox. Registration addresses are poor proxies for economic activity, which is why Enigma pairs them with operating-location signals (payments, payroll, permits, geospatial footprints) when we need to map real-world presence.

The RA industry has scale economics.

A handful of large providers (e.g., CSC, CT, Registered Agents Inc., others) handle massive volumes with standardized processes. That scale shows up in the data as extreme concentration at a relatively small set of addresses.

The RA industry has broader implications.

Economic development — The registry map is a story about legal gravity, not where commerce happens. Delaware and several capitals show concentrations because they collect paperwork, not payroll. In Enigma’s data, that contrast is visible the moment you look beyond the mailbox: the legal layer concentrates into a few super-nodes, while operating signals disperse across retail corridors, logistics belts, and office parks. Read together, the two layers explain why a tiny building in Wilmington can “house” millions of companies even as the day-to-day economy shows up somewhere else.
BI & risk — Registered-agent hubs behave like high-degree nodes in a business identity graph. They’re useful landmarks — thousands of entities touch them — but they can also distort a naïve map of customers or counterparties. Enigma’s unified profiles separate the legal routing address from the places where a business actually appears in the world, so the cluster becomes context rather than confusion. That shift makes patterns legible: roll-ups that share an agent, SPVs that live entirely on paper, and operating brands that share a legal backbone but diverge in footprint.
KYB/AML — A crowded RA address is ordinary — the interesting signal is the pattern around it. In Enigma’s KYB lens, the narrative changes when an entity at a super-node also lacks foreign registrations where activity should exist, shows no operating traces, or rotates agents quickly while ownership overlaps proliferate. The inverse is just as telling: long-lived holdings with stable filings and corroborating activity look exactly like the legitimate infrastructure the RA universe was built to serve. The phenomenon isn’t the mailbox — it’s what the rest of the data says once the mailbox is accounted for.

Methodology

Source: Data gathered from Secretary of State filings for all U.S. states (and applicable territories), normalized by Enigma. StreetView images by Google.
Lens: We focus the RA view on addresses linked to ≥45 registrations—a threshold that captures professional RA sites and filters out single-tenant locations. This “high-throughput” slice contains ~106.6M registrations tied to ~128,124 addresses. The full registry covers ~400.4M registrations across ~61M addresses.
Address standardization: We deduplicate on a normalized full-address string (street1, city, state, ZIP). Minor formatting differences and suite variations can under/over-roll up a physical site.
Timeframe: Active and historical registrations through 2024; historical entries for dissolved entities remain, so current RA concentration may be lower.
Caveats:

- Registration address is not an operating location.
- We cannot always attribute a given address to a specific RA firm without separate attribution logic.
- Several top addresses sit in multi-tenant or government buildings that aggregate filings.
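
To make the address standardization step concrete, here is a minimal sketch of deduplicating on a normalized full-address string. The abbreviation table, the `|` separator, and the ZIP handling are assumptions for illustration; Enigma's actual normalization rules are not described in this article.

```python
import re

# Hypothetical abbreviation map; a production system would use a fuller table.
STREET_ABBREVIATIONS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "STREET": "ST", "DRIVE": "DR", "ROAD": "RD", "AVENUE": "AVE",
}


def _clean(part: str) -> str:
    """Uppercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", part.upper()).split())


def normalize_address(street1: str, city: str, state: str, zip_code: str) -> str:
    """Build a dedup key from street1, city, state, and five-digit ZIP."""
    street = " ".join(STREET_ABBREVIATIONS.get(tok, tok) for tok in _clean(street1).split())
    return "|".join([street, _clean(city), _clean(state), zip_code.strip()[:5]])


# Two spellings of the same site collapse to one key (illustrative input).
a = normalize_address("1209 North Orange Street", "Wilmington", "DE", "19801")
b = normalize_address("1209 N. Orange St", "wilmington", "de", "19801-1120")
print(a == b, a)
```

Because this key ignores suite numbers on a second address line, it can roll several tenants of one building into a single site, which is the over-roll-up risk noted above.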

Slice Slice Baby

Enigma — Thu, 13 Nov 2025 00:00:00 GMT

We examined thousands of pizza ratings and tracked their impact through revenue, foot traffic, and geography.

It’s a simple premise: rate pizza on a scale of 1 to 10, film it in 90 seconds, post it online. Launched by Barstool Sports founder and self-appointed “El Presidente” Dave Portnoy, OneBite Reviews has turned this formula into one of food media’s most influential platforms, with thousands of restaurants receiving numerical verdicts that reach millions of viewers (plus a private jet and army of trolls, according to the New York Times).

But influence and impact aren't the same thing. Does a high score translate to actual business outcomes? Or is it just momentary internet fame?

We matched OneBite's review database to Enigma’s business intelligence data (revenue transactions, foot traffic patterns, operating status) to measure what actually happens after a pizza place gets this high-profile review treatment. The results show how viral media creates measurable economic effects, and why some restaurants survive while others close.

The Value of a Top Review

The impact of impressing Portnoy is clear: restaurants scoring 8.0 or higher see revenue jump by an average of 90.2% in the six months following their review. That’s not a rounding error or a modest bump. It’s nearly doubling their business.

Even restaurants scoring in the 7.0-7.9 range—solidly above average—see a 17.1% revenue increase in the following six months. Across all scores, 60% of reviewed venues experience some level of revenue growth, averaging 30.8%. The timing is clear in our data: revenues tend to remain stable in the six months before a OneBite review, then spike immediately after the review video drops. The pattern repeats across hundreds of restaurants in different states, cities, and market conditions.

Foot traffic tells the same story. Daily customer counts jump by 68.9% on average after a OneBite review. A good score doesn’t just generate comments—it drives business. But this boost is not evenly distributed: only a slight majority of restaurants (54.3%) see more people walking through the door in the months that follow.

Geography: The Northeast vs. Everywhere Else

OneBite ratings have a clear geographic skew toward the Northeast, both in volume and in high scores. Connecticut leads with a 7.69 average score across 72 reviews. New Haven, famous for its coal-fired pies, averages 7.75, making it the highest-rated city with a meaningful sample size. The Northeast corridor forms a wall of excellence: Connecticut (7.69), New Jersey (7.47), Massachusetts (7.30). This Northeast emphasis also reflects Portnoy’s Boston origins, which show up as an unusual concentration and high evaluation of Massachusetts eateries.

Then there's the rest of the country. Kentucky averages 5.43. Alabama: 5.84. West Virginia: 5.87. The gap between Connecticut and Kentucky isn't marginal: it's a chasm of more than two points on a ten-point scale. But Kentucky's problems go deeper than low scores. It's also the most inconsistent state (standard deviation: 2.71). To go by these OneBite ratings, you're just as likely to find decent pizza as you are to find something that barely qualifies as food.

Meanwhile, Maryland and Arizona prove the Northeast doesn't have a monopoly on quality. Both states maintain high scores and consistency: reliably good pizza for states that rarely attract this reputation. Despite these few surprises, pizza quality concentrates in regions with generational knowledge, ingredient access, and cultural standards.

New York City: A Varied Pizza Landscape

OneBite has reviewed 78 venues in Brooklyn with an average score of 7.67. No other location in our dataset comes close to combining that much volume with such high quality. Plenty of cities have a handful of excellent shops. Plenty have a wide selection of average ones. Brooklyn is the only market that’s both majorly represented in OneBite reviews and also consistently good, with dozens of highly rated pizzerias.

Across the Verrazzano Bridge, Staten Island edges Brooklyn slightly in average score (7.73), but with less than a quarter of Brooklyn’s review count. If you’re hunting for impressive outliers, Staten Island is a solid bet for good pizza off the beaten path. But Manhattan is the inverse story: more volume, weaker scores. As the epicenter of the civilized universe, Manhattan has plenty of reviewed venues, but the ratings sag as you move into heavier tourist zones and high-rent neighborhoods where pizza is optimized for foot traffic and convenience. There’s a clear tradeoff in the numbers for Manhattan pizza: volume and ease of access over peak quality.

The Extremes: Perfect Tens and Absolute Zeros

Extremes often make the best stories. At the high end of OneBite ratings, Monte’s Restaurant in Lynn, Massachusetts achieved a perfect 10, suggesting that this small city north of Boston is a hidden mecca of pizza excellence. Trailing just behind, DeLucia's Brick Oven Pizza (New Jersey), Frank Pepe’s (New Haven), and Di Fara (Brooklyn) each earned 9.4 ratings. Then there are the lowest achievers. At the bottom, Café Muse (New York) and Blaze Pizza (Los Angeles) both received 0.0 scores. These are zeroes. Not “bad,” not “disappointing.” Zero.

In terms of general trends, the median score across all OneBite reviews is 7.3, meaning half of all pizza ratings fall below this threshold. The scale is genuinely harsh. A 7.0 isn’t mediocre; it’s a solid B in a system where C’s are common and F’s are plentiful. This distribution matters because it shows the scores have real variance. If everything clustered around 7.5, the ratings would be meaningless. Instead, the full scale is in use, from perfection to catastrophe. That said, plotting the ratings over time suggests that OneBite’s scores are both trending upward and concentrating in the good-not-great range.

The Real Value of Reviews

Our analysis isn’t about one reviewer's influence—it's about how information asymmetry works in local restaurant markets. Before viral food media outlets like OneBite, finding quality pizza required local knowledge, word-of-mouth, and maybe a little trial and error. Meanwhile, many high-quality restaurants in secondary markets languished in obscurity, while many bad restaurants survived on location and inertia.

Widespread access to reviews can ameliorate that friction. These reviews surface quality, punish mediocrity, and accelerate market efficiency. And the effect is measurable using Enigma’s business data: revenue spikes, foot traffic surges, and closure rates correlate with scores.

The bump that comes with a OneBite review is what you get when millions of people have reliable information about local product quality. Then we can use data gathered about those businesses to measure what happens next. Connecticut makes some of the absolute best pizza in America. Kentucky seems to make the worst. Brooklyn pizzerias are the most reliable across the board. And viral media clearly sells a lot of slices and pies.

Methodology

Our analysis is based on 1,868 OneBite pizza reviews matched to business intelligence data (revenue, foot traffic, operating status) from 2013-2025.

Non-pizza establishments (like McDonald’s and Panera Bread) have been filtered out along with data anomalies (such as businesses with strikingly low levels of card revenue, which are probably cash only).

We measured revenue impact by comparing 6-month averages before and after review dates.

Coverage: 38 states, 567 cities (foreign businesses, mostly in Italy and Canada, have been filtered out).
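
The before/after comparison described above can be sketched as follows. This is a minimal illustration, not the exact procedure used in the analysis: the monthly granularity, the exclusion of the review month itself, and the function names are assumptions, and the revenue figures are made up to reproduce a 90.2% lift.

```python
from datetime import date
from statistics import mean


def month_index(d: date) -> int:
    """Map a date to a running month number so months can be compared."""
    return d.year * 12 + d.month


def revenue_change_pct(monthly_revenue: dict, review_date: date, window: int = 6):
    """Percent change between mean revenue in the `window` months before
    and after the review month (the review month itself is excluded)."""
    r = month_index(review_date)
    before = [v for m, v in monthly_revenue.items() if r - window <= month_index(m) < r]
    after = [v for m, v in monthly_revenue.items() if r < month_index(m) <= r + window]
    if not before or not after:
        return None  # not enough history on one side of the review
    return (mean(after) - mean(before)) / mean(before) * 100


# Made-up monthly card revenue around a hypothetical July 2024 review.
toy = {date(2024, m, 1): 10_000 for m in range(1, 7)}
toy.update({date(2024, m, 1): 19_020 for m in range(8, 13)})
toy[date(2025, 1, 1)] = 19_020

print(round(revenue_change_pct(toy, date(2024, 7, 15)), 1))  # 90.2
```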

The Michelin Effect

Enigma — Thu, 30 Oct 2025 00:00:00 GMT

What 1,700+ Restaurants Reveal About Fine Dining Economics

In the mythology of fine dining, earning your first Michelin star changes everything. It's supposed to be the moment when a restaurant transcends from “excellent neighborhood spot” to “destination worthy of planning your entire weekend around.” The star should unlock pricing power, media attention, celebrity chef status, and months-long waitlists that insulate you from the typically brutal realities of restaurant economics.

But when Enigma analyzed transaction data from 936 Michelin-recognized restaurants across the United States — roughly half the establishments in major markets like New York, San Francisco, Los Angeles, Chicago, and Washington — a more nuanced picture emerged.

Among restaurants in our dataset, those with one Michelin star generate median annual revenues just 15-25% higher than restaurants with no stars at all.

This isn’t the dramatic leap you’d expect from one of the culinary world’s most coveted honors. The gap between starred and non-starred establishments is surprisingly compressed, revealing something fundamental about how prestige translates into revenue.

The Revenue Ladder (With a Surprising Middle)

Using 12-month card transaction data ending in July 2025, here’s what the revenue hierarchy looks like for the restaurants we can observe.

At the apex, three-star restaurants operate in a different economic reality. With median annual revenues of $3.5 million — and top performers like The French Laundry and Eleven Madison Park exceeding $10 million — these establishments have achieved the rarest feat in fine dining: combining ultra-premium pricing with sustained, global demand.

But three-star status is so scarce that it’s essentially a separate world. For the other 929 restaurants with some kind of Michelin status, the relevant question isn’t “how do I get to three stars?” It’s “what's the business case for pursuing even one star?”

The expected pattern holds at the top: two-star establishments generate $1.9 million in median annual revenue. But then something counterintuitive happens in the tiers where Michelin places restaurants they choose to recognize despite withholding a star rating.

One-star restaurants pull in $1.4 million annually. Meanwhile, Bib Gourmand establishments — recognized by Michelin for “good food at moderate prices” — earn a very close $1.2 million. That's just a 17% gap. And Michelin’s ‘Selected’ tier of restaurants, deemed notable but not star-worthy by inspectors, earn $1.1 million.

For an industry where a single star can define a chef's entire career, the financial premium is surprisingly modest. Here's the insight: the jump from one star to two stars matters more financially than it would matter for a Bib Gourmand or Selected restaurant to attain a single star.

The reason? Bib Gourmand restaurants serve 4-5x more customers than one-star restaurants. They compensate for lower prices with significantly higher volume, reaching annual revenues that approach — and in dozens of cases actually exceed — their starred counterparts.

For most restaurants operating below the ultra-premium tier, the path to revenue growth may have less to do with chasing stars and more to do with understanding their strategic positioning: volume versus pricing, accessibility versus exclusivity.

Geographic Reality: Location Shapes the Star Premium

The value of Michelin recognition varies significantly by market maturity.

In established Michelin markets (New York, San Francisco, Los Angeles, Chicago, Washington) — where the Guide has operated for many years — the fine dining ecosystem is sophisticated. Customers understand what “Bib Gourmand” and “Selected” mean, and actively seek them out. In our sample from just these cities:

One-star restaurants: $1.5M median revenue
Selected restaurants: $1.3M median revenue
Just a 15% gap

In secondary markets (everywhere else), star recognition provides clearer competitive differentiation. In our sample from these cities:

One-star restaurants: $1.8M median revenue
Selected restaurants: $1.1M median revenue
A more substantial 64% gap

When educated diners can distinguish between Selected, Bib Gourmand, and starred establishments, operational excellence matters as much as formal recognition. In less mature markets, the star itself carries more signaling value.
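
The percentage gaps quoted in this article follow from the rounded medians if the gap is measured relative to the lower figure, which is our assumption about how they were computed. A quick check:

```python
def pct_gap(higher: float, lower: float) -> float:
    """Gap between two medians, as a percentage of the lower one."""
    return (higher - lower) / lower * 100


# Medians in $ millions, as quoted in the article.
print(round(pct_gap(1.5, 1.3)))  # 15  (established markets: one-star vs. Selected)
print(round(pct_gap(1.8, 1.1)))  # 64  (secondary markets: one-star vs. Selected)
print(round(pct_gap(1.4, 1.2)))  # 17  (all markets: one-star vs. Bib Gourmand)
```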

What Revenue Data Can’t Tell Us: The Hidden Value of Stars

Revenue figures only capture part of the story. A one-star restaurant generating $1.4 million with 16 customers per day likely has vastly different economics than a Bib Gourmand generating $1.2 million with 72 customers per day.
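
A back-of-the-envelope calculation shows how far apart those two models sit. The sketch below assumes both restaurants trade 365 days a year, which overstates operating days for most fine dining rooms, so the per-customer figures are a floor and not a measured check average.

```python
def revenue_per_customer(annual_revenue: float, customers_per_day: float,
                         operating_days: int = 365) -> float:
    """Average card revenue per customer, assuming a fixed number of operating days."""
    return annual_revenue / (customers_per_day * operating_days)


one_star = revenue_per_customer(1_400_000, 16)
bib_gourmand = revenue_per_customer(1_200_000, 72)

print(f"${one_star:.0f} vs ${bib_gourmand:.0f}")  # $240 vs $46
```

On these assumptions the one-star restaurant earns roughly five times as much per customer, while the Bib Gourmand makes up most of the difference on volume.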

Operating leverage: Fewer customers means lower labor costs, simpler logistics, and potentially higher margins per customer served. More customers means economies of scale in purchasing, but higher operational complexity.

Demand stability: One-star restaurants often maintain multi-month waitlists, effectively eliminating customer acquisition costs and smoothing seasonal volatility. Non-starred restaurants may need continuous marketing spend to fill seats.

Talent attraction: Michelin stars attract top culinary talent and provide career validation that extends beyond revenue. These benefits matter for long-term sustainability and professional satisfaction but don't appear in transaction data.

Optionality: A chef with a Michelin star unlocks opportunities that revenue alone doesn't: cookbook deals, consulting projects, media appearances, and better leverage when negotiating partnerships or expansions.

Our data measures revenue. It cannot measure margin, stability, culture, or career trajectory — all of which may justify the pursuit of a star even without proportional revenue gains.

The Business Case for Pursuing a Star

So should a restaurant pursue a Michelin star, or just shoot for the lower tiers of Michelin recognition? The answer depends on what you’re optimizing for.

If you’re optimizing for revenue maximization, then operational scale, customer volume, and strategic positioning may matter more than critical acclaim. Many high-revenue Bib Gourmand and Selected restaurants have built remarkably successful businesses without stars.

Then again, if you’re optimizing for margin and creative control, then the one-star model — smaller scale, higher prices, chef-driven menus — might generate less absolute revenue, but could offer better margins, more creative autonomy, and less operational complexity.

Yet if you’re optimizing for prestige and career trajectory, climbing the ladder of Michelin recognition opens doors beyond high revenue and hometown adoration. The first star establishes serious credibility. The second star puts you in rarefied air. The third star is legendary.

The restaurants that succeed at both — culinary excellence and business performance — are rare precisely because those two goals often require different trade-offs. For everyone else, the choice is real: chase the star and accept the operational constraints, or else build a thriving business without it.

Methodology

Our Dataset

This analysis examines credit card transaction data from 936 Michelin-recognized restaurants in the United States, representing approximately 50% of establishments in major markets covered by the Michelin Guide. The data excludes cash (typically <5% of revenue at fine dining establishments), catering, events, merchandise, and licensing. Matching was performed via name, location, and business characteristics through Enigma's commercial transaction intelligence platform. Our sample includes:

534 Selected restaurants
283 Bib Gourmand restaurants
102 One Star restaurants
10 Two Star restaurants
7 Three Star restaurants

Data covers the 12-month period ending July 31, 2025, derived from card transactions matched to Michelin Guide listings.

What We Can Describe

Revenue patterns among restaurants in our matched dataset
Operational differences between volume-driven and premium models
Geographic concentration in major US markets
Competitive dynamics where we have sufficient sample sizes

What We Cannot Describe

The complete Michelin ecosystem — we’ve observed a little over half of US restaurants with some form of Michelin recognition.
Causal effects of earning stars — we have cross-sectional data, not before/after comparisons
Profitability — revenue doesn't reveal costs or margins
Future trends — this is a point-in-time snapshot, not a longitudinal study

Any patterns we observe reflect the restaurants we can see, which may differ from those we cannot. Statistical reliability varies by category: findings for Selected, Bib Gourmand, and One Star restaurants are robust (n=102-534), while Two Star (n=10) and Three Star (n=7) findings are suggestive but limited by small sample sizes.

Spirits of Retail Past

Enigma — Thu, 09 Oct 2025 00:00:00 GMT

Halloween Stores are the Graveyards of American Commerce

Happy fall, and welcome to the Halloween season! For today’s dive into American commerce, we’re looking at how seasonal pop-up stores have become a kind of zombie feeding on dead real estate. Every autumn, this dark ritual unfolds across American strip malls. Orange and black banners appear overnight in vacant storefronts. Skeletons and witches materialize in windows that have been pitch dark for months. Halloween stores have arrived, ready to profit from retail's slow-motion collapse.

Our analysis of six years of Spirit Halloween location data (2019-2025) shows a sophisticated operation that has turned so many retail failures across the country into a business model that works exceptionally well for a few months a year. The numbers tell a story of strategic opportunism: 78% of recent Spirit Halloween stores are located in former retail spaces, transforming the corpses of bankrupt chains into a multi-hundred-million-dollar seasonal empire.

Enigma has peered into an unholy dataset revealing the dark economics of retail necromancy, measured in square footage and failed leases, told through the lens of the country's most successful Halloween pop-up chain.

The Rise of the Retail Vampire

Spirit Halloween has grown dramatically over the past six years, expanding from a few hundred seasonal locations to a network of over 1,400 stores by 2024. But the real story isn’t growth — it’s how they’re growing. Unlike traditional retailers building new infrastructure, Spirit Halloween has mastered the art of opportunistic real estate acquisition.

The pattern is unmistakable: Spirit Halloween scouts for recently-closed retail locations with the right square footage, secures a short-term lease at distressed rates, and transforms the space within weeks. They’re not competing in the retail real estate market, but rather reanimating necrotic storefronts as an undead legion of commerce.

Spirit Halloween wasn’t always alone in this space. Halloween City, once Spirit Halloween's primary competitor, operated 254 stores at its peak in 2019. By 2024, Halloween City had completely disappeared. The data shows a brutal consolidation: as Spirit Halloween expanded, Halloween City contracted at an almost identical rate. Falling from 254 stores to zero in just four years, Halloween City’s demise tells its own story about the winner-take-all dynamics of seasonal retail.

The Retail Graveyard

Look at the names on Spirit Halloween's real estate portfolio and you're reading a roll call of retail failures. Sears and Bed Bath & Beyond lead the list, with hundreds of former locations that are now Spirit Halloween stores. Former Toys “R” Us locations, once massive commercial destinations for families, now sell animatronic zombies instead of action figures. Party City, itself struggling with bankruptcy, has seen multiple locations flipped to its seasonal competitor.

The top victims tell a familiar story: they are big-box retailers that couldn’t adapt to e-commerce, specialty stores that lost relevance, and chain restaurants that overexpanded. Our analysis of 2023-2025 Spirit Halloween locations reveals they’ve absorbed space once inhabited by at least 50 different retail chains.

Some brands appear repeatedly: multiple former Kmart locations (themselves victims of an earlier retail apocalypse), numerous defunct Sports Authority stores, and a surprising number of former supermarkets that couldn’t compete with Walmart and Amazon’s entry into the grocery store market. Each represents not just a failed business, but a failed bet on consumer behavior, physical retail, or simple overexpansion.

The economics are brutal but simple: Spirit Halloween pays a fraction of what these spaces once commanded for year-round leases, often securing three-month deals at deeply distressed rates. Landlords accept these terms because vacant properties generate zero revenue, create maintenance liabilities, and signal distress to other tenants. Spirit Halloween’s arrival turns a liability into a modicum of cash flow, even if it’s only temporary.

Strategic Optimization: When the Spirit Moves

Spirit Halloween doesn’t just expand—it constantly optimizes its portfolio of locations. Using store tracking data across multiple years, Enigma can measure exactly what happens when Spirit Halloween closes a location and opens a new one nearby. The patterns reveal sophisticated real estate strategy. A full 25% of Spirit Halloween closures aren’t really closures at all — they’re relocations within the same shopping center, literally moving from one dead retailer to a neighboring one. Another 47% are local optimizations, moving within 5 miles to upgrade to superior locations as better retail corpses become available.

The median migration distance is just 1.7 miles, showing that Spirit Halloween rarely abandons a market entirely. Instead, they're constantly trading up, moving from adequate spaces to ever more optimal ones as ongoing retail failures create new opportunities. According to our data, only 3% of store movements represent true market exits.
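
One way to sketch this kind of relocation analysis is to measure the great-circle distance between a closed store and its nearest replacement, then bucket the result. The 0.25-mile cutoff for "same shopping center", the treatment of a missing replacement as a market exit, and the function names are all assumptions for illustration; the 5-mile cutoff is the one quoted above.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def classify_move(distance_miles, same_center_max=0.25, local_max=5.0):
    """Bucket a closure by the distance to its replacement store (None = no replacement)."""
    if distance_miles is None:
        return "market_exit"
    if distance_miles <= same_center_max:
        return "same_center"
    if distance_miles <= local_max:
        return "local_optimization"
    return "longer_move"


# Made-up migration distances in miles; None means no replacement was found.
moves = [0.1, 1.7, 4.0, None]
print([classify_move(d) for d in moves])
print(median(d for d in moves if d is not None))  # 1.7
```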

This reveals something important: Spirit Halloween isn’t just passively filling vacant retail. They're actively curating a portfolio, abandoning less desirable locations as soon as better options appear. This amounts to retail real estate arbitrage, executed at scale, with the holiday calendar providing ample time to maneuver for these annual location shuffles.

The Geography of Retail Decay

The retail apocalypse isn’t uniformly distributed, and neither is Spirit Halloween. Our state-level analysis reveals where retail consolidation hits hardest. States with the highest concentration of Spirit Halloween stores in former retail spaces — often exceeding 80% — tend to be those that saw the most aggressive big-box retail expansion during the 1990s and 2000s, followed by the sharpest contractions in the 2010s.

California and Texas lead in absolute store count with 100+ Spirit Halloween locations, but the percentage of those in former retail spaces varies dramatically by market. Urban markets with diverse retail ecosystems show lower percentages, while suburban markets that were heavily dependent on now-defunct chains show the highest rates of retail repurposing.

And the pattern isn’t random. Spirit Halloween concentrates in middle-income suburban areas — precisely the markets that supported the big-box retail explosion of the 1990s and now suffer most from its collapse. These are communities built around car-dependent shopping centers, where retail square footage far exceeds sustainable demand in an e-commerce era.

What This Means

Spirit Halloween’s business model is both brilliant and deeply revealing of the American retail landscape. By operating only during the most profitable season and leveraging distressed real estate, they’ve found a way to profit from what defeats others: the overcapacity and declining foot traffic that plague year-round retailers.

But Spirit Halloween isn’t causing the retail apocalypse. In a way, they’re documenting this collapse with their shifting nationwide footprint. Every new store location is a marker of retail failure, a data point in the ongoing transformation of American commercial real estate. The 1,400+ Spirit Halloween stores operating in 2024 represent so many stories of over-expansion, changing consumer behavior, and the brutal economics of physical retail in an Amazon-dominated world.

The important question is what happens next. Spirit Halloween can only expand as long as retail keeps failing. If American retail ever stabilizes — that is, if the pace of store closure slows — Spirit Halloween’s growth model would also face fundamental limits.

For now, though, the retail apocalypse continues, and every October, Spirit Halloween returns to remind us exactly which retailers didn’t survive another year. The skeletons in those windows aren’t just Halloween decorations — they're the lingering ghosts of American retail.

For the nerds: follow along with Enigma’s data analysis for this article.

https://newsletter-interactive-2025-10-spirit-halloween-ff54de.gitlab.io/

The Hidden Economics of Campus Life, Part 2

Enigma — Thu, 25 Sep 2025 00:00:00 GMT

Leaders, Outliers, and Our Methodology

Building on Part 1 of this series, our analysis of campus commerce uncovers which businesses thrive in the collegiate ecosystem, from the predictable dominance of fast food chains to the surprising revenue power of campus laundromats and grocery stores. Digging into transaction data from thousands of campus businesses reveals a sharp economic divide between elite institutions and the rest of the pack, with per-student spending at campus businesses nearly 11 times higher at elite private universities than at their public counterparts. In this installment, we focus on the institutions and industries that stand at the extreme edges of our campus spending data.

Campus Leaders

Stanford University tops the total spending ranks with $937M in annual credit card purchases at businesses based on its 8,180-acre campus (among the largest in the United States). The sheer scale of its campus economy towers over most other college towns: Stanford businesses average $78.38 per transaction, more than double the national campus average of $37.75. Dartmouth College, another geographically isolated campus, also shows consistently elevated transaction prices averaging $57.01. These remote academic institutions effectively operate as economic islands where limited competition allows campus businesses to corner a captive audience of students and staff.

The UCLA campus leads annual parking revenue with $14.63M, amounting to an average of $2.44M per parking lot across their six facilities in the Westwood neighborhood of Los Angeles. UCLA’s main "Parking and Information Kiosk" alone processes 910,851 transactions annually (2,495 per day). This massive volume reflects the car-dependent culture of Los Angeles, where public transit options remain limited compared to other major cities. Among all universities, UCLA's parking revenue is 15 times the national average ($0.97M). The parking operation generates more annual transactions than most campuses see across all businesses combined, with an average transaction of just $16 suggesting heavy reliance on hourly and daily passes rather than semester permits.

Food and Restaurant Spending

Pizzerias on the Fordham University campus lead the nation with an average of 23 credit card transactions per student each year. Including cash purchases of pizza slices around their main campus in the Bronx could easily double this figure, and delivery orders from off-campus pizzerias would likely push it higher still. Across the top ten schools for pizza orders nationwide (that is, the biggest pizza spenders), other campus pizza joints average just 13.9 orders per student, or about 60% of the number at Fordham. As pizza-loving New Yorkers, Enigma salutes you, Fordham University.

Another leader in sheer volume of fast food spending, the UCLA community spent $6.3M on pizza and $9M on burgers last year, and 95% of the burger sales were at In-N-Out alone. California also leads the nation in total pizza spending with over $36M statewide. Yet despite the stereotype of college life fueled by late-night pizza deliveries, even this massive figure represents just a fraction of total food spending on the UCLA campus. The highest-earning grocery store on any college campus last year was UCLA’s Whole Foods, whose revenues were nearly $35M according to our data on card transactions.

Campus grocery stores, in general, became absolute goldmines during the Covid pandemic and remain highly lucrative even today. Analyzing 94 grocery stores across 65 different campuses reveals average revenues of $3.9M per location. Alongside UCLA, the Dartmouth grocery store stands as a massive outlier, generating $27.4M annually, likely benefiting from a rural New Hampshire setting where it serves as the primary food source for thousands of students. And because many customers at the UCLA Whole Foods may have no connection to the university, Dartmouth's figure is arguably the more remarkable of the two.

Overall, the top 10% of campus restaurants generate 46% of food revenue, suggesting that student dining follows a heavily skewed, power-law-like distribution where a handful of popular establishments dominate. National fast food chains represent the apex of this ecosystem, with individual Chick-fil-A locations sometimes earning over $8M in annual revenue while comprising a full 25% of total campus dining cashflow. Pizzerias, by contrast, earn just 4.1% of total campus dining revenue nationwide, a surprisingly low figure that challenges stereotypes about pizza-fueled college life.
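A concentration figure like "the top 10% generate 46%" takes only a few lines to compute. The sketch below is illustrative only: the function name and the ten sample revenues are made up, and it is not the code behind the numbers above.

```python
def top_share(revenues, top_fraction=0.10):
    """Share of total revenue earned by the top `top_fraction` of locations."""
    if not revenues:
        raise ValueError("revenues must be non-empty")
    ranked = sorted(revenues, reverse=True)
    # Always count at least one location, even for very small samples.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical annual revenues (in $M) for ten campus restaurants
sample = [8.0, 2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.5, 0.4]
share = top_share(sample)
print(f"Top 10% of locations earn {share:.0%} of revenue")
```

In this toy sample a single $8M location carries nearly half of all revenue, which is the same shape of skew described above.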

On the other hand, caffeine dependence is a college stereotype that holds up: Coffee shops near campus libraries see 150% revenue increases during finals, with transaction volume data confirming that the end-of-term crunch is inevitably a caffeine-fueled enterprise. These strategic locations capitalize on proximity to study spaces, turning academic stress into revenue spikes.

Other Surprises

Laundromats stand out in our dataset as reliable cash cows, with campus laundry businesses averaging $1.2M annually and often outperforming restaurants in per-student revenue. While these campus laundromats generate only $2.76 per student annually, they do so with minimal labor costs, no inventory beyond detergent, and self-service operations that run 24/7. The captive market and essential nature of laundry create a recession-proof business model that thrives regardless of broader economic conditions.

Likewise, despite the industry-wide decline of campus bookstores (see the first installment of this series), several flagship institutions defy the trend through strategic pivots to branded merchandise. Ohio State leads nationally with $7.63M in campus bookstore revenue, while the University of Washington ($7.47M) and University of Virginia ($7.02M) round out the top three, each generating roughly 27 to 29 times the industry median of $0.26M. For these businesses to thrive as their peers decline suggests that branded merchandise, not textbooks, now drives revenue at successful campus bookstores. The data bears this out: thriving university bookstores show average transactions of $35-73, squarely in the merchandise range, rather than the $200-500 typical of textbook-heavy stores. Only 2 of 47 major bookstores still focus primarily on textbooks, with the "merchandise-heavy" category (transactions under $50) generating $91.4M collectively versus just $3.3M for textbook-level transactions.

Our data also reveals a dramatic economic chasm in spending money between student populations: elite private schools average $11,456 in on-campus spending per student, while public universities average just $1,047. This points to fundamental disparities in student wealth, with elite school students spending more on coffee and convenience items than many public school students spend on textbooks. The gap is especially stark at the extremes: the bottom quartile of public university campuses sees an average of just $413 in spending per student each year.

Our Methodology

On that note, let’s talk a bit about where our data comes from, how we did this work, and what we mean by “on-campus.”

Enigma profiles every business in the US, breaking out its individual operating locations, then assembling monthly aggregate data on their payment card transaction revenue (credit, debit, EBT, FSA, HSA). Sadly for our campus spending analysis, we don’t yet have visibility into “campus cash” or “dining bucks.”

In order to analyze on-campus spending patterns, we needed a way to figure out which businesses we should count as being on a college or university campus. So first, we needed campus maps. The US Department of Homeland Security used to publish a dataset with the geographic boundaries of colleges and universities, but public access to this platform was disabled on August 26th, 2025. The Excel file with new location data was only online for an additional two weeks, and even then many of the datasets didn’t make it to their new intended storage locations (note to hungry journalists). Luckily, you can still grab a copy of the 2024 campus data from here.

But back to how we used this campus geodata. Sometimes the DHS dataset treats a campus as a large enclosed area encompassing every university building. Other times, the campus maps will outline many individual buildings constituting an urban campus like NYU, where university buildings are interspersed throughout a dense urban environment like Greenwich Village. A ‘campus’ on this map could even be a single building, like a small professional school that rents out commercial real estate.

To figure out which businesses we should count as “on-campus,” we matched our geographic dataset of campus boundaries with a “spatial index” called H3 to connect these boundaries with all the businesses in Enigma’s database through a geographic lattice of hexagons and pentagons. This allowed us to filter for businesses located on campus cells, effectively no more than 132 meters away from the campus boundaries at their furthest (for the nerds: H3 resolution 10). This way we would see everything that’s clearly within the campus, as well as the convenience stores across the street, but not the bar a quarter mile (~400 m) away. Granted, this system will also pick up false positives: say, a high-end, rotating sushi bar across the street from a for-profit university renting a room in an office tower. As a serviceable means of cutting out this noise and focusing on "traditional" 4-year residential campuses with meaningful on-campus economies, we filtered our dataset to have a 10K minimum enrollment and minimum campus revenue of $10M for urban universities, reducing our initial dataset from about 90K to 14K businesses.
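The cell-matching idea can be sketched without any dependencies. To be clear about the shortcut: the sketch below swaps H3's hexagons for a plain square latitude/longitude grid, which is a simplification and not what we ran. The cell size, function names, coordinates, and business names are all made up for illustration.

```python
from math import floor

# Assumed cell size in degrees (roughly 100 m of latitude); not an H3 value.
CELL_DEG = 0.001


def cell_id(lat, lon, size=CELL_DEG):
    """Map a point to the (row, col) of the square grid cell containing it."""
    return (floor(lat / size), floor(lon / size))


def campus_cells(points, size=CELL_DEG):
    """Set of grid cells touched by points sampled from a campus footprint."""
    return {cell_id(lat, lon, size) for lat, lon in points}


def on_campus(businesses, cells, size=CELL_DEG):
    """Keep the names of businesses whose grid cell is a campus cell."""
    return [
        name
        for name, lat, lon in businesses
        if cell_id(lat, lon, size) in cells
    ]


# Hypothetical campus footprint, sampled as two points
cells = campus_cells([(40.0005, -75.0005), (40.0015, -75.0005)])

# Hypothetical businesses: (name, latitude, longitude)
businesses = [
    ("Campus Cafe", 40.0007, -75.0003),   # shares a cell with the campus
    ("Corner Store", 40.0012, -75.0008),  # shares a cell with the campus
    ("Faraway Bar", 40.0050, -75.0005),   # several cells to the north
]

kept = on_campus(businesses, cells)
print(kept)
```

The appeal of a cell index is that the join becomes a set lookup instead of a point-in-polygon test for every business against every campus. A real implementation would swap `cell_id` for the H3 library's point-to-cell lookup at resolution 10.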

As noted, the greatest limitation of this analysis is that our payment card data necessarily excludes the private, internal transaction systems used on campus. Many colleges and universities issue “campus cash” or “dining bucks” that deduct from student commissary accounts. While our analysis draws on data for campus spending that happens through payment cards, the true stories of “campus bucks” shall remain hidden in the vast catacombs of academic secrecy.

KYB vs. KYC: Two Different Paths to Trust in Customer Onboarding

Enigma — Mon, 22 Sep 2025 00:00:00 GMT

Introduction

In today’s financial ecosystem, compliance and trust are non-negotiable. Every transaction, whether made by an individual consumer or a business entity, requires confidence that the counterparty is who they claim to be.

That’s where Know Your Customer (KYC) and Know Your Business (KYB) come in. These two compliance processes are often mentioned together, but they serve different functions. KYC verifies individuals. KYB verifies businesses—which themselves may be corporations, LLCs, partnerships, or even sole proprietors operating under their own names.

Both processes are vital. KYC is the regulatory foundation for consumer-facing businesses, while KYB ensures that organizations can confidently transact with other businesses. Together, they form a comprehensive framework for reducing fraud, managing risk, and staying compliant.

KYC 101: Why Compliance Starts with Individuals

Know Your Customer (KYC) is designed to verify an individual’s identity and assess risk before they are onboarded. It is a key requirement under anti-money laundering (AML) and counter-terrorist financing (CTF) regulations, which follow international standards set by the Financial Action Task Force (FATF) and are implemented through laws such as the Bank Secrecy Act (BSA) in the U.S. and the EU’s AML directives.

The goals of KYC are threefold:

Prevent financial crime — stopping money laundering, terrorism financing, and fraud.
Ensure regulatory compliance — meeting obligations under AML and CTF laws.
Build trust in the financial system — ensuring that only legitimate individuals can access services.

Typical KYC Process

KYC generally requires:

Identity verification: A government-issued ID such as a passport or driver’s license.
Proof of address: Utility bills or bank statements.
Watchlist and sanctions screening: Ensuring the individual is not flagged by OFAC, Interpol, or other authorities.
Biometric verification: Liveness detection and facial recognition are increasingly standard to prevent stolen ID use.

Over the last decade, KYC has become increasingly sophisticated. Providers such as Trulioo, Onfido, Jumio, and Socure have emerged as leaders, leveraging AI and machine learning to verify identities in real time. This innovation has transformed KYC from a manual, paper-heavy process into a near-instant digital experience.

But KYC only covers individuals. Businesses require something deeper.

Why KYB Goes Further: Verifying Businesses

When your customer isn’t just a person but a business, KYC alone isn’t enough. A business is more than just a name—it is often a legal entity with its own registration, tax obligations, ownership structure, and potential risk exposure.

That’s where Know Your Business (KYB) comes in. KYB extends compliance to business entities by verifying not only the company itself but also the individuals behind it, such as beneficial owners and key executives.

The Core KYB Questions

KYB typically asks:

Is the business registered and active? Validation through state or national registries.
Does the Taxpayer Identification Number (TIN) match IRS or equivalent tax authority records?
Who are the ultimate beneficial owners (UBOs)? Identifying key owners and other individuals with significant control.
Does the business or its owners appear on sanctions or adverse media lists?
Does the industry represent heightened risk? For example, a company may want to know if its customer operates money services, crypto exchanges, gambling, or adult content.

Middesk’s KYB 101 makes clear that this is not a simple matter of checking a single ID. Instead, it’s about piecing together fragmented and often inconsistent data across multiple sources.

KYC vs. KYB: Two Distinct Compliance Paths

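While KYC and KYB are both compliance processes, they differ in important ways: KYC verifies a person against identity documents, while KYB verifies an entity against registries, tax records, and ownership data.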

The key takeaway is that KYB is not “better” or “more advanced” than KYC. Instead, they are different. KYC ensures that individuals are legitimate. KYB ensures that businesses—including those run by individuals—are legitimate. Together, they create a complete compliance foundation.

Why KYB is Particularly Hard in the U.S.

Unlike KYC, where most countries have centralized identity documents, KYB in the U.S. faces unique structural challenges:

Decentralized Registries: In the U.S., legal entities are created and governed by each state separately. Each state maintains its own corporate records with no nationwide equivalent. Access, quality, and update frequency vary widely.
Multiple Naming Conventions: Businesses can operate under legal names, DBAs (doing business as), and brand names. Matching across them is non-trivial.
Ownership Transparency Gaps: Until very recently, beneficial ownership information was limited and inconsistent. The Corporate Transparency Act (CTA) aimed to improve this, but implementation is controversial and ongoing.
High Volume of Small Businesses: The U.S. has over 33 million small businesses, and private businesses in the U.S. generally do not publish sophisticated or standardized public filings.

These challenges make business verification slower, costlier, and riskier compared to consumer onboarding.

The Enigma Difference in KYB

Traditional KYB providers rely on black-box risk scores, designed to be interpreted by analysts in manual workflows. Enigma takes a more transparent and automated approach:

Granular, Ground-Truth Data: Instead of abstract scores, Enigma surfaces the raw facts, including corporate registrations, TIN validation, ownership, addresses, and watchlist status.
Speed and Automation: Enigma’s KYB API delivers verification in under three seconds by resolving fragmented records into confident matches.
Transparency and Explainability: Every output is paired with reasoning (e.g., “address exact match with Secretary of State filing”) so compliance teams can understand and trust the result.
Coverage Beyond Corporations: Enigma verifies not only corporations and LLCs but also sole proprietors, partnerships, and small businesses, segments often underserved by legacy providers. Unlike traditional KYB providers, Enigma can verify real sole props and microbusinesses in the US through trusted sources beyond Secretary of State records.

This approach helps reduce onboarding friction while meeting stringent compliance standards.

KYC and KYB Together: The Full Picture

Crucially, KYC and KYB often overlap. A business entity might pass KYB checks, but regulators also require person-level verification of its beneficial owners. In practice:

Sole Proprietors → Require both personal KYC and business KYB.
SMBs and Corporations → Require KYB on the entity + KYC on executives/owners.

This layered approach ensures both the business and the people behind it are legitimate.
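As a rough illustration of the layering described above, the sketch below lists which checks an onboarding flow might queue for a given applicant. The enum, function name, and sample names are hypothetical. It is not Enigma's API, and real programs will have more entity types and jurisdiction-specific rules.

```python
from enum import Enum


class EntityType(Enum):
    SOLE_PROPRIETOR = "sole_proprietor"
    SMB = "smb"
    CORPORATION = "corporation"


def required_checks(entity_type, business_name, people):
    """Return (check, subject) pairs: KYB on the entity, KYC on each person.

    `people` is the proprietor for a sole proprietorship, or the
    executives and beneficial owners for an SMB or corporation.
    """
    if not people:
        raise ValueError("at least one person must be verified")
    if entity_type is EntityType.SOLE_PROPRIETOR and len(people) != 1:
        raise ValueError("a sole proprietorship has exactly one owner")
    checks = [("KYB", business_name)]
    checks += [("KYC", person) for person in people]
    return checks


# Hypothetical applicants
sole_prop = required_checks(
    EntityType.SOLE_PROPRIETOR, "Jane Doe Plumbing", ["Jane Doe"]
)
corp = required_checks(
    EntityType.CORPORATION, "Acme Widgets Inc", ["A. Owner", "B. Officer"]
)
print(sole_prop)
print(corp)
```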

The combination is essential in preventing common risks:

Fake businesses used as fronts for money laundering.
Legitimate businesses controlled by sanctioned or high-risk individuals.
Individuals misrepresenting themselves as businesses to bypass fraud controls.

By integrating KYC and KYB together, financial platforms build a holistic defense against fraud and non-compliance.

Achieving Best-in-Class KYB Coverage While Reducing Manual EDD Reviews

The Enigma KYB API integrates with orchestration platforms like Alloy and Taktile, helping clients such as Wisetack and Chase build cost-efficient, multi-vendor compliance workflows. One payment processor cut per-call costs by 60% by using Enigma first for Secretary of State (SoS) matches, while improving approval rates with trusted sources beyond SoS filings.

Yet even with strong KYC + KYB strategies, manual reviews for Enhanced Due Diligence (EDD) remain inevitable. Compliance analysts often spend 10–30 minutes per case resolving aliases, validating ownership, or digging through fragmented government records. Enigma’s Model Context Protocol (MCP) server changes that.

Enigma MCP anchors reviews in a ground-truth business identity graph built from billions of federal, state, and municipal filings. It pre-resolves variations (e.g., DBAs vs. legal names), surfaces ownership hierarchies, and provides structured, source-attributed summaries. This reduces research time to minutes—or even seconds—while increasing confidence in match accuracy.

Some clients are now layering programmatic agents on top of MCP, automating straightforward reviews and escalating only complex exceptions, enabling compliance teams to scale without sacrificing rigor.

Conclusion

KYC and KYB serve different but complementary roles in compliance. KYC ensures that individuals are legitimate; KYB ensures that businesses are legitimate. Both are critical for reducing fraud, managing risk, and meeting regulatory requirements.

While KYC is now a relatively mature and standardized process, KYB remains fragmented and challenging—especially in the U.S., where state registries and ownership data are inconsistent. That’s why KYB solutions require not just data, but intelligent resolution, automation, and transparency.

Enigma is building the KYB infrastructure to meet this challenge. With instant verification, granular data, and explainable results, we help platforms onboard more good businesses faster—while maintaining the highest standards of compliance.

In a digital-first world, the winners will be those who can seamlessly verify both people and businesses. At Enigma, we’re making that possible.

The Hidden Economics of Campus Life, Part 1

Enigma — Fri, 12 Sep 2025 00:00:00 GMT

What $8 Billion in College Spending Reveals

Step onto any American college campus and you'll encounter a carefully orchestrated economy, one that extracts extraordinary levels of spending from its captive audience of students, staff, and faculty. It's a marketplace where a single textbook can cost more than a week's groceries, where parking spaces command luxury hotel prices, and where Sunday laundromats hum with the collective procrastination of thousands. But beyond these familiar college tropes lies a deeper story—one of corporate consolidation, digital disruption, and the surprising resilience of certain business models that have learned to surf the predictable waves of the academic calendar.

We analyzed transaction data from over 70,000 businesses operating within the geographic boundaries of American college and university campuses, tracking billions of dollars in spending across thousands of institutions. What emerged wasn't just a portrait of campus consumption habits, but a revealing look at how traditional institutions are crumbling while new economic patterns take their place.

The Great Bookstore Collapse

Campus bookstores have been bleeding revenue for years, but the pandemic accelerated what was already a terminal decline. The numbers are stark: bookstore revenue has fallen about 60% from its peak, with 2024 revenues remaining 25% below pre-pandemic levels despite students returning to campus. The median campus bookstore saw revenue decline by 52% in 2020 alone—and most have never recovered.

August remains the one bright spot, still generating about 15% of annual bookstore revenue as families load up on textbooks and college-branded merchandise. But even this back-to-school surge can't mask the fundamental problem: students just don't need campus bookstores the way they once did. Along with declining revenue, average transaction sizes have increased from $75 in 2017 to over $95 in 2024, suggesting that students use the campus bookstore only for the most essential purchases, like required course materials that can't be found elsewhere.

The data reveals three critical patterns. First, the COVID pandemic wasn't a temporary disruption but an acceleration of existing trends. When campuses closed in March 2020, bookstore revenues didn't just dip—they cratered, falling by more than 60% at many institutions. Second, the Amazon effect is real and measurable. While some bookstores report 20% or more of revenue coming from online sales, they're not capturing the shift to digital. Students buy from Amazon, Chegg, or directly from publishers, habits that solidified during the pandemic and show no signs of reversing.

Third, and perhaps most telling, is the extreme seasonality that reveals bookstores' vulnerability. The typical campus bookstore sees 5-10x revenue swings between peak and trough months. Some campuses experience 15x or greater seasonal variation—a feast-or-famine cycle that makes sustainable operations nearly impossible. When your entire business model depends on a few weeks in August and January, you're not running a retail operation, but a seasonal pop-up that happens to stay open year-round.
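The seasonality measures used here are simple ratios. As a minimal sketch (the function names and the twelve monthly figures below are invented for illustration, not drawn from our dataset), the peak-to-trough swing and a single month's share of annual revenue could be computed like this:

```python
def peak_to_trough(monthly_revenue):
    """Ratio of the best month to the worst month over one year."""
    if len(monthly_revenue) != 12:
        raise ValueError("expected 12 monthly values")
    trough = min(monthly_revenue)
    if trough <= 0:
        raise ValueError("monthly revenue must be positive")
    return max(monthly_revenue) / trough


def month_share(monthly_revenue, month_index):
    """Fraction of annual revenue earned in one month (0 = January)."""
    return monthly_revenue[month_index] / sum(monthly_revenue)


# Hypothetical bookstore revenue by month, January to December (in $K)
bookstore = [150, 70, 60, 60, 70, 40, 25, 150, 130, 95, 80, 70]

swing = peak_to_trough(bookstore)
august = month_share(bookstore, 7)
print(f"peak-to-trough swing: {swing:.1f}x")
print(f"August share of annual revenue: {august:.0%}")
```

This toy store swings 6x between its best and worst months and earns 15% of its year in August, numbers picked to sit inside the ranges described in this section.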

Commercial retailers once thought they could fill the campus bookstore gap. One of the last remaining national brick-and-mortar book retailers operates more than 300 campus locations, all intended to bring efficiency and scale to this struggling sector. Yet our data shows these chain-operated campus bookstores declining just as fast as independents, down 35% from 2017 to 2024. The chain stores do generate 2-3x more revenue than independent bookstores on average, but they're getting pulled down the same whirlpool. The corporate consolidation that was supposed to save campus bookstores has only managed their decline more efficiently.

Small rural campuses with limited off-campus alternatives maintain some bookstore vitality, while urban campuses with multiple retail options have seen bookstores essentially disappear as economic agents. It's a bifurcated future for the campus bookstore: monopoly or obsolescence, with little middle ground.

Campus Dining and its Captive Market

While bookstores collapse, corporate restaurant chains have a solid foothold in campus life. A handful of brands dominate the landscape: Starbucks operates on 280+ campuses, Subway on nearly as many. Together, the top 20 chain restaurants pull in about $1.45 billion in campus revenue.

The concentration is staggering. Our data shows Chick-fil-A generating over $500 million annually from campus locations alone. McDonald’s, Starbucks, Chipotle, Panda Express, and Dunkin' each pull in hundreds of millions more. These chains have cornered the student market and mastered the academic calendar, adjusting operations and staffing to match the predictable ebb and flow of student spending.

The data reveals which chains are winning and losing the battle for student dollars. Rising stars like Chick-fil-A and Chipotle show 20%+ annual growth, expanding aggressively to new campuses. Meanwhile, despite still-impressive market shares, traditional fast food giants like McDonald's and Taco Bell are now losing ground to fast-casual options that better match student preferences for perceived quality and customization.

Regional patterns tell their own story, and competition dynamics reveal corporate strategy in action. In-N-Out Burger has earned customer loyalty in California, just as Whataburger is a local favorite in Texas. Meanwhile, Chick-fil-A dominates campus dining in the South with 50% of the fast food market, operating on 111 campuses with regional revenues of $396 million. On campuses where both Starbucks and Dunkin' operate, Starbucks wins a narrow 52% of the time nationwide, yet still generates on average 60% more revenue than its Massachusetts-based rival. Still, Dunkin' holds its ground in the Northeast, where brand loyalty runs deep and preferences were formed long before students arrived on campus. On Massachusetts and New York campuses, Dunkin' averages 40% higher revenue than the chain's national average.

Summer Survivors and Seasonal Extremes

When students leave for summer, most campus businesses see revenues crater. But some categories show remarkable resilience. Our data reveals which businesses serve the entire campus community versus those that are mostly dependent on the undergraduate population.

Parking maintains 85% of its academic-year revenue during summer months—likely because many staff, faculty, and graduate students still need to park as they work through the slow season. Coffee shops also retain about 70% of their revenue, serving those who work on campus all year round. Convenience stores and pharmacies actually see some of the smallest drops, maintaining 75-80% of regular school year revenue.

But for most other businesses, summer is a wasteland. Bars see revenue drop 70%. Fast food falls 60%. Bookstores essentially hibernate, with some reporting 90% revenue declines in June and July. These seasonal swings create a brutal business environment where nine months of revenue must cover twelve months of costs.
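Summer retention can be expressed as average summer-month revenue divided by average academic-year-month revenue. The sketch below assumes June and July are the "summer" months (the two months named above for bookstores). That choice, the function name, and the sample figures are illustrative assumptions, not our production definitions.

```python
def summer_retention(monthly_revenue, summer_months=(5, 6)):
    """Average summer-month revenue as a fraction of the average
    academic-year month. Months are 0-indexed, so (5, 6) is June, July."""
    if len(monthly_revenue) != 12:
        raise ValueError("expected 12 monthly values")
    summer = [monthly_revenue[i] for i in summer_months]
    academic = [
        value
        for i, value in enumerate(monthly_revenue)
        if i not in summer_months
    ]
    return (sum(summer) / len(summer)) / (sum(academic) / len(academic))


# Hypothetical monthly revenue, January to December (in $K)
parking = [100] * 5 + [85, 85] + [100] * 5
bars = [100] * 5 + [30, 30] + [100] * 5

parking_kept = summer_retention(parking)
bar_drop = 1 - summer_retention(bars)
print(f"parking retains {parking_kept:.0%} of academic-year revenue")
print(f"bar revenue drops {bar_drop:.0%} in summer")
```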

In terms of upswings, the most extreme seasonal patterns also belong to businesses tied directly to the academic calendar. Some campus bookstores see August revenues 20x higher than July. Certain dining establishments report September revenues 15x their summer lows. These aren't just gradual seasonal patterns, but drastic extremes that require careful cash management and flexible staffing to survive.

What This Means

Our data tells a story of transformation through transactions. The American college campus has always been a strange commercial ecosystem—a place where teenagers with limited budgets somehow sustain billions in annual commerce, where monopolies thrive behind the veil of convenience, and where the academic calendar creates business patterns unlike anywhere else. But this ecosystem is evolving rapidly, shaped by generational changes in shopping behavior, technological disruption, and the increasing corporatization of campus life.

The $8 billion we see flowing through U.S. campus businesses each year isn't just consumer spending, but rather the physical manifestation of college life itself, measured in coffee cups and late-night pizzas, in overpriced textbooks and parking tickets, in all the small transactions over the course of an expensive education that launches American youths into adulthood.

Anchor your AI with Ground Truth Business Data

Hicham Oudghiri — Fri, 29 Aug 2025 00:00:00 GMT

We’ve been building something that is transforming intelligence about businesses.

The Enigma AI MCP Server just went live. We spent years building the most complete business identity graph. Now we're plugging it straight into AI systems. No APIs. No integration headaches. No data hunting.

We just gave your AI everything we know about American business, so it can make the connections we've spent years building.

What is it?

Think of MCP like USB-C for AI systems.

Before USB-C, every device needed its own cable. USB-C replaced them with a single connector, and MCP does the same thing for AI and business data.

Before the Model Context Protocol (MCP), connecting AI to your business systems meant custom work for every single tool. Want your AI to access Salesforce? Build a connector. Need it to work in Slack? Build another one. Moving from Claude to ChatGPT? Start over.

We chose MCP because your AI capabilities should follow you everywhere.

Same business intelligence in Claude on the web. Same verified data in your development tools. Same entity resolution in whatever AI platform your team adopts next year.

One integration; nearly every AI tool. When the next breakthrough AI launches, your business intelligence travels with you. No rebuilding, no integration hassle. Every time we update and extend our graph with more signal, your systems already know about it.

The feedback convinced me we built something different

We ran this quietly for months with risk and onboarding teams, investment firms, compliance teams, and business development groups. The feedback? Nobody wants their old workflow anymore.

Here's what happens now: You mention "Joe's Pizza" to your AI. Instantly, it knows this isn't just a corner shop. It's "JOE'S PIZZA NYC LICENSING, LLC" under "THE VILLAGE IN ANN ARBOR LLC" with 9 locations, specific incorporation dates, and registered agents. Complex ownership? Your AI already mapped it. A compliance officer said something that hit me: "I’m still verifying businesses every day. The difference is now my AI knows exactly the business I'm dealing with, and does the hours of research I’d be doing [from secretaries of state, plus the documents provided by the business], and just goes out and does all the additional searching I’d be doing to fill in the gaps. At this point, I’m checking its work and making a decision to deny or approve."

Unlocking the power of authoritative and messy government data

One of the ideas that's been with Enigma since our founding is this: government data holds incredible insights. It's also completely chaotic.

A single company can show up dozens of ways. "ABC Corp" in SEC documents. "ABC Corporation" in procurement files. "ABC Co." in violation records. Sometimes at different addresses, with different people, with different identifiers. Legacy systems see these as different companies.

Sometimes they're right to, but often, they're dead wrong.
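To make the name-variant problem concrete, here is a minimal Python sketch of suffix normalization. It is an illustration only, not Enigma's entity resolution: the suffix table and both function names are hypothetical, and a match on the normalized name should be treated as a candidate link that still needs addresses, people, and identifiers to confirm it.

```python
import re

# Hypothetical suffix table for illustration; a real system needs a far larger one.
SUFFIXES = {
    "corp": "corporation",
    "corporation": "corporation",
    "co": "company",
    "company": "company",
    "inc": "incorporated",
    "incorporated": "incorporated",
    "llc": "llc",
}


def normalize_name(raw: str) -> str:
    """Lowercase, strip punctuation, and canonicalize a trailing legal suffix."""
    tokens = re.sub(r"[^a-z0-9\s]", "", raw.lower()).split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)


def name_core(raw: str) -> str:
    """Drop the legal suffix entirely, leaving the distinguishing part of the name."""
    tokens = normalize_name(raw).split()
    if tokens and tokens[-1] in SUFFIXES.values():
        tokens = tokens[:-1]
    return " ".join(tokens)


variants = ["ABC Corp", "ABC Corporation", "ABC Co."]
cores = {name_core(v) for v in variants}  # all three collapse to one core name
```

Collapsing the three spellings to a single core only says the records might describe the same company. Deciding whether they actually do is the hard part, which is why name matching alone is never enough.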

We spent a decade untangling this identity maze. Connecting the dots. Making order from chaos. Our Identity Graph knows that "Mike's Pizza" on the corner actually belongs to a multi-state franchise with specific formation dates, legal structures, and operating permits.

The breakthrough came when we connected this graph to AI reasoning power. Scattered government records became coherent business stories. Ask about any company, get the complete picture. Not just their marketing name, but their real identity. Armed with this identity data, your AI can search Enigma’s government archive for relevant records about the business, such as newly issued licenses, safety violations, and connected entities and partners, or take that as a starting point for deep research on the business across the web. We found that when provided with Enigma’s trusted identity graph information, even relatively simple AI models could correctly distinguish the details of Famous Rays' Pizza, keeping it separate from Ray's Original Pizza and certainly from Ray's Bar.

With this launch, we are also introducing the Enigma Government Archive, the largest and most comprehensive linked record of official government filings. We scanned over 500,000 constantly updated official source datasets to collect billions of records on business entities. Accessing this repository through Enigma’s Agentic tools to turn unstructured queries into actionable data is just the beginning of where you’ll be seeing this data and its capabilities.

Intelligence that moves with you

Intelligence about businesses now travels everywhere. Claude instantly recognizes if that potential partner has compliance red flags. Cursor evaluates vendor stability while you code. Research assistants discover market gaps by reasoning across millions of U.S. businesses.

This goes beyond research assistance: it’s judgment. Whenever you need business entity insights, the intelligence waits there. Ready and anchoring your AI with real world truth.

Over a decade in the making

After over a decade of building our knowledge graph, watching AI navigate it feels like magic. We can't wait for you to experience it and build with us.

Hicham Oudghiri is CEO and co-founder of Enigma.

Sole Proprietorships, Micro-Businesses, and KYB: Why Traditional Verification Fails 82% of U.S. Businesses

Michael Niu — Thu, 14 Aug 2025 00:00:00 GMT

According to the U.S. Small Business Administration, 81.9% of businesses in America operate without employees—and the vast majority of these are sole proprietorships. While this term might not be what we use daily to describe your local coffee roaster, freelance designers, rideshare drivers, or the countless other micro-businesses informally owned by a single person, this humble structure forms the backbone of American entrepreneurship.

By definition, a sole proprietorship is an unincorporated business owned and operated by one person. Unlike registered corporations or LLCs, most states require no formation documents. The business exists as the legal extension of its owner—a feature that makes perfect sense for entrepreneurs asking, "Why bother filing paperwork for Corporate Personhood when I'm already a Real Person?"

Sidenote: While sole props might not register with Secretaries of State, paperwork often does exist for micro-businesses. In particular, local regulations may require a microbusiness to register a Fictitious Business Name (FBN) or the name they Do Business As (DBA) with state or municipal authorities.

This simplicity that makes sole proprietorships attractive to entrepreneurs creates a verification nightmare for financial institutions. Millions of these businesses operate with real economic impact: they file taxes generating over $1.3 trillion in annual receipts, open business bank accounts, and employ contractors. Yet unlike their incorporated counterparts, they often lack the paper trail that modern KYB systems depend on. This puts fintechs and banks in an impossible position—choosing between regulatory compliance, fraud prevention, and business opportunity.

Why Sole Proprietorships Are Hard to Verify

At a structural level, sole proprietorships defy the typical inputs of most KYB checks.

EIN (Employer Tax ID Number): While a sole prop can register an EIN, many don’t have one; instead they may file taxes under the owner’s SSN
Business registration records: Typically don't exist at all
Business name (DBA): Fictitious Business Name (FBN) may not be filed even though their DBA is used in practice
Legal Documents: No articles of incorporation or other organization docs
Business Address: Often is a mixed-use address like a home office or can be inconsistent between records

The Hidden Cost of "Just Skip Them"

When faced with these verification challenges, three tempting but flawed strategies often emerge:

"Just reject them all" – This approach would mean turning away three-quarters of American businesses, including established operations generating hundreds of thousands in annual revenue. For payment processors and financial services companies, this isn't just leaving money on the table—it's abandoning the entire market.

"Default to manual review" – With sole proprietorships representing the majority of applications, manual review quickly becomes unsustainable. After a micro-business misses verification on a KYB check, processes either attempt to fall back to running a second, more costly KYC check, or skip directly to manual review, which costs $20-50 per case and takes 24-48 hours to process. The economics simply don't work for businesses expecting instant onboarding and customers who demand it.

"Lower the verification bar" – Perhaps most dangerous, loosening standards for sole props creates an exploitable gap in your defenses. Sophisticated fraudsters understand these systemic weaknesses and actively target institutions with inconsistent verification standards.

The reality is that regulators don't care about your business structure when enforcing BSA/AML requirements. A sole proprietorship laundering money or facilitating fraud creates the same regulatory risk as any corporation—but traditional KYB tools leave you blind to 73% of the market.

What That Means for Risk and Compliance Teams

False negatives: Rejecting real, successful businesses that don’t meet formal requirements
Manual review bottlenecks: Time-consuming escalations to sort out ambiguous profiles
Missed fraud risk: Approving bad actors who leverage the ambiguity of sole proprietorships to bypass common screens

Verifying Sole Props with Confidence: Enigma’s Multi-Source Approach

Rather than relying solely on corporate filings, Enigma aggregates alternative signals that prove business activity. The approach is straightforward: if a business is real, it leaves traces everywhere—just not in traditional corporate databases. Here’s how:

Online Presence – Many sole proprietors market themselves through social media or personal websites. For example, Paul’s Auto Repair maintains an active Facebook company page, and Anelas Jewelry operates via Etsy, showcasing its products and customer interactions.
Industry Directories & Review Listings – Some businesses appear in local or industry-specific directories even if they’re not formally registered. Paul’s Auto Repair in Wilmington, DE, operates as a Tire Dealer & Repair Shop and is included in review listings (>80 reviews) despite lacking corporate registration. Anelas Jewelry, meanwhile, is a well-established business on Etsy. Both businesses would be rejected in traditional KYB checks.
Key Individuals & Ownership Data – Enigma can provide business ownership data, supporting legitimacy even when corporate registration is absent. Enigma checks a sole proprietor’s SSN directly with the IRS to confirm the business owner is real.
Revenue Insights – Estimated revenue and transaction data offer additional confirmation of business activity. Paul’s Auto Repair in Delaware generates an estimated $500k or more in card revenue annually, while Anelas Jewelry brings in around $242k. Signals like these, seen in Enigma data, clearly indicate that consumers have been purchasing from a business over many years.

Paul's Auto Repair has processed more than $500,000 USD in payment card transactions each year over the past 5 years, but has no corporate registration.

Traditional KYB would have rejected both these established businesses—walking away from nearly $750,000 in combined annual transaction volume from just two merchants. This is why leading verification platforms integrate Enigma into their KYB workflows. When orchestrated through platforms like Alloy, Enigma becomes the critical first step that catches the sole proprietorships other providers miss—all while maintaining the speed and controls that modern platforms demand.

AnelasJewelry has a 12-year track record on Etsy with over 88,000 sales to customers, all operating as a sole proprietor.

Enigma continues to source and incorporate more public data, such as DBA business filings at the local and county level, to expand its business verification data ecosystem. The industry is taking notice—companies like Middesk are also exploring approaches to the sole prop challenge. But the solution requires more than recognition; it demands access to alternative data sources that can reliably verify business activity without traditional corporate filings.

Helping Companies See the Full Picture

For businesses that rely on accurate verification—whether for risk assessment, lending, or marketing—having access to a broader dataset is crucial. Enigma bridges the gap by surfacing sole proprietorships that aren’t captured in traditional records, enabling smarter business decisions and reducing risk.

Sole proprietorships are strange beasts: legally valid, economically important, but structurally difficult to verify. They don’t fit neatly into the frameworks we’ve built for corporate due diligence, but that doesn’t mean they’re inherently high-risk.

It just means we need smarter systems.

The path forward isn’t magic, it’s multi-source, real-world verification, layered together to form a coherent picture. When you start asking “Can I prove this business is real, active, and trustworthy?” instead of “Does it have Articles of Incorporation?” you begin to see through the noise.

In an age of rapidly evolving fraud, powered by AI, that clarity matters more than ever. The sole proprietorship challenge isn't going away. If anything, the gig economy and rise of solopreneurship mean these businesses will only become more prevalent. Risk and compliance teams can't afford to wait for traditional KYB to catch up.

For teams evaluating their verification stack today, the question isn't whether to verify sole proprietorships—regulators have already answered that. The question is whether your current approach can see what Enigma sees: real businesses generating real revenue, just without the corporate paperwork.

The Mystical-Industrial Complex

Enigma — Thu, 24 Jul 2025 00:00:00 GMT

In this issue: discovering businesses like a mind-reader

In these times, hyper-specialization is the name of the game and Enigma’s here to help you identify exactly who you’re looking for. Sure you can use Enigma to learn more about a business you can already identify, but what if you need to find the right prospects?

Escape the poor targeting of traditional industry labels and find businesses by what they actually do, not by the label someone else gave them.

In this video, Navya shows how to use the spookily accurate Semantic Search to find ideal lists of businesses, in this case psychic mediums; all part of the Enigma Console.

The Mystical-Industrial Complex

Something extraordinary is happening in America's spiritual marketplace. While the Wall Street Journal discovered that witches are crushing it on Etsy—with desperate job seekers dropping $15 for employment spells—that's just the demand side of the story.

The real magic? How America's 120,000+ metaphysical businesses are building an $11.8 billion mystical economy.

Opening the Third Eye (on Data)

To understand this hidden economy, we analyzed transaction data from metaphysical businesses nationwide using Enigma's new Console. We used the new Semantic Search to find exemplary metaphysical businesses, then identified others that shared high similarity with their business description. We focused on truly mystical enterprises—those with "woo-woo factors" of 6 or higher:

Maximum Mysticism: Fortune tellers, psychics, occult & magic stores
High Mysticism: Spiritual supply stores, crystal shops, botanicas
Substantial Mysticism: Energy healers, mind-body practices, naturopaths

Together, these businesses generate $11.8 billion annually, roughly half the size of the U.S. yoga industry. But here's where it gets interesting.

Finding #1: The Crystal Boom That Actually Happened

Forget toilet paper—the real pandemic hoarding was healing crystals.

While most mystical businesses saw modest pandemic gains, one category absolutely exploded:

Crystal & Mineral shops: +32% growth during pandemic
Psychics & Divination: +16% growth
Occult & Magic: +10% growth

But here's the kicker: crystal shops have maintained their elevated revenue levels. Psychics? They've returned to earth. The data suggests Americans didn't just panic-buy amethyst—they kept buying it.

Finding #2: The $5,000 Psychic Reading Is Real

Premium mysticism commands premium prices.

While Etsy witches compete on $15 spells, established mystical businesses operate in an entirely different universe.

One psychic studio averages $4,028 per reading. That's not a typo. These aren't corner palm readers—they're spiritual consultants to the wealthy, offering multi-hour sessions, ongoing guidance, and what one practitioner calls "executive soul coaching."

Finding #3: The Great Digital Divide

Ancient wisdom meets modern commerce—reluctantly.

Despite operating in 2024, the mystical economy remains stubbornly analog:

Only 10.7% of psychics accept online payments
Just 19.7% of spiritual supply stores have e-commerce
But those who do? They're killing it—averaging 50%+ of revenue online

The opportunity is staggering. If the roughly 90% of psychics who are still offline went digital at current conversion rates, it would unlock an additional $2.1 billion in transaction volume.

Finding #4: Energy Work Is the New Work

63,503 businesses, $5.2 billion in revenue, and growing.

The biggest success story in mystical business isn't psychics or crystals—it's energy work:

Reiki practitioners: 31,847 businesses
Energy healers: 18,292 businesses
Chakra therapists: 8,194 businesses

These practitioners have cracked the code: position mysticism as healthcare. They accept insurance (sometimes), get referrals from doctors (occasionally), and most importantly, solve concrete problems (stress, pain, anxiety) with mystical solutions.

Average revenue? $81,306 annually. Not Fortune 500 money, but solid middle-class income from moving invisible energy.

Finding #5: October Is Overrated

The real money is in Mercury retrograde.

Yes, October sees a bump for occult businesses. But the fascinating pattern is elsewhere:

March, July, November: Peak months for psychics (Mercury retrograde)
December: Massive spike for crystal shops (holiday gifts)
January: Boom time for energy healers (New Year intentions)

Smart mystical businesses have moved beyond Halloween. They're tracking astrological events, creating content around moon phases, and turning cosmic phenomena into commerce opportunities.

Finding #6: The Mystical Middle Class

Neither Etsy witches nor luxury shamans—meet the $150K mystical entrepreneur.

The data reveals a robust middle tier of mystical businesses:

16,776 businesses earning $100K-$500K annually
Average transaction: $127-$203
Typical services: 60-90 minute sessions, crystal consultations, spiritual readings

These aren't side hustles or celebrity gurus. They're professional practitioners who've built sustainable businesses by finding the sweet spot: mystical enough to differentiate, professional enough to trust.

What's Really Happening Here

The $15 Etsy spell is a sideshow. The real mystical economy is built on trust, professionalism, and solving real problems with spiritual solutions.

Successful mystical businesses understand three things:

Premium positioning works: A $3,000 psychic reading sells better than a $30 one
Digital is optional but lucrative: The few who embrace it see 50%+ online revenue
Mysticism is seasonal: But not just Halloween—every cosmic event is a sales opportunity

The mystical economy isn't just surviving—it's professionalizing. When 120,000+ businesses generate $11.8 billion annually, we're not talking about a trend. We're watching an industry mature.

The witches of Etsy may be making headlines with budget spells, but the real magic is happening in America's 63,503 energy healing centers, 10,762 psychic establishments, and 1,831 crystal shops—where ancient wisdom meets modern commerce, and business is mystically good.

Methodology: Analysis based on Enigma's comprehensive business database covering businesses with metaphysical classifications rated 6-10 on our "woo-woo factor" scale. Transaction data from January 2018 through December 2024. Revenue figures represent card transaction data.

Datapoints

Google & OpenAI are fighting over a high school math competition [TechCrunch].
Is Zyn a White Collar PED [Colossus Review]?
Even with 500m users, Meta hasn’t been able to get WhatsApp Pay to take off in India [Rest of World]
Sure, I guess selling the records pulled from dead people’s hacked computers to debt collectors counts as an “Alternative Data Startup” [404 Media]
Are your top-notch remote engineering hires obsessed with Minions? They could be North Korean agents. And if not, it still isn’t a good sign. [Wall Street Journal]

Announcing the New Enigma Platform, Re-Architected for Programmatic Intelligence

Enigma — Wed, 09 Jul 2025 00:00:00 GMT

For every business leader trying to size a market or every sales team trying to find the right leads, there's a frustrating, hidden truth: most business data is a mess.

It exists as flat, disconnected lists. "Mike's Pizza" in one file has no connection to "Michael's Pizzeria, Inc." in another. A brand's popular online store is treated as a separate entity from its legal registration. It’s chaos, and it forces you to build your strategy on a broken foundation.

At Enigma, we believe this is a solvable problem. But it requires a fundamentally different approach. It’s why we spent the last year rebuilding our entire platform from the ground up.

This post is for those who want to look under the hood. It’s a guide to the architecture that powers the new Enigma platform and how it enables you to see the business world with unprecedented clarity.

Why Most Business Data is Broken (And How We Built a Better Foundation)

The simple truth is that most business data platforms are built on a flawed architecture. They treat businesses as simple rows in a spreadsheet, failing to capture the messy, complex reality of how they actually operate.

A single business has multiple identities. It's the brand your customer sees ("Wildflower Goods"), the legal entity on its tax forms ("WF Holdings LLC"), and the network of physical locations where it operates. Without accurately connecting these dots, your data isn't just incomplete—it's actively misleading.

This is the foundational problem we solved.

We created the Enigma Identity Graph, a proprietary knowledge graph that organizes the chaotic reality of the business world into a clean, connected map. Instead of just stitching lists together, our graph understands the crucial relationships between entities.

Here’s what that means in practice: When a fintech partner tries to onboard "Wildflower Goods," our graph instantly and confidently links it to the verified legal entity "WF Holdings LLC." This is the difference between a stalled application that requires days of manual review and a seamless, automated approval.
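The brand-to-legal-entity link described above can be pictured as a small data structure. The Python sketch below is only a toy model under stated assumptions: the class names, the `resolve` method, and the placeholder address are hypothetical and say nothing about how the Enigma Identity Graph is actually stored or queried.

```python
from dataclasses import dataclass, field


@dataclass
class LegalEntity:
    name: str


@dataclass
class Location:
    address: str


@dataclass
class Brand:
    name: str
    legal_entities: list = field(default_factory=list)
    locations: list = field(default_factory=list)


class IdentityGraph:
    """Toy lookup from a customer-facing brand name to its linked legal entities."""

    def __init__(self) -> None:
        self._brands = {}

    def add_brand(self, brand: Brand) -> None:
        # Index by a case-insensitive key so "wildflower goods" still resolves.
        self._brands[brand.name.strip().lower()] = brand

    def resolve(self, brand_name: str) -> list:
        brand = self._brands.get(brand_name.strip().lower())
        return [entity.name for entity in brand.legal_entities] if brand else []


graph = IdentityGraph()
graph.add_brand(
    Brand(
        name="Wildflower Goods",
        legal_entities=[LegalEntity("WF Holdings LLC")],
        locations=[Location("placeholder address")],  # made up for the example
    )
)
linked = graph.resolve("wildflower goods")
```

The point of the structure is that the brand, the legal entity, and the locations are separate records joined by explicit relationships, so an onboarding check can start from whichever name the applicant happens to supply.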

It’s an architecture built for clarity, ensuring you have a single, reliable source of truth for every business you interact with.

Capability 1: See the True, Holistic Picture of Every Business

Because we built a true Identity Graph, you can finally connect all the dots. Our platform accurately links parent companies to their brands, franchises to their owners, and DBAs to their true legal entities. This is the difference between simply targeting "fast-casual restaurants" and being able to identify a single, fast-growing chain onboarding dozens of new franchisees. It allows you to understand the true structure of an enterprise opportunity and engage with the right decision-maker.

Capability 2: Discover Unfindable Markets with Semantic Search

Because we use semantic embeddings, you can finally find businesses the way you think about them. Your ideal customer profile isn't a NAICS code. It’s a concept, like "a boutique hotel with a wellness focus" or "family-run hardware stores." Our new Console allows you to search for businesses with this kind of descriptive, natural language. Our model understands the intent behind your query, surfacing the most relevant businesses, even if they don’t use your exact keywords. This is how you find the niche markets your competitors don't even know exist.
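For readers curious about the mechanics, embedding-based search generally comes down to ranking items by vector similarity. The Python sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings. The business names, the vectors, and the `search` function are all invented for illustration and do not reflect Enigma's model or Console.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors standing in for embeddings. The three made-up dimensions are
# roughly (lodging, wellness, hardware); a real model would produce these.
catalog = {
    "Spa retreat inn": [0.9, 0.8, 0.0],
    "Roadside motel": [1.0, 0.1, 0.0],
    "Family hardware store": [0.0, 0.0, 1.0],
}

# Stand-in for the embedded query "a boutique hotel with a wellness focus".
query = [0.8, 0.9, 0.0]


def search(query_vec, items, top_k=2):
    """Return the top_k item names ranked by similarity to the query vector."""
    ranked = sorted(items, key=lambda name: cosine(query_vec, items[name]), reverse=True)
    return ranked[:top_k]


results = search(query, catalog)
```

Note that the query shares no keywords with "Spa retreat inn", yet it ranks first because the vectors point in nearly the same direction. That is the property that lets a description-based search reach past fixed industry codes.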

Capability 3: Access High-Fidelity Intelligence You Can't Get Elsewhere

Because our data is precisely linked, you can finally trust high-value signals. Our graph's accuracy allows us to confidently attach proprietary merchant card revenue data. We can tell you not just that a business exists, but its actual merchant pulse. With a 70% success rate in estimating card revenues within ±30% of actual values, we provide a verifiable signal of health that replaces guesswork with ground truth.
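One way to read a claim like "within ±30% of actual values" is as a hit rate: the share of businesses whose estimate lands inside the tolerance band around the true figure. The Python sketch below shows that calculation on made-up numbers. The function name and the sample data are assumptions for illustration, not Enigma's evaluation code or results.

```python
def share_within_tolerance(estimates, actuals, tolerance=0.30):
    """Fraction of estimates that fall within +/- tolerance of the actual value."""
    if not actuals or len(estimates) != len(actuals):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        1
        for est, act in zip(estimates, actuals)
        if act > 0 and abs(est - act) <= tolerance * act
    )
    return hits / len(actuals)


# Made-up annual card revenues (USD) for ten hypothetical businesses.
actual_revenue = [100_000, 200_000, 50_000, 80_000, 120_000,
                  300_000, 10_000, 60_000, 90_000, 150_000]
estimated_revenue = [110_000, 150_000, 70_000, 80_000, 100_000,
                     420_000, 9_000, 30_000, 95_000, 140_000]

hit_rate = share_within_tolerance(estimated_revenue, actual_revenue)
```

In this toy sample, seven of the ten estimates fall inside the band, which is what a 70% success rate at ±30% would look like in practice.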

Semantic Search in the Enigma Segment Explorer helps find the surprising, emerging, and taxonomy-breaking businesses.

Intelligence Where You Work: The Console & Our APIs

This intelligence is delivered through two powerful interfaces designed for different workflows:

The Enigma Console: Your interface for human-speed discovery. It’s a sandbox for GTM leaders and analysts to explore the business landscape, test hypotheses, and build hyper-targeted lists with an intuitive, self-serve UI.
Our APIs: Your engine for machine-speed automation. Our GraphQL API allows your developers to retrieve a rich, connected profile of any business in a single, efficient query. Our revamped KYB API, powered by Enigma graph-model-1 (now in private preview), provides a turnkey solution to automate and scale your compliance and onboarding workflows. This is how you deliver programmatic insights directly into the tools your teams live in every day.

We didn't just rebuild our product; we built a new foundation for intelligence about businesses. We invite you to see the difference for yourself.

Ready to see it in action?

Try the new Enigma Console and start exploring
Dive into our API and start building
Read our CEO's vision for a programmatic world to see where it's all going

Data Built for a Programmatic World

Hicham Oudghiri — Wed, 09 Jul 2025 00:00:00 GMT

For the past decade, the world of business intelligence has operated on a simple but flawed model: humans asking questions of static data. We pull lists, we run queries, we build dashboards. But the scale of our questions is quickly outpacing the tools we have to answer them.

The next decade will be defined by a different model: machines asking questions of dynamic, connected data. AI agents, integrated into our most critical workflows, will demand intelligence that is not just accurate, but structured, reliable, and available programmatically.

The problem is, most business data today is not built for this future. It’s a messy, fragmented collection of files that requires immense human effort to become useful.

At Enigma, we decided it was time to build for the world that’s coming, not the world that’s been.

Today, we’re launching the new Enigma platform, the result of a complete, ground-up rebuild of our entire data infrastructure. At its heart is graph-model-1, a proprietary knowledge graph designed from first principles to be the trusted, machine-readable source of truth for the U.S. business economy.

We didn't just clean up the data; we gave it a grammar. Our graph understands the complex relationships between brands, legal entities, and their locations, creating a foundation that is ready for the demands of programmatic intelligence.

With this new platform, your teams can immediately:

Leverage our unique card revenue data, now linked with even greater precision to the correct business entity, giving your models a powerful, real-world signal of financial health.
Access this intelligence through our new GraphQL API, architected to allow developers to ask complex, relational questions with unprecedented ease and flexibility.
Explore and test hypotheses in our new self-serve Console, a powerful UI for your human experts to find insights that can then be scaled and automated.

We built this because we know where the world is headed. Before an AI agent can answer your question, it needs a reliable source to query. Before it can enrich your internal data, it needs a canonical key to join on. Before it can be a trusted co-pilot for your enterprise, it needs a platform that speaks its language.

The new Enigma platform is that foundation. It's data, ready for a programmatic world. Very excited for everyone building with us and we have a lot more for you coming on the back of graph-model-1 ✨

Hicham Oudghiri is the CEO and cofounder of Enigma.

KYB vs AI Fraud

Michael Niu — Fri, 13 Jun 2025 00:00:00 GMT

On November 13, 2024, FinCEN issued FIN-2024-Alert004, a warning about a surge in fraud schemes using deepfake media and generative AI to exploit identity verification systems. While the alert focused on fraud targeting individuals, its implications extended to the verification of businesses as well.

FinCEN highlights fraudsters leveraging generative AI tools to create falsified documents and media that evade conventional identity and due diligence controls. They note:

“Fraudsters are using GenAI as a low-cost tool to exploit financial institutions’ identity verification processes, including by creating falsified documents… to circumvent customer identification and due diligence controls.”

The alert underscores that many fraud schemes combine these synthetic elements with hacked or leaked personal data to increase credibility, making detection more difficult.

Evolving Tactics of Synthetic Business Fraud

Although the FinCEN alert focuses on individuals, the tactics described can be applied to business identities with alarming effectiveness. Synthetic businesses may incorporate:

Fake Articles of Incorporation and business registration documents crafted or altered using AI tools
Stolen EINs or SSNs that are valid numbers but now reside in a fraudster’s database
Phony business websites built with AI-generated content, complete with realistic product descriptions and executive bios
Manufactured digital footprints such as AI-generated social media profiles or business listings
Fabricated contracts, leases, or supplier agreements created to mimic legitimate operational documents

Each of these elements can be assembled quickly, cheaply, and convincingly — allowing synthetic businesses to slip through simple KYB screenings that rely on a limited set of data points.

How to Build a Multi-Source KYB Approach

Defending against synthetic business fraud requires moving beyond single-source checks toward a multi-source, multi-factor verification strategy, combining:

Cross-Registry Validation: Confirm business registration, ownership, and EIN across federal, state, and trusted third-party databases

Digital Presence Analysis: Evaluate website authenticity (domain age, hosting details, web traffic patterns), social media presence, news mentions and adverse media, and archived digital footprints

Operational Evidence: Verify real-world business activities such as brick and mortar locations, headcount growth, financial transaction activity, revenue patterns, and key personnel identification.

Ownership and Beneficial Owner Verification: Independently verify UBO identities through government records or third-party data sources

Behavioral and Transactional Monitoring: Monitor early transactional behavior for anomalies or patterns indicative of synthetic entities

Deepfake and Document Forensics: Use AI and forensic tools to detect document manipulation, inconsistencies, or fabrication

Successful KYB programs prove business existence across multiple trusted data sources, reflecting actual operations in the real and digital world, not just a few documents or disclosures which are easily falsifiable by LLMs.
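The six checks above can be combined in many ways. As a minimal sketch, the Python below simply counts how many independent sources corroborate a business and maps that count to a decision. The field names, the thresholds, and the three outcomes are hypothetical choices for illustration, not a recommended policy or any vendor's scoring model.

```python
from dataclasses import dataclass, astuple


@dataclass
class Evidence:
    """One boolean per verification layer; True means the check corroborated the business."""
    registry_match: bool
    digital_presence: bool
    operational_activity: bool
    owner_verified: bool
    clean_early_transactions: bool
    documents_pass_forensics: bool


def corroborating_sources(evidence: Evidence) -> int:
    """Count how many independent checks support the business."""
    return sum(astuple(evidence))


def decide(evidence: Evidence, approve_at: int = 5, review_at: int = 2) -> str:
    """Map the count to an outcome. Thresholds are illustrative assumptions."""
    count = corroborating_sources(evidence)
    if count >= approve_at:
        return "approve"
    if count >= review_at:
        return "manual_review"
    return "reject"


# A synthetic business with convincing documents and a website, but nothing behind them.
paper_only = Evidence(False, True, False, False, False, True)
# An established business corroborated by every source except a formal registry record.
established = Evidence(False, True, True, True, True, True)

paper_decision = decide(paper_only)
established_decision = decide(established)
```

The design choice worth noticing is that no single signal, including clean-looking documents, is enough on its own to approve. A fabricated entity has to fake several unrelated sources at once, which is much harder than forging one file.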

Develop Resilience to Synthetic Business Fraud

FinCEN’s November 2024 alert underscores a clear reality seen across industries: fraud powered by generative AI and synthetic media is not just a threat to individuals — it’s rapidly evolving in the business domain.

If your KYB process still relies mainly on superficial validation of publicly accessible information or easily faked documents, fraudsters can and will bypass your defenses at an increasing rate. If your KYB process lags behind your other anti-fraud and security measures, then bad actors may see it as a vulnerability to exploit, especially in a world where establishing convincing fake companies and exploiting compromised business profiles is increasingly trivial.

Because the tools are available for virtually no cost, the question is no longer if but when. Proactive, multi-source KYB verification is the frontline defense fintechs and financial institutions must adopt to stay ahead of increasingly sophisticated synthetic business fraud.

The "Do Not Call" List is Working

Hicham Oudghiri — Thu, 05 Jun 2025 00:00:00 GMT

You might not have noticed, but the Do Not Call list is quietly working really well.

Enigma crunched the latest Federal Trade Commission data and found that while the Do Not Call list has grown to include about 250 million numbers, Do-Not-Call complaints have plummeted.

The drop-off is stark: Americans lodged 5 million Do-Not-Call complaints in fiscal year 2021, and that number has now fallen to 1.5 million in fiscal year 2025.

Big dialers have been squeezed out by tougher enforcement, and that means that the current complaint file is pure signal: if a brand (or the carrier behind it) is still showing up, something’s wrong.

That’s exactly the kind of low-noise, high-yield input that KYB teams should fold into onboarding and monitoring processes to spot risky merchants and VoIP partners long before the usual checks light up.

This month’s Hall of Shame – Top Do-Not-Call List Complaint Generators

• e-liquid.com — Reddit “paid-but-never-shipped” stories

• atmdepot.com — “passive-income ATM” cold calls, rock-bottom trust score

• thesunnycompany.com — viral free-swimsuit fiasco of 2017 returns with a robocall spike

• inetbatelecom.com — VoIP wholesaler fresh off an FCC penalty

• sentrycredit.com — debt collector with a harassment rap sheet

• ecoselectfurniture.com — tiny furniture site now seemingly offline, national complaint burst

• senecadd.org — county agency likely being spoofed for impostor scams

Enigma developed this list by taking the latest FTC dataset and normalizing and deduplicating the phone numbers. We then resolved each caller ID to a brand and originating carrier with our phone-to-domain lookup. Finally, we ranked the list by total complaints and complaints-per-unique-number to surface outsized offenders, and cross-checked the results with Reddit, ScamAdviser, BBB, FCC consent decrees, and PACER to separate fraud from harmless misdials.
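To make the normalize, dedupe, and rank steps concrete, here is a minimal Python sketch. It is not Enigma's pipeline: the input shape (a list of raw phone strings, one per complaint), the `phone_to_brand` dictionary standing in for the phone-to-domain lookup, and the example domains are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def normalize_phone(raw):
    """Reduce a US phone number to its 10 digits, or return None if malformed."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits if len(digits) == 10 else None

def rank_brands(complaints, phone_to_brand):
    """Rank brands by total complaints, then by complaints per unique number.

    complaints: iterable of raw phone strings, one per complaint (assumed shape).
    phone_to_brand: dict of normalized phone -> brand domain (hypothetical lookup).
    """
    totals = defaultdict(int)
    numbers = defaultdict(set)  # the set dedupes numbers after normalization
    for raw in complaints:
        phone = normalize_phone(raw)
        brand = phone_to_brand.get(phone) if phone else None
        if brand is None:
            continue  # skip malformed or unresolved caller IDs
        totals[brand] += 1
        numbers[brand].add(phone)
    rows = [(b, totals[b], totals[b] / len(numbers[b])) for b in totals]
    return sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)

# Made-up example data: two formats of the same number collapse into one.
complaints = ["(555) 010-0001", "+1 555 010 0001", "555-010-0002", "555.010.0003", "bad"]
lookup = {
    "5550100001": "example-a.test",
    "5550100002": "example-a.test",
    "5550100003": "example-b.test",
}
ranking = rank_brands(complaints, lookup)
```

A high complaints-per-unique-number ratio points at a small set of numbers generating a lot of complaints, which is the "outsized offender" pattern the ranking is meant to surface.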

What’s left are exactly the kinds of companies that your KYB process should weed out as early as possible.

Do-Not-Call (DNC) Reported Calls Data

Rebuilding the Foundation: Our Journey to Create Enigma graph-model-1

Ryan Green — Wed, 28 May 2025 00:00:00 GMT

The decision to rebuild a product from the ground up is a high-risk, high-reward undertaking. Not only is it expensive and stressful, but most rebuilds fail. After an intense twelve-month build cycle, I'm reflecting on why we made this decision and what we've learned along the way.

The Invisible Backbone of the American Economy

Drive through any stretch of highway in America, and you'll pass dozens of establishments that represent years—sometimes generations—of labor and care. The tens of millions of US private businesses are the engines of our economy that create employment, drive innovation, and enable social mobility.

Yet in our increasingly digital world, these businesses are poorly represented in our information systems. The local auto repair shop, the family-owned restaurant, the innovative startup in a converted warehouse—these vital entities often exist as fragmented, inconsistent data points across disparate systems.

This fragmentation isn't just an abstract problem. It creates real friction in our economy: lenders struggle to assess risk accurately, digital companies can't serve these businesses effectively, and municipalities lack the insights needed to support local economies.

Why We Rebuilt From Scratch

Eventually every system encounters fundamental limitations where incremental improvements yield increasingly marginal benefits. Our existing architecture had served us well, but to achieve our vision of creating the definitive mapping of US businesses, we needed to rethink our approach.

The old system struggled with the inherent complexity of how businesses exist in the real world—the messy relationship between brands, locations, and legal entities. Data freshness was inconsistent. And we couldn't leverage the recent advances in AI and machine learning in a thoughtful, integrated way.

After months of analysis and prototyping, we made the difficult decision to rebuild from the foundation up. (We called this “burning the boats.”)

A Data Model That Reflects Business Reality

One of the most critical insights that drove our rebuild was recognizing that every business has two distinct identities: its brand identity (how it presents to customers) and its legal identity (how it interacts with financial and legal systems).

Traditional approaches typically conflate these identities or prioritize one over the other. But understanding a business requires comprehending both facets and how they interrelate.

For example, your favorite coffee chain might operate under a single recognizable brand, but behind that cohesive customer experience lies a complex web of franchise agreements, holding companies, and local LLCs. For compliance purposes, you need to understand the legal structure. For market intelligence, the brand relationships matter more.

Our data model explicitly separates and connects these entities, allowing flexibility in how the data can be used. You can apply KYB (Know Your Business) filters to prospecting lists to pre-qualify customers. Or we can provide operational and market signals about a business undergoing compliance checks.

We've published our data model documentation and exposed it through an expressive GraphQL interface, allowing developers to query this complex relationship network in intuitive ways.

Obsessive Focus on Timeliness and Precision

The half-life of business data is notoriously short. Locations open and close, ownership changes hands, and web presences evolve continuously.

We've architected our systems to reevaluate every physical address in the US at least every 90 days, which allows us to promptly discover new businesses and identify locations that have closed. We inspect each US business domain with the same frequency.

This required building sophisticated orchestration systems to manage billions of data points and process terabytes of information efficiently. We've developed proprietary confidence scoring algorithms that help us prioritize where to direct our computational resources and human attention.

A central element of our development process was to re-evaluate our sources of small business data and discover new sources. During this process, we reconfirmed that (unfortunately) many of the most widely used sources of business data suffer from serious quality issues. This strengthened our commitment to provide a superior product.

A Data-First Approach to AI Integration

An important aspect of our rebuild was determining how to thoughtfully integrate recent advances in generative AI. The pressure to embrace AI has led to many superficial AI integrations that generate more hype than value.

We've taken a different path—one that starts with data quality as the foundation. We confirmed that traditional data cleaning techniques and classical statistics are effective for many problems. This creates a solid foundation on which to apply newer AI models in targeted areas where they’ve proved to be exceptionally powerful.

For instance, our entity resolution systems combine traditional probabilistic record linkage techniques with transformer-based models that can understand contextual relationships between entities. The hybrid approach gives us the best of both worlds: the interpretability and stability of classical methods with the powerful interpretive capabilities of modern AI.
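As a rough illustration of the classical half of that hybrid, here is a Fellegi-Sunter style match score in Python. This is a sketch, not Enigma's model: the field list, the m/u probabilities, and the decision thresholds are hypothetical values chosen for the example, and routing the ambiguous "review" band to a contextual model or a human is an assumption about how the two halves might be combined.

```python
import math

# Hypothetical per-field probabilities, for illustration only:
# m = P(field agrees | same business), u = P(field agrees | different businesses).
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "phone": (0.80, 0.001),
}

def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            continue  # missing values contribute no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

def classify(weight, upper=8.0, lower=0.0):
    """Three-way decision: match, non-match, or hand off for closer review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"

# Made-up records: same name, different ZIP and phone.
store_a = {"name": "giant", "zip": "17101", "phone": "5550100001"}
store_b = {"name": "giant", "zip": "20001", "phone": "5550100009"}
decision = classify(match_weight(store_a, store_b))
```

With these illustrative numbers, two records that share only a name land in the "review" band, which is where extra context is needed to decide whether they are the same business.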

Similarly, we've deployed AI-focused approaches in ways that augment our human teams and processes. We've built custom AI agents that evaluate data quality, suggest improvements, and fix issues faster than would be possible with human intervention alone. These systems compound our ability to rapidly improve data quality and build new features.

Human and Agent-based Curation

Even the best statistical models make mistakes when data is hard to interpret. My personal favorite example: the two completely independent Giant Supermarket chains that operate in adjacent states but have no corporate relationship whatsoever.

To address these edge cases, we've built a mechanism to establish our own definitive set of facts about businesses. At the core of Enigma's product is a database that allows us to assert facts that refine our statistical models and AI agents.

We've extended this capability to our customers, giving them the ability to suggest corrections when our data is wrong or incomplete. We review these suggestions daily and incorporate valid corrections within a seven-day window. This creates a virtuous cycle where our data accuracy continually improves, focused on the areas that matter most to our customers.

Pulling Back the Curtain: Our Engineering Journey

Our journey has been replete with novel technical challenges and surprising learnings. In the coming months, we’ll dive deeper into several of these topics:

Building a distributed, fault-tolerant pipeline capable of handling billions of signals
Developing novel approaches to entity resolution that blend traditional techniques with modern AI
Creating confidence scoring algorithms that help us allocate attention to the most uncertain data points
Engineering a system for domain-specific human review that maximizes expert time
Designing a GraphQL API that makes complex graph relationships intuitive and discoverable

Our goal is to share a candid account of our journey–particularly where we made mistakes and where our initial hypotheses turned out to be incorrect. Engineering is messy, filled with false starts and unexpected revelations. So expect an unsanitized version.

If you're an engineer, data scientist, or product builder working on complex data problems, I’m optimistic that you'll find something valuable in these posts—whether it's a technical approach you can adapt or simply the reassurance that other highly talented teams struggle with challenges similar to the ones you may be facing.

The Path Forward

As we put the finishing touches on our initial rebuild, we're transitioning to a new phase focused on expanding the reach and impact of this work. We're developing industry-specific extensions, enhancing our API capabilities, and deepening our integration with workflow tools where business decisions are made.

The ultimate measure of our success is the value it delivers to our customers: helping lenders make better credit decisions for small businesses, enabling software companies to serve the middle market more effectively, and giving businesses themselves better insights about their competitive landscape.

If you're working on problems where an accurate understanding of US businesses is critical, I'd love to connect.

Ryan Green is the Chief Technology Officer at Enigma, where he leads the development of data products that bring clarity to the complex world of private businesses.

Rise of Takeout and Delivery

Enigma — Fri, 23 May 2025 00:00:00 GMT

Data You Can Trust - Because We Don't

Here at Enigma, we have trust issues. Not the relationship kind (okay, maybe a little), but the data kind. We don't trust that a business listing from 2019 is still accurate. We don't trust that two "different" companies aren't actually the same pizza shop with questionable record-keeping. We certainly don't trust that every registered business entity is actually doing business. So we've built a quality control system that would make a pharmaceutical lab jealous: 492 checks, weekly pipeline runs, and enough validation protocols to fill a small library. The result is data so clean you could perform surgery on it.

Which brings us to this month's investigation, where we used this obsessively verified data to tackle an economic puzzle that's been stumping researchers: why did restaurant productivity surge 15% during the pandemic and stubbornly stay there, even as the rest of the economy returned to normal? We dug into the numbers to find out what's really driving this unprecedented shift in how America's restaurant industry operates. The answer might surprise you. Or it might not. But the scale of it definitely will.

Want to understand exactly how we ensure unparalleled data quality? Our technical whitepaper breaks down the data quality methodology behind investigations like this one.

Rise of Takeout and Delivery

Something unique is happening in the restaurant industry. Across large swathes of the U.S. economy, labor productivity initially surged during the pandemic. But it's since cooled off and is basically back to the pre-pandemic trend. But not so in the restaurant industry: productivity jumped after an early pandemic shock and has remained a whopping 15% higher than pre-COVID levels since. The Chicago Fed's Austan Goolsbee and his fellow researchers picked apart this puzzle using cell-phone location data and found the answer: takeout and delivery orders. But what's the input driving that change? Just how much have takeout and delivery orders increased?

Enigma data gives us the answer, which is a lot:

Takeout and delivery orders have more than doubled since 2017, when measured as the percentage of restaurant revenues that came from online orders. That would be a big shift for any industry, but for one with notoriously high fixed costs and low margins, it’s reoriented how restaurants work – or more precisely, how much work they can do when a bigger and bigger chunk of their business isn’t tied to the number of tables they can fill and staff can service. The pandemic forced a structural shift in how restaurants operate. Two years later, that shift has become the new normal, and the productivity gains prove it's here to stay. It's the kind of economic transformation you can only truly see with comprehensive economic data. You know... the kind we obsess over.

Data Points:

The global market for cooling technology like air-conditioning could double in 5 years – BBC
Europe is adding a €2 per package fee to all Temu and Shein orders – Politico
Somehow, American cereal is getting even more unhealthy – NYT
How Ukraine lost hundreds of millions of dollars on bad arms deals – Financial Times
For the first time ever, BYD sold more EVs in Europe than Tesla – Bloomberg
Nike will raise shoe prices $5-$10 for shoes over $100 – CNN
Nestle and the EU are arguing about whether Perrier is “natural mineral water” – NYT

Product Release:

From gig economy workers to digital marketing influencers to your local bodega, verifying a growing range of small businesses is essential to modern KYB. That’s where Enigma’s new Social Security Number (SSN) Verification product comes in. While many businesses use Employer Identification Numbers (EINs) to identify themselves, SSNs are also perfectly valid and used by countless businesses to pay workers and get paid for their hard work. We have you covered no matter who comes knocking—Enigma’s new SSN Verification Product pairs with our existing EIN Verification to provide comprehensive identity verification for small businesses in your ecosystem.

See the Receipts

Introducing Social Security Number (SSN) Verification: Comprehensive KYB for More Small and Micro Businesses

Michael Niu — Tue, 20 May 2025 00:00:00 GMT

From gig economy workers to digital marketing influencers to your local bodega, verifying a growing range of small businesses is essential to modern KYB.

That’s where Enigma’s new Social Security Number (SSN) Verification product comes in.

While many businesses use Employer Identification Numbers (EINs) to identify themselves, SSNs are also perfectly valid and used by countless businesses to pay workers and get paid for their hard work.

We have you covered no matter who comes knocking—Enigma’s new SSN Verification Product pairs with its existing EIN Verification to provide comprehensive identity verification for small businesses in your ecosystem.

What Our SSN Verification Does

Verifying the identity of a business through tax ID records is a common step in many Know Your Business (KYB) and business onboarding workflows. Enigma’s SSN Verification solution focuses on a critical question: does the SSN and name combination provided by a business match IRS records?

This simple but powerful check helps you confidently verify every business against the actual records submitted to and maintained by the U.S. government, even if they are sole proprietors or an informal partnership using an SSN as a tax ID. So even if the business you want to onboard is one of the millions that does not have an EIN, you can still confidently verify its identity using the most reliable tax ID check available.

With our SSN verification solution, you can now:

Confidently validate the identity of sole proprietors and unincorporated partnerships by verifying their tax IDs, even if they are SSNs
Reduce manual reviews and friction during onboarding
Expand access to credit, payments, and financial services for a critical segment of the small business economy

Built for Compliance, Designed for Scale

We know the pressure compliance teams face: satisfy strict regulations, manage risk, and keep your onboarding flow moving. Our SSN Verification product helps you do all three. Whether you're building out a KYB program from scratch or refining an enterprise-grade identity platform, this tool integrates seamlessly into your workflows. It’s not just about checking a box—it’s about enabling faster, smarter decisions with data you can trust.

At Enigma, we’re committed to helping financial institutions and tech entrepreneurs support legitimate businesses of all sizes—all while making their ecosystems safer and more sustainable.

Learn more about Enigma’s SSN verification capabilities in our documentation.

Ensuring Unparalleled Data Quality in Enigma's graph-model-1

Enigma — Wed, 14 May 2025 00:00:00 GMT

Executive Summary

Enigma's graph-model-1 represents the most comprehensive, accurate, and expressive representation of the U.S. business landscape available today. This whitepaper details the robust, multi-layered data quality framework that powers graph-model-1, explaining the methodologies, validation processes, and quality control mechanisms that ensure our data accurately reflects the real world.

For organizations relying on business data for critical functions—whether for compliance, marketing, risk assessment, or strategic planning—the quality of that data directly impacts operational effectiveness and business outcomes. Enigma's rigorous approach to data quality delivers measurable advantages, with validation metrics that consistently exceed industry standards.

Introduction: The Data Quality Imperative

Business data is only as valuable as it is accurate, descriptive, timely, and reliable. The quality challenges inherent in business data are substantial:

Businesses constantly form, evolve, and dissolve
Records become outdated within weeks or months
Information across sources frequently conflicts
The same business may have multiple manifestations (legal entities, brands, locations)
Data entry errors and inconsistencies are common in source records

These challenges are compounded when attempting to create a complete picture of the U.S. business landscape, which includes over 30 million active businesses operating across diverse industries, locations, and organizational structures.

While millions of business entities exist on paper, Enigma's graph-model-1 applies rigorous activity criteria to identify the 13 million "Marketable Brands" that demonstrate genuine market presence. This distinction represents businesses with verified operational signals, revenue generation, and complete attribution data. By identifying dormant entities, shell companies, and paper-only registrations, we ensure our customers build strategies on businesses with actual commercial activity.

Enigma's graph-model-1 addresses these challenges through a knowledge graph approach, combining multiple high-quality data sources with sophisticated entity resolution and linking models, all governed by a comprehensive quality assurance framework. This approach allows us to maintain data accuracy at scale, even as the business landscape continually changes.

Enigma's Multi-Layered Data Quality Framework

Foundation: Trusted Data Sources and Data Quarantine

At the foundation of our system is high-quality data from trusted sources. Enigma carefully vets all data sources to ensure our core records and decisioning rely only on data we can trust. We prioritize authoritative sources such as:

Government registries (Secretary of State filings)
- State business registrations
  - Annual report filings
  - Corporate dissolution records
  - Officer and director listings
- Federal employer identification (EIN) records
- Federal Trade Commission disclosures
Franchise disclosure documents
Medical provider lists
Federal licensing data
Other regulatory sources

Even with these trusted sources, we don't assume all information is correct or consistent across sources. We implement rigorous validation:

We verify and standardize more than 65 million addresses using the United States Postal Service's official US address database
We validate that over 50 million websites are accessible and functional
We quarantine questionable records with invalid or conflicting information for further evaluation
We maintain EnigmaDB, our internal database of record, to correct verified issues at the source, preventing error propagation

Data Pipeline Architecture and Refresh Cadence

Enigma's data pipeline is designed to maintain freshness while ensuring quality:

New source data is ingested daily
The complete data processing pipeline runs weekly
Each pipeline run processes all historical and new data
This approach captures business formations, closures, and operational changes in near real-time
Weekly refresh cycles support time-sensitive use cases like trigger marketing and risk assessment

High-Precision Models for Entity Resolution and Linking

Our data pipeline employs sophisticated statistical models for:

Entity Resolution: Resolving over 600 million raw brand records down to more than 45 million distinct brands and operating locations

Entity Linking: Connecting more than 8 million brands to their associated legal entities

Attribute Prediction: Determining key attributes like industry classification (NAICS codes)

For each model, we develop comprehensive ground truth datasets for validation and won't implement models unless they exceed high precision thresholds. We continuously benchmark new models against baseline heuristics and existing approaches to ensure improvements.

When models don't align with reality (e.g., a recent merger not yet reflected in registry data), we utilize EnigmaDB to manually correct assertions. These corrections not only improve current data but inform future model training.

Quality Validation Methodology

Enigma employs a multi-faceted approach to validation:

Gold-Star Dataset Validation

We maintain carefully curated ground truth datasets:

Brand & Operating Location Set: ~400 brands and 2,500 operating locations
Brand-to-Merchant Linking Set: ~100 brands with 200,000-300,000 merchant links
Store Location Revenue Validation Set: >20,000 stores with externally verified operating location revenues

These datasets are actively maintained and expanded, providing reliable benchmarks for each release. They're designed to cover diverse business types, industries, and edge cases.

Statistical Monitoring

We track key metrics across releases, including:

Fill rates for critical attributes
Revenue shifts and distribution
Projection stability between releases
Entity counts and distributions

Our systems automatically flag deviations beyond defined thresholds, triggering investigation before release.
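A release-over-release comparison of this kind can be sketched in a few lines of Python. The metric names, the threshold values, and the choice of relative change as the deviation measure are assumptions made for the example, not a description of Enigma's monitoring system.

```python
def flag_deviations(previous, current, thresholds, default_threshold=0.05):
    """Return metrics whose relative change between releases exceeds a threshold.

    previous / current: dicts of metric name -> value for two releases.
    thresholds: dict of metric name -> maximum allowed relative change.
    """
    flagged = {}
    for name, old in previous.items():
        if name not in current:
            flagged[name] = None  # the metric disappeared entirely
            continue
        if old == 0:
            change = 0.0 if current[name] == 0 else float("inf")
        else:
            change = abs(current[name] - old) / abs(old)
        if change > thresholds.get(name, default_threshold):
            flagged[name] = change
    return flagged

# Made-up metrics for two consecutive releases.
last_release = {"naics_fill_rate": 0.90, "entity_count": 1000}
this_release = {"naics_fill_rate": 0.80, "entity_count": 1010}
flagged = flag_deviations(last_release, this_release, {"naics_fill_rate": 0.02})
```

Here the fill rate moved by about 11% against a 2% allowance and is flagged, while the 1% change in entity count stays under the default threshold.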

Pipeline Quality Gates

Our pipeline incorporates two types of quality gates:

Blocking Checks: These enforce zero-tolerance or threshold-based requirements that halt pipeline progress until resolved. Examples include:

Duplicate detection
Relationship rule validation
Minimum coverage requirements

Alerting Checks: These monitor trends without blocking releases, providing visibility into data health over time.

We implement 492 data checks across 188 datasets and pipeline stages. These checks become increasingly stringent as data moves through the pipeline, ensuring issues are caught early.
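The difference between the two gate types can be sketched as follows. The `Check` structure, the `BlockedRelease` exception, the two example checks, and the row shape are all hypothetical names introduced for illustration; the actual checks and their thresholds are not described in detail here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    passes: Callable[[list], bool]  # returns True when the data is healthy
    blocking: bool

class BlockedRelease(Exception):
    """Raised when at least one blocking check fails."""

def run_gates(rows, checks):
    """Run every check; return alert names, or raise if a blocking check fails."""
    alerts, failures = [], []
    for check in checks:
        if check.passes(rows):
            continue
        (failures if check.blocking else alerts).append(check.name)
    if failures:
        raise BlockedRelease(", ".join(failures))
    return alerts

def no_duplicate_ids(rows):
    ids = [row["id"] for row in rows]
    return len(ids) == len(set(ids))

def website_fill_rate_ok(rows, minimum=0.5):
    filled = sum(1 for row in rows if row.get("website"))
    return bool(rows) and filled / len(rows) >= minimum

CHECKS = [
    Check("duplicate_detection", no_duplicate_ids, blocking=True),   # zero tolerance
    Check("website_fill_rate", website_fill_rate_ok, blocking=False),  # trend only
]

# Made-up rows: unique ids, but no websites filled in.
alerts = run_gates([{"id": 1, "website": None}, {"id": 2, "website": None}], CHECKS)
```

In this sketch a failed alerting check is reported and the run continues, while a failed blocking check stops the run by raising an exception.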

Continuous Monitoring and Quality Checks

Real-time Alerting System

Each week when we refresh our pipeline, the data undergoes comprehensive monitoring:

Data Freshness: Ensuring we have the latest available data across all sources
Threshold Alerting: Generating warnings or blocking releases when metrics exceed expected thresholds
Engineering Checks: Preventing regressions and enforcing consistency

Trend Monitoring

Before datasets go live, we verify completeness and consistency against previous releases:

Visual dashboards track trendlines across key metrics
Automated tools identify and surface significant changes
Before/after comparisons highlight distribution shifts

This approach helps identify subtle quality issues that might not trigger threshold alerts but could indicate emerging problems.

Automated and Human-in-the-Loop Validation

We combine automated checks with human expertise:

Automated anomaly detection identifies outliers and unusual patterns
Statistical comparisons against ground truth data validate model performance
Human verification (including LLM-assisted labeling) assesses attribute quality

Each month, we manually review 2,000-5,000 random samples per attribute, with precision targets such as:

95% for operating location attributes
80% minimum for other key fields
98% accuracy for industry NAICS codes
70% of card revenue estimates within ±30% of ground truth
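The last target is the least familiar of the four, so here is one way it could be computed. This sketch assumes the error is measured relative to the ground-truth value and that pairs with non-positive ground truth are excluded; the function name and sample figures are made up for the example.

```python
def share_within_tolerance(estimates, actuals, tolerance=0.30):
    """Fraction of estimates whose relative error is within the tolerance."""
    if len(estimates) != len(actuals):
        raise ValueError("estimates and actuals must be the same length")
    # Relative error is undefined for zero ground truth, so skip those pairs.
    pairs = [(e, a) for e, a in zip(estimates, actuals) if a > 0]
    if not pairs:
        raise ValueError("no pairs with positive ground-truth revenue")
    hits = sum(1 for e, a in pairs if abs(e - a) / a <= tolerance)
    return hits / len(pairs)

# Made-up sample: three of five estimates fall within 30% of the true value.
share = share_within_tolerance([120, 75, 131, 50, 100], [100, 100, 100, 100, 100])
```

A release would meet the stated target when this share is at least 0.70 on the validation set.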

Quality Metrics and Performance

Precision and Accuracy Statistics

graph-model-1 maintains exceptional quality metrics:

Entity Linking: 95% precision in connecting brands to their legal entities. 94% of brands have all links to their operating locations
Industry Classification: 98% accuracy for NAICS code assignments
Location Data: 95% precision for operating location status and addresses
Revenue Estimation: 70% of card revenue estimates within ±30% of actual values.

Focus: Revenue Estimation

This level of accuracy represents a significant achievement in small business revenue estimation, where traditional methods often fail due to:

Limited financial disclosure requirements for privately-held businesses
Limited or non-existent digital footprints for micro-enterprises
High variability in month-to-month revenue for seasonal or emerging businesses
Multiple payment channels that fragment transaction data

For businesses with annual revenues under $1M, industry standard estimates often vary by factors of 2-3x or more, making Enigma's ±30% precision particularly valuable for organizations seeking to effectively segment and target the SMB market. This granularity enables:

More precise market sizing of local business ecosystems
Accurate classification of businesses into appropriate revenue bands for propensity modeling
Reliable identification of high-growth micro-businesses before they appear in conventional datasets
Better allocation of marketing and sales resources based on actual revenue potential rather than proxy indicators

Real-world Performance Examples

Our quality framework delivers tangible business outcomes:

A leading fintech company streamlined their onboarding process using graph-model-1, reducing business verification time by over 40% and lowering false positives in risk flagging by 25%
A financial marketing firm saw their average lift in response improve by 208% using Enigma’s core business records, relative to other business data firms. That uplift jumped to 346% for that same use case when they incorporated Enigma’s revenue metrics as part of that process.
A financial services provider achieved 99.5% recall precision on transaction merchant linking

Business Impact and Conclusion

With graph-model-1, organizations can trust that their decisions—whether for compliance, marketing, risk assessment, or strategic planning—are grounded in accurate, high-quality data.

The business impacts are substantial:

For Compliance: Reduced false positives, faster onboarding, and lower manual review rates
For Marketing: Improved targeting precision, higher conversion rates, and better ROI
For Risk: Enhanced fraud detection, more accurate underwriting, and reduced exposure
For Strategy: Better market sizing, competitive intelligence, and opportunity identification

Enigma's comprehensive approach to data quality isn't just a technical achievement—it's a business driver that delivers measurable value across use cases.

To learn more about how Enigma's graph-model-1 can power your organization's decisions with confidence, contact us today.

New business survival: Texas vs California

Enigma — Tue, 06 May 2025 00:00:00 GMT

A Small Look at SMB in the BIG States

How do you know a business has relocated to Texas? To paraphrase the old joke, you don’t have to ask: They’ll tell you.

The cliche, of course, is not just that everything is bigger in Texas, but everything is better for business in Texas.

Taxes are lower, regulation is lighter, growth is higher. Or so the thinking usually goes.

The reality is more complex. Yes, Texas has no state income tax. But unlike California, Texas has uncapped local property taxes, leading to annual assessments that are routinely more than double the rate that a California homeowner could expect to pay, the libertarian Cato Institute found. That significantly narrows, and for some homeowners potentially eliminates, the tax-liability gap between the two states.

And while economic growth has been higher recently in Texas, per-capita GDP is higher in California.

In some places, the stereotype does match reality: Texas is indeed a right-to-work state, unlike California; California spends a lot providing health insurance to residents while Texas has the highest percentage of uninsured residents in the country.

Does any of this show up in business survival data? The answer, broadly, is not really:

We looked at operating locations in California and Texas that recorded their first card transaction revenue in January 2022 and followed them over the next few years.

In both states we see an expectedly massive drop in the survival rate early on. That’s just a fact of how businesses are run and it appears to be relatively constant between both states.
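For readers who want to reproduce this kind of cohort view on their own data, here is a minimal Python sketch. It assumes each location is summarized by its first and last month with card revenue, and that "surviving" at month k means revenue was still observed k months after the cohort month; both are simplifying assumptions for the example, not Enigma's definition.

```python
def month_index(year, month):
    """Map a calendar month to a single integer so months can be subtracted."""
    return year * 12 + (month - 1)

def survival_curve(locations, cohort_month, horizon):
    """Share of a cohort still transacting k months after its first month.

    locations: iterable of (first_month, last_month) month indexes.
    Returns a list of survival rates for k = 0 .. horizon.
    """
    cohort = [last for first, last in locations if first == cohort_month]
    if not cohort:
        raise ValueError("no locations in the requested cohort")
    return [
        sum(1 for last in cohort if last >= cohort_month + k) / len(cohort)
        for k in range(horizon + 1)
    ]

# Made-up data: four locations start in January 2022, one starts a month later.
jan22 = month_index(2022, 1)
locations = [
    (jan22, jan22),
    (jan22, jan22 + 2),
    (jan22, jan22 + 5),
    (jan22, jan22 + 5),
    (jan22 + 1, jan22 + 9),
]
curve = survival_curve(locations, jan22, 3)
```

Running the same function separately on each state's locations gives two curves that can be compared month by month. Locations still active at the end of the observation window only look like closures beyond that window, so the horizon should stay inside it.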

There’s a slight advantage to Texas, but it’s not massive and, tellingly, it wanes as the business ages, indicating that the biggest factors affecting business success on a large scale are not the significant variances in state policies, but other internal and external factors. In short, businesses want to be where their markets are; where there are sizable markets, business finds a way.

Of course, don’t push it. No one would recommend setting up an oilfield services company in Menlo Park or a small surf shop in Fort Worth.

Inside the Shopify Premium

Enigma — Thu, 03 Apr 2025 00:00:00 GMT

Welcome to the April 2025 Enigma newsletter, where big things are happening.

Our big news: Meet Enigma's next-gen data & platform, `graph-model-1`

Before we get to this month's big story, you really should meet graph-model-1, the most comprehensive, accurate, and expressive business data set.

Since 2023, we've been completely rebuilding our entity resolution framework, data processing engine, as well as the Enigma Console, where you can interact with this data and derive new insights.

Here, a “business” isn’t just a single branch or registration record; it’s the sum of everything we know: from its brand identity and legal frameworks to the physical footprints where it operates, the individuals who hold key roles or ownership stakes, and so much more.

With graph-model-1 you can:

Discover emerging businesses: surface newly established companies using minimal input—such as a partial brand name or approximate location—and follow their trajectory as they evolve.
Monitor operational footprints: combine location intelligence with card revenue data to detect expansions, closures, or operational shifts—ideal for market analysis and financial forecasting.
Segment by behavioral insights: filter entities based on semantic understanding of their online presence, transaction patterns, customer feedback, or other key indicators—without relying on rigid, predefined search parameters.

It's all possible because of:

Enhanced entity linking & accuracy: graph-model-1 drastically reduces inaccurate entity resolution (false positives and negatives)—even in complex entity relationship scenarios where traditional systems often fail.
Expanded data coverage: now covering 49.8M brands and operating locations, 98M registered entities, and linking $7.1T in annual card revenues, we capture newly formed or fast-evolving businesses and operational changes in near real-time.
Brand entity embeddings for semantic filtering: identify and group businesses based on nuanced brand relationships, enabling more intuitive queries and segmentation across franchise networks or complex corporate structures.

We're inviting partners to get early access to the API and Console today by signing up for our Research Preview. We're eager for close collaborators and technical partners.

Inside the Shopify Premium

How Shopify outperforms its competition online

Shopify is the dominant online commerce platform, outpacing competitors like Wix, Squarespace, and WooCommerce.

Digging deeper using Enigma’s graph-model-1 data, a far more interesting theme emerges: dominant online sellers are choosing Shopify over its competitors.

Or at least, that seems to be the case.

For instance, here’s clothing stores that use Shopify vs those who don’t:

The premium looks roughly the same for candle stores that use Shopify (clearly, someone who is good at economy is helping candle shops):

If it seems that clothing and candle shops were chosen at random, don’t worry – they weren’t. They’re among a cohort of business types where the Shopify revenue premium – the difference in revenue between merchants using Shopify and those who aren’t – is among the highest.
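
As a rough sketch of how a per-industry premium could be computed from the definition above: the merchant records and revenue figures are hypothetical, and comparing simple averages is an assumption made for illustration (a median or a matched comparison may be closer to what the charts show).

```python
from statistics import mean

# Hypothetical monthly revenue figures (USD) per merchant; not Enigma data.
merchants = [
    {"industry": "clothing", "uses_shopify": True, "revenue": 42_000},
    {"industry": "clothing", "uses_shopify": True, "revenue": 38_000},
    {"industry": "clothing", "uses_shopify": False, "revenue": 25_000},
    {"industry": "clothing", "uses_shopify": False, "revenue": 15_000},
    {"industry": "jewelry", "uses_shopify": True, "revenue": 21_000},
    {"industry": "jewelry", "uses_shopify": False, "revenue": 20_000},
]


def shopify_premium(rows, industry):
    """Average revenue of Shopify merchants minus that of all other merchants."""
    on = [r["revenue"] for r in rows
          if r["industry"] == industry and r["uses_shopify"]]
    off = [r["revenue"] for r in rows
           if r["industry"] == industry and not r["uses_shopify"]]
    return mean(on) - mean(off)


for industry in ("clothing", "jewelry"):
    print(industry, shopify_premium(merchants, industry))
```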

While this data alone doesn’t say anything about causation per se, you might assume that the Shopify premium exists in industries where Shopify has grabbed outsize market share. But that isn’t the case. Jewelry is the category with the highest Shopify penetration, but look at this chart: a very marginal revenue premium for the Shopify merchants across most of the year, and then none whatsoever during the winter holiday season, their biggest period of the year.

Zooming out helps piece together what’s going on.

Coffee shops, pizza places, shoe stores, restaurants, and vitamin and supplements shops consistently show the highest Shopify premium: industries characterized by numerous, small dollar value purchases, physical storefronts combined with online purchases, relatively simple inventories, and a fairly steady churn of failed businesses followed by new entrants.

What industries consistently show the worst Shopify premium? Furniture and wholesale. Industries characterized by large players with established manufacturing capabilities, extensive logistics operations, large per ticket value, a massive number of SKUs, and a comparatively low number of new entrants.

The Shopify premium shows that for Shopify, the customer acquisition process is a search for winners.

The merchants with the highest Shopify premium are in industries with extremely high failure rates. Over time, Shopify cannot succeed as a service provider in those industries by simply maintaining a steady customer base. It must find winners among new entrants. At scale, that means onboarding a lot of eventual losers and holding on to the few winners.

The Shopify premium exists in industries where that’s a viable strategy. It works well for that new coffee shop around the corner that just might make it, but less well for a decades-old supplier of industrial parts to commercial buyers.

(And somehow I made it through writing this whole section without once accidentally writing Spotify Premium.)

Datapoints

A Soundcloud Rapper takes over a major Wall Street investment bank (and no, it's not DJ D-SOL) - Bloomberg
Does luck exist? - NY Magazine
Women pay 1.5% less for car insurance, but your credit score is likely to have a bigger impact on your premium than gender - CNBC
China's Mexican Tariff-Dodge Looks Doomed - WSJ
Reporters made a memecoin and could've rug-pulled their way to a 60,000% profit - Slate
The Fintech-focused bank that won't let you withdraw - Coffeezilla (YouTube)
Consumer goods imports rose 24% in February as tariff frontrunning surged (or as people stocked up on supplies for their Liberation Day parties) - NYT

That's it for this month. Send any feedback, story tips, or up-and-coming Fintech Soundcloud Rapper cosigns to updates@enigma.com.

Tracker - Our New Interactive App Powered by Enigma graph-model-1

David Riordan — Mon, 31 Mar 2025 00:00:00 GMT

A few months ago, I got a text message out of the blue from Hicham, our CEO, with only an unrecognized link. He and the team had just rebuilt the search engine for graph-model-1 together, so late one night, he decided to test its new spatial search capabilities by building a business radar. This became Enigma Tracker, a proof-of-concept for geolocating and profiling nearby businesses in real-time.

Enigma Tracker’s only possible because graph-model-1 now offers geocoded data for the operating locations of every business (and we now make that data accessible to our customers). Enigma Tracker uses this to find any businesses within 300 meters of wherever you stand and our Console uses this to show this same data for a single brand. Behind the scenes, we use this spatial data to create more accurate revenue data assignment across a brand’s locations, and to tell you how a given operating location’s income performs relative to similar businesses nearby.
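
To make the "within 300 meters" idea concrete, here is a minimal sketch of a radius search over geocoded locations using the haversine great-circle distance. The business names and coordinates are made up, and the linear scan is purely illustrative; a production search engine would presumably use a spatial index instead.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(locations, lat, lon, radius_m=300):
    """Names of locations within radius_m of (lat, lon), nearest first."""
    hits = []
    for loc in locations:
        distance = haversine_m(lat, lon, loc["lat"], loc["lon"])
        if distance <= radius_m:
            hits.append((distance, loc["name"]))
    return [name for _, name in sorted(hits)]


# Hypothetical operating locations.
locations = [
    {"name": "Hypothetical Deli", "lat": 40.7420, "lon": -73.9897},
    {"name": "Hypothetical Cafe", "lat": 40.7410, "lon": -73.9870},
    {"name": "Far Away Hardware", "lat": 40.7510, "lon": -73.9897},
]

print(nearby(locations, lat=40.7410, lon=-73.9897))
```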

It also gets to the heart of what Enigma is all about: businesses are everywhere. Now you can know what those businesses are, where they are, what they’re doing, and whether you want to be working with them.

If you want to play with the Enigma Tracker, fill out the graph-model-1 Invite Form and let us know you want to give the tracker a try.

How Enigma Uses Embeddings to Power Business Discovery

Enigma — Fri, 28 Mar 2025 00:00:00 GMT

Lead discovery is hard—even when you have the right data. Business attributes like industry codes, revenue, or location are helpful, but they often fail to capture what someone really means when they describe their ideal customer. "witch supply stores," "boutique hotels with a wellness focus," "startups with a minimalist brand aesthetic"—these aren’t easy to translate into precisely specified queries.

Enigma’s initial Lead Discovery tools gave our internal teams the power to work with business descriptions in a taxonomy, but it assumed deep familiarity with our data model to pull out nuance. Users needed help navigating it, and even for experienced users, surfacing the right set of leads was often a multi-step, trial-and-error process.

We asked ourselves: "What if you could describe your ICP (Ideal Customer Profile) in natural language and get back real, viable leads—instantly?"

When we built the Enigma Explorer, we weren’t just redesigning a UI—we were rethinking the experience of business discovery entirely.

That meant:

Accepting that customer descriptions often start fuzzy, not precise
Designing for iteration, not just execution
Supporting high-level intent with structured data, not forcing users to pick filters blindly

We realized pretty quickly that dropdowns and filters alone wouldn’t get us there. Our structured data is rich, but for many use cases, it’s not expressive enough. We needed a way to capture the feel of a business—not just its tax classification.

This is where embeddings come in.

At a high level, embeddings are a way to turn complex things—like websites, descriptions, or business profiles—into numbers. Not just any numbers, but numbers that reflect meaning. Similar businesses have embeddings that are close together. That lets us compare businesses in a way that reflects real-world similarity, not just shared NAICS codes.

Our in-house data model, graph-model-1, includes embeddings generated from a massive dataset of business websites and metadata. It doesn’t just look at the words on a site—it learns patterns about design, tone, structure, and more. It captures the latent characteristics of a business: is it tech-forward? Family-run? Sustainability-focused? Traditional or modern?

Using these embeddings, Enigma can compare a customer’s ICP description or reference list to every business in our database—and surface those that are semantically closest. Think of it as a similarity search for business identity.
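
A toy version of that similarity search is sketched below. The three-dimensional vectors and business names are invented for illustration (real embeddings have far more dimensions), and cosine similarity is assumed as the closeness measure; the post does not say which metric is actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical business embeddings.
business_embeddings = {
    "Moonlit Apothecary": [0.9, 0.1, 0.0],
    "Downtown Tax Prep": [0.0, 0.2, 0.9],
    "Crystal & Candle Co.": [0.8, 0.3, 0.1],
}


def closest(query_embedding, embeddings, k=2):
    """Return the k business names whose embeddings are most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_embedding, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Stand-in for the embedding of a prompt such as "witch supply stores".
query = [1.0, 0.2, 0.0]
print(closest(query, business_embeddings))
```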

How It Works

Here’s what’s under the hood:

100M+ business websites in our database
A custom scraping engine that continuously crawls and archives those sites
A 10TB+ historical web archive that captures changes over time
GPU clusters that generate embeddings at scale, processing millions of pages per hour
A vector database with optimized HNSW indices for real-time search (p95 = 5s)

This system allows us to compute and update embeddings at internet scale, and search across the full set in milliseconds. When a customer enters a prompt like "luxury pet spas," we can instantly return the closest matches—even if those businesses don’t explicitly use that phrase.

What It Enables

With embedding-powered discovery, our customers can:

Go from vague descriptions to real leads in seconds
Find businesses that look and feel like their best customers
Unlock segments that aren’t easily defined with structured filters
Combine traditional attributes (like revenue or location) with semantic similarity for precise targeting
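
The last point, mixing structured attributes with semantic similarity, might look something like the sketch below: filter on hard attributes first, then rank what remains by embedding similarity. Field names, thresholds, and vectors are all hypothetical, and filter-then-rank is just one plausible ordering of the two steps.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical records with two-dimensional toy embeddings.
businesses = [
    {"name": "Spa A", "state": "NY", "annual_revenue": 900_000, "embedding": [0.9, 0.1]},
    {"name": "Spa B", "state": "NY", "annual_revenue": 120_000, "embedding": [0.95, 0.05]},
    {"name": "Spa C", "state": "TX", "annual_revenue": 2_000_000, "embedding": [0.9, 0.2]},
    {"name": "Garage D", "state": "NY", "annual_revenue": 1_500_000, "embedding": [0.1, 0.9]},
]


def search(rows, query_embedding, state, min_revenue, k=5):
    """Apply structured filters, then rank the survivors by similarity to the query."""
    candidates = [
        r for r in rows
        if r["state"] == state and r["annual_revenue"] >= min_revenue
    ]
    candidates.sort(key=lambda r: cosine(query_embedding, r["embedding"]), reverse=True)
    return [r["name"] for r in candidates[:k]]


results = search(businesses, [1.0, 0.0], state="NY", min_revenue=500_000)
print(results)
```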

It also opens the door for more intelligent iteration. Users can explore clusters, refine their definitions, and uncover patterns they didn’t know to look for.

We see this as the foundation for a much more flexible discovery engine. Embeddings let us model not just who a business is, but how it evolves. That means:

Tracking businesses as they change over time
Powering recommendations and lookalike audiences
Surfacing new leads as they emerge, based on real-time shifts in identity

Over time, we believe this will enable more proactive, personalized, and dynamic workflows—discovery that adapts to your strategy, not the other way around.

Start Exploring the Business Landscape

We’re just getting started, but embedding-powered business discovery is already transforming how our customers explore the world of small business. Want to try it for yourself? Sign up for the Enigma graph-model-1 research preview and get exploring.

Announcement: Enigma graph-model-1 (Research Preview)

Enigma — Tue, 11 Mar 2025 00:00:00 GMT

In late 2023, we began transforming Enigma’s data and machine learning infrastructure to capture the complexity of the U.S. business landscape in far greater detail. Today, we’re proud to introduce the new version of Enigma, graph-model-1, an evolved entity resolution framework and data model powering Enigma that maps brands, legal entities, operating locations, and the people who power them—all woven together in a 2.4B-node knowledge graph.

graph-model-1 is the most comprehensive, accurate, and expressive business data set

Enhanced entity linking & accuracy: graph-model-1 drastically reduces inaccurate entity resolution (false positives and negatives)—even in complex entity relationship scenarios where traditional systems often fail.
Expanded data coverage: now covering 49.8M brands and operating locations, 98M registered entities, and linking $7.1T in annual card revenues, we capture newly formed or fast-evolving businesses and operational changes in near real-time.
Brand entity embeddings for semantic filtering: identify and group businesses based on nuanced brand relationships, enabling more intuitive queries and segmentation across franchise networks or complex corporate structures.

What You Can Do With graph-model-1

Target franchise networks: Pinpoint franchisor–franchisee relationships by focusing on a parent brand, then discover affiliated locations or legal entities.
Discover emerging businesses: surface newly established companies using minimal input—such as a partial brand name or approximate location—and follow their trajectory as they evolve.
Monitor operational footprints: combine location intelligence with card revenue data to detect expansions, closures, or operational shifts, ideal for market analysis and financial forecasting.
Segment by behavioral insights: filter entities based on semantic understanding of their online presence, transaction patterns, customer feedback, or other key indicators—without relying on rigid, predefined search parameters.

Research Preview

We’re currently onboarding close partners and technical collaborators to validate graph-model-1’s capabilities. Over the coming weeks, we’ll share more about the additional features we’re rolling out—and demonstrate how we can unlock new levels of insight for the hungry algorithms that drive decisions off of Enigma. If you’re interested in applying graph-model-1 to your business, our waitlist is now open and invites will be rolling out soon.

https://www.youtube.com/watch?v=aNKDBAIRIqU

Tracking NYC's Weed Bodegas

Enigma — Thu, 06 Mar 2025 00:00:00 GMT

We're bringing you some of the most interesting stories about the state of U.S. business that we're seeing in Enigma data, the data that's caught our eye lately as we've been building Enigma's next-generation platform, and sometimes a few tidbits of product news.

This month's big story: we're looking at the rise and fall of illegal weed bodegas in NYC through the revenue impact of an enforcement crackdown on their legal counterparts.

Weed Bodegas of NYC

Where the revenue tells the story.

If you’ve walked down the street in New York City over the past few years, you’ve probably noticed it: illegal weed shops are everywhere.

On blocks with commercial storefronts, it seems like there’s at least one store, and often several, selling cannabis without a license. Some are self-styled membership clubs, others occupy former bodegas and also sell candy and snacks, while some are weed-only shops with a hulking security guard and a welcoming “No Ski Masks Allowed” sign at the door.

What they all have in common is that they don’t have licenses to legally sell what they’re selling.

These grey-zone weed shops sprang up because New York State’s 2021 marijuana legalization law created a clunky licensing process that delayed fully legal shops from opening, creating additional consumer demand without commensurate supply, while failing to clearly lay out who was responsible for shutting down unlicensed commercial vendors.

After a tweak to state law last year cleared up the enforcement confusion, New York City massively ramped up enforcement with a plan called Padlock to Protect. The resulting revenue boost for licensed shops appears to be continuing, despite a successful legal challenge to the city’s enforcement program (the City has said it will appeal the ruling and continue locking unlicensed shops).

Revenue data analyzed by Enigma shows that Padlock to Protect had a significant impact on legal weed sales, a clear indication that the program shuttered enough unlicensed stores that it pushed consumers to increase purchases from fully state-licensed vendors.

While both the total revenue and number of legal dispensaries continued to grow each month through the life of the program, there's a noticeable jump in revenue from May 2024 to June 2024, just as enforcement ramps up.

And that change is really something when you look at the monthly change in revenue per store: even as the number of stores grows, there's a 45% jump in per-store revenue.
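
For readers who want the arithmetic spelled out, per-store revenue is total revenue divided by store count, and the month-over-month change compares those two averages. The dollar figures and store counts below are made up, chosen only so the result lands on a 45% jump while the store count grows.

```python
def per_store_change(prev_revenue, prev_stores, curr_revenue, curr_stores):
    """Fractional month-over-month change in average revenue per store."""
    prev_avg = prev_revenue / prev_stores
    curr_avg = curr_revenue / curr_stores
    return (curr_avg - prev_avg) / prev_avg


# Hypothetical figures: total revenue and store count for two consecutive months.
change = per_store_change(
    prev_revenue=10_000_000, prev_stores=100,
    curr_revenue=15_950_000, curr_stores=110,
)
print(f"{change:.0%}")
```

Because the store count is in the denominator, per-store revenue can only rise while stores are being added if total revenue grows faster than the store count does.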

Meanwhile, we're still getting the full picture of what that enforcement campaign looked like. Data obtained by Enigma from New York City also shows that the city’s enforcement from May through August was heavily focused on Manhattan, with a high concentration of raids on the Upper East Side, the Upper West Side, the Lower East Side and the East Village:

This heatmap displays inspections/raids done by the NYC Department of Consumer and Worker Protection and the New York City Sheriff’s Office as part of Operation Padlock to Protect between May 2024 and August 2024; data for inspections/raids done by the Sheriff’s Office alone as well as those performed between September and December 2024 are not available.

Really makes you want to know who you're doing business with. Enigma's here to help you with exactly that!

Let's check in on some datapoints:

The average price of a dozen eggs is up 7x from two years ago - U.S. Dept. of Agriculture
The Ford F150 Lightning boom is definitely over - Bloomberg
Hurricanes are killing in new ways, as freshwater flooding rises - NY Times
Who buys from AliExpress, Temu, and Shein? Disproportionately, low-income Americans - NBER
Climate-induced shortages have caused cocoa prices to surge nearly 50% since Halloween - Trading Economics

Stay tuned: we've got more data stories (and some pretty big developments) coming out of Enigma in the next few weeks.

The Ozempic Craze - Weight Loss Industry Same Month YoY Revenue Growth

Enigma — Fri, 01 Nov 2024 00:00:00 GMT

Top weight loss players offering Ozempic have shown consistent YoY growth since 2021. These spikes came immediately after various GLP-1 drugs were put on the FDA Drug Shortage List, thus allowing various compounding pharmacies to make alternative GLP-1 drugs and sell them directly to consumers.

Traditional weight loss players offering meal plans, such as Jenny Craig or Weight Watchers, have suffered amid this Ozempic craze. On the other hand, the growing providers of Ozempic include compounding pharmacies like Shed-RX and DTC providers of weight loss drugs made by compounding pharmacies, like Henry Meds.

Enigma's data can help you learn more about emerging industry trends to help you find and target high-growth prospects.

Hicham Oudghiri & Karen Mills on Fintech and Small Business Lending

Maile McCann — Thu, 31 Oct 2024 00:00:00 GMT

Former SBA administrator, HBS fellow, and author Karen Mills and Enigma CEO Hicham Oudghiri chat about the launch of the second edition of Mills' book, "Fintech, Small Business & The American Dream." Hear more about their thoughts on topics like:

How the 2008 financial crisis prepared Mills for the Covid-19 crisis in 2020
The differences in SMB data and tools in the U.S. vs other countries
The role of banks in an increasingly modern SMB lending landscape

The State of Small Business Today: An Interview Between Karen Mills and Diana Ransom

Maile McCann — Wed, 30 Oct 2024 00:00:00 GMT

Inc Executive Editor Diana Ransom interviewed former SBA administrator, HBS fellow, and author Karen Mills about the state of the small business economy at a recent Enigma rooftop event. Learn more about their thoughts on:

The modern role of the SBA & how the organization handled the Covid-19 crisis
How technology can help solve the SMB funding gap
The role of AI in the SMB economy

Data Driven Insights Into The SMB Restaurant Landscape

Maile McCann — Fri, 25 Oct 2024 00:00:00 GMT

The restaurant industry took a major dip in revenue growth in 2020 amid Covid-19 shutdowns and – despite a rebound in 2021 – restaurants struggled to find their footing in 2023.

Using our card panel data, which covers over 40% of transactions in the US and includes a long tail of small and medium-sized restaurants, Enigma wanted to examine the state of the restaurant industry today. We look at everything from what drink trends are cool today to how different geographies continue to be affected by Covid-19 to the market share of different-sized restaurant chains.

Second Tier Cities Are Number One

Smaller US cities fared better than larger ones during Covid-19 and continue to grow at faster rates today.

Denver experienced a Covid-19 boom, outperforming 2017 restaurant revenues even in the height of Covid-19. In 2023, revenues for the city were up, but nowhere near mid-2020 and late-2022.

Denver was something of an outlier during this period - most other large or growing cities saw their restaurant revenues fall to ~40% of their January 2017 index during Covid, rebound over the next year to well above pre-Covid totals, flatten for 2 years, and decline slightly in 2023.
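
Indexing a revenue series to a base month, as in the "~40% of their January 2017 index" figure, just means dividing every month by the base month's value. A small sketch with hypothetical revenue numbers (the month keys and the scale to 100 are illustrative choices):

```python
def index_to_base(series, base_month):
    """Rescale a {month: revenue} series so that the base month equals 100."""
    base = series[base_month]
    return {month: 100 * value / base for month, value in series.items()}


# Hypothetical monthly restaurant revenue for one city, in $ millions.
revenue = {"2017-01": 50.0, "2020-04": 20.0, "2021-06": 60.0}

indexed = index_to_base(revenue, "2017-01")
print(indexed)
```

Indexing this way lets cities of very different sizes share one chart, since each line starts at the same value.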

Chain Dominance

It’s a great time to own a chain restaurant – or to offer them services.

Less than half of restaurants in the US have multiple locations. However, these chains take up an outsized share of Gross Processing Volume (GPV), or the total value of transactions that pass through a payments system.

Businesses with 10+ locations account for 52% of total restaurant GPV in the U.S. while single location restaurants win only 29% of total restaurant GPV.
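
The share calculation behind figures like these can be sketched as follows. The restaurants and GPV values are hypothetical, picked so the output mirrors the 52% and 29% above, and the middle "2-9 locations" bucket is an assumption; the post only names the 10+ and single-location groups.

```python
def bucket(location_count):
    """Assign a business to a size bucket by its number of locations."""
    if location_count >= 10:
        return "10+ locations"
    if location_count >= 2:
        return "2-9 locations"  # assumed middle bucket, not named in the post
    return "single location"


def gpv_share(restaurants):
    """Share of total GPV captured by each size bucket."""
    totals = {}
    for r in restaurants:
        key = bucket(r["locations"])
        totals[key] = totals.get(key, 0) + r["gpv"]
    grand_total = sum(totals.values())
    return {key: value / grand_total for key, value in totals.items()}


# Hypothetical businesses; GPV is in arbitrary units.
restaurants = [
    {"name": "Corner Bistro (hypothetical)", "locations": 1, "gpv": 29},
    {"name": "Regional Grill (hypothetical)", "locations": 4, "gpv": 19},
    {"name": "National Chain (hypothetical)", "locations": 250, "gpv": 52},
]

shares = gpv_share(restaurants)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```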

Fine Times for Fine Dining

Fine dining is finding a foothold despite economic pressures.

While low-cost and mid-cost restaurants are struggling to grow beyond pre-pandemic highs, restaurants with an average ticket of $75-$100 or over $100 have doubled their share of total restaurant revenue since 2017.

Bubble Tea Bubbles Up

Bubble tea leads the drinks pack. Average annual revenue growth of bubble tea shops outpaces wine bars and juice shops.

Leading tea peers in revenue growth are i-Tea, Boba Love, Tea Top, Da Boba, and Tea Time.

The Infatuation Effect

While restaurants in cities across the US experience middling or negative growth, restaurants featured in The Infatuation’s Top 25 Restaurants lists outpaced them. This also holds true across pricepoint: only high-end restaurants in San Francisco and Los Angeles beat out The Infatuation's Top 25.

Better Target, Segment and Engage Restaurants

Enigma's data can help you learn more about the near-real-time revenues, locations, payment technologies and more of restaurants of any size across the US to help you better find and engage high-growth prospects.

The Ozempic Craze - Largest Price Increase in Beauty and Personal Care

Enigma — Thu, 24 Oct 2024 00:00:00 GMT

Average ticket size of weight loss providers grew more than any other beauty and personal care service, amid explosive growth in programs using Ozempic and other semaglutide medications. This weight loss fascination was consistent across the US.

The top-growing providers of these weight loss services include compounding pharmacies like Shed-RX and DTC providers of weight loss drugs made by compounding pharmacies, like Henry Meds.

Enigma's data can help you learn more about emerging industry trends to help you find and target high-growth prospects.

Enigma Product Updates: Q3 2024

Eliza Cooke-Yarborough — Thu, 24 Oct 2024 00:00:00 GMT

In Q3 2024, Enigma celebrated the one year anniversary of our Enigma KYB & Onboarding product.

We launched our KYB product in October 2023, amid rising payments fraud and increased regulatory scrutiny for our financial institution customers. These trends continue today, making robust and proactive AML processes more critical than ever.

At launch, early adopters of Enigma KYB saw 1.5X higher instant verification rates compared to other KYB providers and reached up to 80% savings in onboarding costs. Today, our customer base has tripled and business match rates have risen to over 85%. Learn more about the updates we’ve made in the last quarter to our KYB product, and how we can serve your risk and compliance needs.

Q3 2024 KYB Updates

In Q3, our primary focus was improving instant match rates by updating our matching algorithm so our customers could instantly verify more customers and reduce manual reviews. Our changes focused on creating better rules to handle legal identifiers in business names (e.g. “LLC”) to make it easier to find matching entities for slightly differing inputs. We avoided creating false positives in this process by pairing these broadened name searches with validation through information like matching addresses to ensure the correct match was returned.
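
The general idea, loosening the name comparison while keeping an address check as a guard against false positives, can be illustrated with a small sketch. This is not Enigma's matching algorithm: the suffix list, the normalization rules, and the exact-match address comparison are all assumptions made for the example.

```python
import re

# Assumed, non-exhaustive list of legal identifiers to ignore when comparing names.
LEGAL_SUFFIXES = {
    "llc", "inc", "corp", "co", "ltd", "lp", "llp",
    "incorporated", "corporation", "company", "limited",
}


def _tokens(text):
    """Lowercase, drop periods (so 'L.L.C.' becomes 'llc'), split on other punctuation."""
    cleaned = text.lower().replace(".", "")
    return re.sub(r"[^a-z0-9 ]", " ", cleaned).split()


def normalize_name(name):
    """Strip trailing legal identifiers so 'Acme Widgets, LLC' matches 'Acme Widgets'."""
    tokens = _tokens(name)
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_address(address):
    """Case- and punctuation-insensitive form of an address string."""
    return " ".join(_tokens(address))


def is_match(submitted, candidate):
    """Accept the broadened name match only if the address also agrees."""
    same_name = normalize_name(submitted["name"]) == normalize_name(candidate["name"])
    same_address = (normalize_address(submitted["address"])
                    == normalize_address(candidate["address"]))
    return same_name and same_address


submitted = {"name": "Acme Widgets LLC", "address": "123 Main St., Suite 4"}
on_file = {"name": "ACME WIDGETS", "address": "123 main st suite 4"}
elsewhere = {"name": "Acme Widgets Inc.", "address": "9 Elm Ave"}

print(is_match(submitted, on_file), is_match(submitted, elsewhere))
```

The second comparison shows why the address check matters: once legal identifiers are ignored, two unrelated businesses that share a name would otherwise look identical.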

With these changes, we saw match rates improve by roughly 17% in relative terms (about 12 percentage points), from 73% to over 85% on average. Some customers are seeing match rates consistently above 90%.
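
A quick check of how those numbers fit together, assuming the 17% figure is a relative improvement rather than a percentage-point gain (the post does not say which, but only the relative reading is consistent with the two rates):

```python
before = 0.73  # average match rate before the changes
after = 0.85   # the post says "over 85%", so this is a lower bound

# Absolute gain in percentage points, and gain relative to the starting rate.
point_gain = (after - before) * 100
relative_gain = (after - before) / before

# The rate implied by a full 17% relative improvement on 73%.
implied_after = before * 1.17

print(f"{point_gain:.1f} points, {relative_gain:.1%} relative, "
      f"17% relative implies {implied_after:.1%}")
```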

We also improved our business address verification processes to accommodate minor discrepancies in street numbers and close proximity addresses.

Finally, we are creating additional flexibility in our product, so customers can choose the packages and data attributes they need to onboard with confidence. Tasks are our modular data building blocks that help you determine whether a business you submit is valid and meets your specific KYB requirements or requires manual review. In Q3, we created two new Tasks that verify business names and addresses solely against Secretary of State (SoS) filings – vs. SoS filings and other data sources – for customers whose KYB processes center around SoS as a single source of truth.

We’ve also removed our watchlist verification process – screening UBOs against OFAC watchlists – from our standard Verify package to reduce upstream dependencies and decrease costs. Instead, watchlist verification is now offered as an optional add on for maximum flexibility.

Enigma KYB & Onboarding Today

These new features add even more flexibility and ROI to our existing KYB & Onboarding product which includes features like:

Address and name verification,
Secretary of State (SoS) business registration filing verification,
Taxpayer Identification Number (TIN) verification,
People and UBO verification,
Industry and risky activities review, and
Watchlist screening.

We aim to streamline your KYB process, help you onboard more customers, and meet regulatory requirements with less overhead and more savings. We offer both direct integration as well as integration into your broader data orchestration platforms like Alloy or Taktile.

Want to learn more?

Current customers can reach out to their CS representatives with questions and feedback. If you’re new to Enigma and interested in our Sales & Marketing or Onboarding & KYB products, please get in touch.

Beauty and Personal Care - Digitalization of Beauty Services

Enigma — Mon, 14 Oct 2024 00:00:00 GMT

Digitalization of services is increasingly driving buzz in the beauty industry, but the % of stores providing online salon booking and % of stores that have any website at all vary by service type.

Hair extensions, facial spas, and medspas are most likely to have both, but are overall a small percentage of total beauty locations. Hair salons, hair removal services, nail salons and barber shops, meanwhile, have far more stores in the US, but fall behind in digital sophistication.

Enigma's data can help you learn more about the technologies SMBs use to help you better target, segment and engage prospects and customers in the industries you care about.

Beauty and Personal Care - Beauty Salons Monthly Opening and Closures

Enigma — Tue, 08 Oct 2024 00:00:00 GMT

The beauty salon industry hasn’t fully recovered since Covid-19.

New salon openings have dwindled in 2023 and 2024, while closures of existing beauty salons (i.e., salons with no revenue in a three-month lookback period) have been consistently more frequent than pre-pandemic.
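
A closure rule like the one in parentheses can be sketched as a scan for a run of zero-revenue months. The revenue series is hypothetical, and dating the closure to the first month of the zero run is an assumption; the post only states the three-month window.

```python
def closure_month(monthly_revenue, lookback=3):
    """Index of the first month in the first run of `lookback` zero-revenue months.

    Returns None if no such run exists, i.e. the business is treated as open.
    """
    zero_run = 0
    for i, revenue in enumerate(monthly_revenue):
        zero_run = zero_run + 1 if revenue == 0 else 0
        if zero_run == lookback:
            return i - lookback + 1
    return None


# Hypothetical salon: one slow month, a recovery, then revenue stops for good.
salon = [5_000, 4_000, 0, 3_000, 0, 0, 0, 0]
print(closure_month(salon))
```

Requiring several consecutive zero months keeps a single missed month, like the dip at index 2 above, from being counted as a closure.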

Enigma's data can help you learn about openings, closings and the financial health of businesses across different industries.

Beauty and Personal Care - Personal Care Industry Revenue

Enigma — Thu, 03 Oct 2024 00:00:00 GMT

After a large dip in monthly revenue growth during peak Covid, personal care has rebounded beyond pre-Covid levels. Consistently, the category drops during January of each year and rises during summer and the December holiday season.

Enigma's data can help you learn more about the near-real-time revenues of SMB accounts across the U.S. in different industries to help you better target, segment and engage prospects and customers.

The Changing Financial Crime Landscape Part 2: TikTok Scams, Alternative Data, and Better KYB

Maile McCann — Mon, 30 Sep 2024 00:00:00 GMT

According to Nasdaq, money laundering reached $3.1 trillion globally in 2023, funding human trafficking, drug trafficking, and terrorist financing. Losses from fraud, meanwhile, reached $485.6 billion. Moreover, financial institutions were hit with 80 regulatory fines in H1 2024, totaling over $263 million, a $62 million increase from the same period the previous year. Amid this, financial institutions need to educate and protect themselves against increasingly sophisticated – and expensive – fraud.

Enigma chatted with Luke Raven – an expert in AML, compliance, and financial crimes – about how financial crime has changed over time in Part One of this two part series. We continue our conversation here on the unique strategies FIs have implemented to prevent financial crime, the rise of no-doc loan scams, and moving beyond ticking boxes in KYB and KYC compliance.

This interview has been edited for clarity and length.

Should FIs think about their KYB process within the context of meeting regulatory requirements - ticking the box on a standard set of data collection – or can compliance processes be more? Can FIs build better AML processes without creating more customer friction?

I care the most about financial crime, but that's why I will never be the CEO of a bank, because CEOs have to balance a lot of things. It doesn't make sense to have a whole product suite that you've spent millions of dollars to develop, a whole marketing campaign that you spent millions of dollars to roll out, and then you say, well, yeah, but we're only going to allow 5% of people to come on board that are willing to go through a higher friction onboarding process. I care deeply about financial crime, and I can sit in this space and say, “that's not even going to stop your bad guys, not your sophisticated bad guys, anyway.”

You need to balance business and compliance. And a lot of the time in the past, we've got that wrong and just focused on all business, no compliance. Instead, you need to have something where you can check the boxes, allow people on, and understand that KYC and KYB are the first step in a long process of weeding out bad guys.

So if you can turn away bad guys at the start, that's fantastic, but KYB is less about that, in my view, and more about understanding applicants so that you can then apply appropriate monitoring to that customer. If you learn more at KYB, you'll have much less friction throughout that relationship, and you can even place customers appropriately to lower risk and spend less money on monitoring them.

Have you seen more institutions getting creative with their KYB processes to address unique vulnerabilities (using alternative data such as revenue over time, etc.) vs. ticking the box? Do you have examples?

What I've seen, and I'm really excited by, is a new kind of trend: iterative KYC/KYB. For example, let’s say I – Luke – am a business owner. So first we box check: we know that Luke owns the company, it’s a company that sells wooden statues at the market, and we know Luke isn’t sanctioned. Then you wait and see, because a lot of the time, it's gonna be fine and you don't need to really worry about that customer. But then you would monitor the gross payments volumes processed through the account and do a more in-depth review (after unusual behavior). That, to me, is a great way of layering on friction, but only when it's necessary, because it means that your upfront customer doesn't necessarily have those later layers of friction, and we only apply it as it becomes understood to be necessary. It makes more sense to do due diligence iteratively.

We’re seeing a rise of no-doc loan scams on TikTok – where users are suggesting taking advantage of low-friction KYB/KYC processes. What are your thoughts on banks offering no-doc loans and their viability for financial institutions?

Start with the fact that TikTok is just an absolute breeding ground for scams, right? So a lot of the time what you'll see on there isn't real. But there are exciting things banks can do with technology or with a voracious risk appetite. You can do things that you can't do with a conservative risk appetite or less impressive technology.

My job as a compliance person is never to sort of say what we should or shouldn't do from a product perspective. But look at something kind of analogous: Buy Now, Pay Later. As an industry, it was a little bubble that sort of popped up… and it was essentially a no-doc loan, because it had nothing to do with creditworthiness and there were no credit checks. The whole idea with Afterpay and Buy Now, Pay Later was that it was going to change the world. It was going to eat debit and credit cards alive. Afterpay was supposed to be this massive, disruptive thing. But right at the moment, it's spawned a bunch of competitors that went out of business. It's still a tiny percentage of the lending ecosystem as a whole.

So my question (for no-doc loans) would be, why? Is the business case there? Because often these flashy fintechs will launch and they get amped up. Everyone loves it. VCs have a huge part to play in it. I think that the growth isn't there for these yet. But the stuff that makes me impassioned about these things is financial inclusion. You've got that broad risk sitting right next to financial inclusion. It's a worthwhile endeavor. It's worth looking at.

Enigma offers third-party data on businesses to help FIs understand businesses’ financial health before they even apply. Can alternative data like this make financial products that reduce friction or increase inclusion safer?

I really do think so. The same startup where I learned about the iterative KYB approach had something really interesting where they would serve an underbanked segment of the market, get comfort over time, and then they would sell that data to banks. The banks found that the loans they initiated with the backing of that company that had taken a chance on them had significantly higher repayment rates. I don't think anyone should underestimate the transformative power of this kind of thing. It is very, very much possible. I think that some tremendous percentage of the world is still basically unbanked in terms of lending services and it's more of a data problem than anything.

Enigma offers third-party data on businesses to help FIs understand businesses’ financial health before they even apply. Do you think this has value for an FI’s KYB process?

That's a huge competitive advantage for a business to have. I think it's a tremendous advantage to have KYB done automatically as opposed to partially automatically and as opposed to manually. It's just nuts the amount of money that we spend checking boxes when there are fantastic, vendor-driven solutions in market that banks can rely on.

Enigma also partners with companies like Alloy to serve as one data point of many to automate KYB in what we describe as a data waterfall. What are your thoughts on combining multiple solutions in KYB?

I like the waterfall approach. Waterfall is an interesting way of putting it – I've never thought of it that way before. Some of my favorite businesses are aggregators, and I think that it's a really efficient way to launch and test. Because at the moment, if I want to test a regtech provider for transaction monitoring or a data source for KYC/KYB on my own, I have to go and build a process and/or an API integration, probably both.

Whereas I can be more experimental if I have (a waterfall provider) that can say let's run both of these (sources) in parallel, and I'll just switch off whichever one's less performing. I think that that's the kind of experimentation and technology adoption that banks need to look at if they want to remain competitive.
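
To make the "run both in parallel and switch off the weaker one" idea concrete, here is a small Python sketch. The two sources are hypothetical stand-ins with hard-coded records, and match rate is only one assumed way of judging which source performs better; a real evaluation would likely weigh accuracy, cost, and latency as well.

```python
from typing import Callable, Optional

# A "source" is any callable that returns a record for a business name, or None.
Source = Callable[[str], Optional[dict]]


def source_a(name: str) -> Optional[dict]:
    # Hypothetical stand-in for one vendor integration.
    known = {"Acme Bakery": {"registered": True}}
    return known.get(name)


def source_b(name: str) -> Optional[dict]:
    # Hypothetical stand-in for a second vendor integration.
    known = {"Acme Bakery": {"registered": True}, "Birch Cafe": {"registered": True}}
    return known.get(name)


def hit_rates(sources: dict[str, Source], applicants: list[str]) -> dict[str, float]:
    """Run every source on the same applicants and report the share each matched."""
    return {
        label: sum(src(a) is not None for a in applicants) / len(applicants)
        for label, src in sources.items()
    }


def waterfall(sources: list[Source], name: str) -> Optional[dict]:
    """Try sources in order and return the first answer that is not None."""
    for src in sources:
        record = src(name)
        if record is not None:
            return record
    return None


applicants = ["Acme Bakery", "Birch Cafe", "Cedar Gym", "Dune Tattoo"]
rates = hit_rates({"a": source_a, "b": source_b}, applicants)
best = max(rates, key=rates.get)
print(rates, best)
```

Because every source shares one call signature, swapping a provider in or out is a change to a list rather than a new integration, which is what makes this kind of experiment cheap.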

Learn more about how Enigma can help your institution combat financial crime – with lower overhead and more automation – or check out Part One of our conversation with Luke Raven.

The Changing Financial Crime Landscape Part 1: Online Crime, Rising Fines, and Data Breaches

Maile McCann — Mon, 30 Sep 2024 00:00:00 GMT

Enigma chatted with Luke Raven – an expert in AML, compliance, and financial crimes – about how financial crime is changing, the rise of online pig butchering schemes, and the efficacy of fines on FIs to combat financial crime. This is Part One of a two-part interview with Raven; check out Part Two here.

This interview has been edited for clarity and length.

Can we chat about the differences in financial crime five years ago compared to now? How have rates been rising and how have the types and sophistication of those crimes changed?

Money laundering generally flows from an offense – you commit a crime, you generate proceeds of that crime, and then you need to make it look legitimate. And I think that the Covid-19 pandemic had such an impact on businesses – including organized crime groups.

So predicate offenses have changed. If you had asked me 5 or 10 years ago, “Hey, Luke, what bad guys are you after?” I was after drug dealers, cartel members, organized crime groups, and money laundering syndicates. And then came Covid-19 and it forced all of these sort of old school criminal empires to adapt and to make themselves online businesses.

Fraud and scams have become such a prolific topic. We had a downgrade in the losses last year in Australia to $2.7 billion from over $3 billion from scams the year before, but that is still five times as much as it was five years ago. It’s slapping people in the public in the face: everyone knows someone who’s been scammed and everyone’s worried about their grandma or their kids or their cousins or their mom.

It's very interesting, because at the same time as that has happened, all of the financial services have evolved over this time as well to be more online, more digital, and constantly making leaps and bounds in that regard. So that progress enables that cross-border moving of funds more instantaneously. Everyone wants to remove friction from onboarding because, of course, there's competitive advantages in offering your product to market with the least amount of clicks and the least amount of friction for customers – delivering Ferraris compared to horses and carts in terms of the capabilities you get – but that includes as a money launderer.

So I think it's chalk and cheese. Obviously, there are still massive drug problems and organized crime groups, but they're increasingly turning to cyber crime and we need to turn our minds to that. And it's fascinating, as well, because five years ago, I was very easily able to root out amateur money launderers, someone dealing a small commercial quantity of drugs, for example. It was easy to find them, because it was all cash. But now we are dealing with organized crime groups, unorganized crime groups like solo attackers, and even nation-state-sponsored groups all in the same realm. It’s an incredible transformation and I think that we really need to get our skates on as an industry to keep up with it.

Have you seen any particular interesting examples demonstrating the changes in financial crime?

We're living in the age of data breaches. That is the new trend. You no longer rob a bank to get a bag full of cash and a bullet for your troubles. You go online and you attack the bank’s customers. Pig butchering scams – and I hate that term – are very well known already, but are gaining impetus.

This rise in money laundering seems to be corresponding to a rise in fines on financial institutions – is this a good way to combat this sort of crime?

I don't think it is effective. The problem with it comes from the word itself: compliance is something you do because you have to do it. If that's the only reason you're doing it, you typically try for a least-compliant product approach, 50 out of 100. The problem with organizational goals like these is that it's much easier to get an A+ by aiming for 100 out of 100 and then falling slightly short. But many are aiming for the bare minimum.

So I'm a huge proponent of meaty fines when they're appropriate, but they really should be aligned with intentionality, deliberate flouting of the law. You've had some really interesting cases in the US recently – the world's largest crypto exchange and things along those lines – and I think that that is really appropriate. But here's what's lacking in the industry overall: we've got these sticks, right? Where are the carrots?

Where's the incentive to improve if someone is doing an okay job and they're probably not going to get fined, or even if they're doing a pretty subpar job but it's not egregious? You can look back at the fines against some of the world's largest institutions: they're repeat offenders because they don't really have an incentive to do this. They're just trying to do it well enough to not get whacked right now, and I think they need a carrot from the regulators.

We haven't seen a lot of creativity in this space from legislators and I think that we need it. The other thing that feeds into this is that fines would be a more effective mechanism if we really, really funded our public sector more. I saw a recent article saying that the US compliance authorities are going to have a $2 billion budget – including the IRS and everything – going forward, which is great and it sounds like a lot to a layperson. But that budget pales in comparison to the financial services industry. So it's a little bit like roulette.

If you're a senior banking executive, you have a million priorities to balance: you need to grow your customer book, you need to return value to shareholders, and you have these historical compliance issues. What incentive do you have to be the person that says, “You know what? Knock it all down. Let's start again. Let's do this. It's going to cost money, it's going to cost customer friction, but we're going to be leaders in the space of financial crime.” An executive could do that but why would they when they can just go, “Well, I didn't do this and it's not great, but we probably won't even get found out.” I think that that's a real problem with this enforcement approach.

Are there carrots for FIs to fight financial crime beyond what the government provides them? Why should FIs care about investing in better AML processes and can they make AML a competitive advantage?

I think that's a great question and it goes back to my initial answer about how fraud has become much more prevalent. I tend to think that trust is the new customer service. But trust and safety are only competitive advantages if you make them that way: if I never get defrauded, and I never suffer at the hands of bad guys and I don't blame my bank, then potentially I don't even realize that lack of friction in my life. It only becomes a problem when it becomes a problem and then I complain.

But what banks are moving to more and more is talking about AML, because they have to invest all this money in it anyway. And what I see is a really virtuous circle here, where talking about it more brings more awareness to the public. Customers start to make decisions based on where they find their money is safe and they are more understanding of compliance challenges. This strategy is the kind of thing that I think that banks need to steer into, but it is an intentional and a strategic shift that you have to make if you want to make financial crime a competitive advantage. It doesn't become one by doing it silently.

Learn more about how Enigma can help your institution combat financial crime – with lower overhead and more automation – or check out Part Two of our conversation with Luke Raven.

Beauty and Personal Care - Annual Spend on Beauty Services, by Person

Enigma — Wed, 18 Sep 2024 00:00:00 GMT

Annual beauty spending per person varies greatly by state. While Bay Staters are willing to spend $337 on beauty services annually, Wyomingites keep their beauty costs under $100 on average.

Enigma's data can help you learn more about average ticket size across different states to help you better understand retail segments and target and tailor offers regionally.

Beauty and Personal Care - Competitive Penetration of Salons by State

Enigma — Wed, 11 Sep 2024 00:00:00 GMT

Colorado, Florida, Washington, Oregon, and New Hampshire have the highest number of salons per thousand residents – between 167 and 183 salons.

The least salon-saturated states are Indiana, Minnesota, Kentucky, Mississippi, and West Virginia, ranging from 88 to 113 salons per thousand residents.

Enigma's data can help you learn more about the landscape of businesses across different cities and states to help you better target and tailor offers regionally.

Beauty and Personal Care - Beauty Services Average Ticket Size

Enigma — Thu, 05 Sep 2024 00:00:00 GMT

We have seen a dramatic 35% increase in spend per beauty service visit from January 2019 to January 2024 in the US. This category is largely focused on hair, nails, makeup, and spas.

Enigma's data can help you learn more about average ticket size growth or decline to help you better target and tailor offers to industries and customers that matter to you.

Enigma Product Updates: Q2 2024

Eliza Cooke-Yarborough — Thu, 01 Aug 2024 00:00:00 GMT

Two quarters after the initial launch of our KYB product, we’re continuing to search for ways to make onboarding businesses more efficient and less risky for our customers. Over the quarter, we adjusted our Identify package to include legal entity addresses, names, and people information to make identification of businesses easier for our customers. We also added a new package – KYB + TIN Verification.

TIN Matching

Since launch, we’ve heard the need from our customers for validating Taxpayer Identification Numbers (TINs) to reduce the risk of onboarding fraudulent businesses. All valid traditional business entities will have an Employer Identification Number (EIN) issued by the IRS, and these are granted at formation. Sole proprietorships without EINs, meanwhile, can use their SSN or ITIN as their TIN. Because these TINs are granted at formation, they are a valuable tool for screening businesses early in the onboarding funnel.

Enigma now offers TIN screening to assist our customers with this component of KYB. We focus our efforts on entities with an EIN. When running a TIN verification, we can either confirm a match or, if the match fails, communicate that a business has given you either a Business Name and TIN that don’t match IRS records or a TIN that does not exist.
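
The three possible outcomes described above (a match, a name and TIN that don't match, or a TIN that doesn't exist) can be modeled in a few lines of Python. This is only a sketch of how a caller might represent those outcomes: the `REFERENCE` table, the `verify` function, and the exact-match name comparison are all hypothetical and do not describe Enigma's API or matching logic. The one grounded detail is that an EIN is nine digits, conventionally written XX-XXXXXXX.

```python
import re
from enum import Enum
from typing import Optional


class TinResult(Enum):
    MATCH = "name and TIN match"
    NAME_MISMATCH = "TIN exists but the name does not match"
    TIN_NOT_FOUND = "TIN does not exist"


# Hypothetical reference data standing in for the records a real check uses.
REFERENCE = {"123456789": "ACME BAKERY LLC"}


def normalize_ein(raw: str) -> Optional[str]:
    """Accept '12-3456789' or '123456789'; return nine digits or None."""
    m = re.fullmatch(r"(\d{2})-?(\d{7})", raw.strip())
    return m.group(1) + m.group(2) if m else None


def verify(name: str, raw_tin: str) -> TinResult:
    """Map a (name, TIN) pair onto one of the three outcomes."""
    ein = normalize_ein(raw_tin)
    if ein is None or ein not in REFERENCE:
        return TinResult.TIN_NOT_FOUND
    # Assumption: a simple case-insensitive exact comparison of names.
    if REFERENCE[ein] != name.strip().upper():
        return TinResult.NAME_MISMATCH
    return TinResult.MATCH


print(verify("Acme Bakery LLC", "12-3456789").value)
print(verify("Other Co", "123456789").value)
print(verify("Acme Bakery LLC", "98-7654321").value)
```

Keeping the two failure modes separate matters downstream: a name mismatch may just be a typo worth a retry, while a nonexistent TIN is a stronger warning sign.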

Q2 Sales & Marketing Updates

Talking to our customers, we hear two common themes come up when discussing how to run effective sales and marketing campaigns: they need to narrow in on the right prospects to sell to, and reach out to them at the right time.

Card Not Present Revenues

When narrowing in on the right prospects to sell to, our customers also need to be able to prioritize the highest value prospects for them within their universe of potential leads. Enigma’s card revenue data, sourced from a panel of more than 40% of all US consumer credit and debit card transactions, is a powerful way to prioritize high value leads. However, a question we have often been asked by customers interested in US ecommerce retailers is what % of these card revenues are made up of online spend. While website traffic data points such as page visits can provide directional insight, they cannot capture other important drivers of increased online spend such as higher average transaction size.

This quarter we have productionized card not present revenues as an additional attribute to help our customers prioritize the highest value leads within their e-commerce ICP. Card not present revenues include card revenues from online transactions, phone orders, mail orders, purchases made with a card on file - such as a subscription or membership-related spend, and invoice payments. Card not present revenues are not equivalent to online spend - and in some businesses will be very far from it. For example, a golf club allowing members to place charges for meals, lessons and shop purchases to the card on file for their membership will have very high card not present revenues, but a good proportion of these may be offline payments. However, for customers selling to traditional B2C retailers within the subset Enigma has already identified as having online payment capabilities, card not present revenues can be used to help prioritize leads, as a proxy for online revenues from purchases made on the business’ website. This helps improve the efficiency of our customers' sales campaigns, and ensure they are focusing on the highest value e-commerce retail leads first, or directing these leads to the relevant sales territories.

New Business Triggers

In addition to our range of attributes for segmentation and prioritization, Enigma also provides triggers for sales and marketing campaigns. One of these is our new business trigger, which identifies and generates a lead list of businesses that have launched within the last year. This helps our customers who are selling a product that is either most relevant to newly formed businesses, or to businesses that are unlikely to change from an initial provider. This is often relevant for core payments software and organizational software products. Enigma serves these customers by identifying when a business was first launched (via registration filings), and if and when they have started selling to customers (via our consumer card panel).

Last quarter, we talked about applying a fine-tuned BERT model to the text on a business’ website to accurately predict the industries of businesses in our product. This quarter we applied a similar model to the businesses in our new business trigger marketing list, and we now have an industry for around half of the businesses in our new business trigger list. This helps our customers include or exclude certain types of businesses from their new business trigger campaigns, improving the efficiency of their marketing campaigns.

Want to learn more?

Current customers can reach out to their CS representatives with questions and feedback. If you’re new to Enigma and interested in our Sales and Marketing or Onboarding and KYB products, please get in touch.

Retail - 2023 US Revenue Growth

Enigma — Tue, 16 Jul 2024 00:00:00 GMT

Top-growing retail subcategories vary by state. Consumers are fixing cars, rather than buying new ones amid a chip shortage affecting new car production: California saw revenue from motor parts dealers rise 15%, while Ohio saw tire dealer revenue growth up 81%.

In Michigan and North Carolina, meanwhile, accessories posted outsized growth.

Enigma's data can help you learn more about revenue growth in different cities and states to help you better target and tailor offers to the prospects and customers that matter to you.

Retail - Top Growing Retail Subcategories

Enigma — Tue, 09 Jul 2024 00:00:00 GMT

Pharmacies and motor vehicle dealers led the pack in US retail growth in 2023 at 5% and 4% respectively. Warehouse clubs, used car dealers, and women's clothing stores round out the other top five categories.

Enigma's data can help you learn more about growth of revenue across sub-industries to help you better market towards growing segments.

Retail - 2023 YoY Organic Grocery Revenue Growth

Enigma — Tue, 14 May 2024 00:00:00 GMT

Organic and healthy grocery stores experienced outsized growth during the pandemic with YoY revenue up as much as 15% in Vermont in 2020 and 16% in Minnesota in 2021.

While average growth slowed in 2023, Vermont and Minnesota continued to be pack leaders and Iowa, Florida and Oregon also saw an acceleration in this category.

Enigma's data can help you learn more about revenue growth of businesses in niche industries in different regions to help you better target, segment and engage prospects and customers.

Retail - % of Category-Specific Businesses with E-Commerce Channel

Enigma — Tue, 14 May 2024 00:00:00 GMT

Post-pandemic, apparel companies are continuing to invest in e-commerce. However, not all apparel categories are created equal.

More apparel businesses focused on bags, swimwear and baby clothing offer an online channel. Tuxedo, bridal, and orthopedic shoe businesses, meanwhile, are less likely to offer online shopping for their specialty products, with under 5% of these sorts of businesses offering e-commerce services through their own brand channel.

Enigma's data can help you learn more about the online/offline revenues of businesses across different industries to help you better target, segment and engage prospects and customers.

Retail - San Francisco Retail Closure and Shoplifting Rates

Enigma — Tue, 14 May 2024 00:00:00 GMT

As shoplifting increases, store closure rates rise. Enigma’s data – tracking both store closures and openings – found peaks in closures in late 2021 and late 2022 in San Francisco. These closures follow a similar pattern to the Council on Criminal Justice’s shoplifting data in San Francisco.

In November 2021, for example, store closure rates doubled month-over-month while shoplifting increased from 38% to 60%. In December 2021, however, both closures and shoplifting significantly decreased to 4% and 42%, respectively.

Enigma's data can help you learn more about the businesses that are opening and closing in the industries you care about and can be used alongside other data sets -- like the Council on Criminal Justice's shoplifting data -- to help you get a better picture of a business, industry, or city.

Retail - Top Payment Processor For Selected Retail Area

Enigma — Tue, 14 May 2024 00:00:00 GMT

Payment processor dominance varies by retail category. 46% and 34% of clothing stores, for example, use PayPal and Shopify, respectively, while under 10% use other processors like Adyen, Braintree, Square, or Stripe. However, in electronics, Stripe is the second most popular processor behind PayPal.

Enigma's data can help you learn more about payment processor penetration of businesses across industries to help you better target and tailor offers to prospects and customers.

Introducing Enigma Risk and Underwriting

Enigma — Tue, 07 May 2024 00:00:00 GMT

According to the Small Business Administration, two out of three business owners who seek credit do not receive what they need. One of the biggest barriers is surprising: not bad applicants with high risk, but instead sparse, thin, and inaccurate data on applicants.

What is Enigma Risk and Underwriting?

Enigma Risk and Underwriting offers accurate intelligence about the identity and financial health of 16 million+ businesses to help you dynamically manage risk. Enigma Risk and Underwriting is a pre-permissioned data set. Enigma requires zero opt-in and can be used across any portfolio of business applicants.

Why ask a small business for their bank account information - and lose 70% of your applicants – upfront? We help you facilitate an offer to your small business prospects, before asking for additional information. We help you see that prospect’s card transactions from a third party, versus what the merchant itself chooses to provide to you.

The results with new customers and evaluations are clear: Enigma can help fuel profitable growth by approving 20-30% more applicants and granting higher credit lines to those approved. Our data is also used to cut out pockets of riskier populations that our clients previously had no insight into. The coverage of our data is high with the ability to match to 70-80% of our client's portfolio.

We support the efforts of business lenders with:

Prequalification:
Business credit card & loan approval decisions: By using Enigma’s firmographic and merchant transaction attributes, credit card and line issuers are able to predict delinquency before signing on customers.
Credit line issuance: For less risky customers, Enigma helps you offer larger credit lines with more confidence.

For payment processors we help with:

Merchant cash advance underwriting: By using Enigma’s card revenue data, MCA underwriters are able to give larger advances upfront without the need for a long history of processing data.
Chargeback risk and friendly fraud: By using attributes like the presence of an Enigma match, customers can better understand chargeback risk and friendly fraud.

How Might Enigma Risk and Underwriting Help You?

We’ve worked alongside our customers over the last year to understand how a general data asset like Enigma’s can aid with a variety of use cases. We understand that different features of our data asset have different importances across different products and businesses, and that some feature transformation is necessary. We’ve done that exploration ourselves and have it in our evaluation guide, so you can accelerate your evaluation. We explore the features that seem to have the most signal across a variety of use cases below.

Use Case 1: Business card and line issuers seek to decrease delinquency and offer higher credit lines

Business card issuers within large financial institutions have a difficult time underwriting business cards for applicants - usually, they must rely on personal guarantees from the business owner and information from the credit bureaus and existing accounts.

We met with some of our customers in this space – three of the top ten business card issuers issuing lines between $5,000 and $100,000 – in order to see if we could help them decrease chargeoff rates of applicants with thin files and offer higher credit lines to lower risk customers.

For this use case, these features were found to be the most powerful:

Average monthly revenue
Average transaction size
Months of merchant transactions history
Months since last calendar year’s lowest revenue month

By using Enigma’s data, our customers were able to approve ~20% more small businesses for which there were thin files, without asking new applicants to connect bank accounts. Within all approved accounts, Enigma features were used to segment 15% of the population with 57% lower chargeoff rates - a population ripe for greater line sizes.

Use Case 2: Payment facilitators and processors extend merchant cash advances with Enigma

Processors that offer MCAs generally need at least 6 months to a year of history prior to underwriting a loan - even then, they may only see a portion of the card revenues of a business (e.g. a business might use different processors/payfacs for online vs. offline transactions).

Multiple large payment facilitators and processors that offer merchant cash advances came to Enigma with the goal of offering larger loan sizes for brand new customers and determining the probability of default for brand new customers.

Combining a variety of product features, we landed on average transaction size and months of merchant transactions history as two key signals for underwriters to write larger loans sooner for business clients, as well as use this knowledge to turbocharge pre-qualification and prospecting efforts. All the major processors use us as a core data set for revenue-based underwriting or MCAs, with one saying “We built our core MCA model with your data.”

Use Case 3: Payment facilitators and processors seek to decrease chargeback and friendly fraud risk

Non-lending credit risk incidents, including chargeback risk and friendly fraud, put wholesale ISOs, payment facilitators, and major card processors at risk with card networks. Moreover, if the business shuts down, the merchant processor may be on the hook for chargebacks instead. This risk is especially high during initial onboarding, when the processor has very little information about a particular merchant, especially as onboarding processes become increasingly automated.

Payment facilitators and processors of different sizes need Enigma data to help reduce non-lending credit risk for new merchants (e.g. chargeback risk or friendly fraud). For this use case, these features were found to be the most powerful:

Average monthly revenue
Revenue and transaction growth trends
Refund amount
Name of previous processor
Performance against benchmarks by industry and geography

These features were effective at splitting risk and the results were intuitive. At a high level, we found that a subpopulation comprising half of all onboarded merchants within the last two years had a 40-50% higher chargeback incidence rate if they had no match on any business OR no credit card transaction presence in the last 12 months. More granularly, a subpopulation comprising ~1% of the total was found to have 5x the credit risk incidence rate if they were found in certain Enigma-provided industries and had an abnormal amount of card refunds.
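
For readers who want to see what a rule like "no business match OR no card transactions in the last 12 months" looks like in practice, here is a minimal Python sketch. The field names and the toy portfolio are made up purely to exercise the functions; they are not Enigma attribute names, and the numbers it produces have nothing to do with the results reported above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Merchant:
    merchant_id: str
    has_business_match: bool                    # hypothetical field name
    months_since_last_card_txn: Optional[int]   # None = no card activity seen
    had_chargeback: bool


def is_flagged(m: Merchant) -> bool:
    """No business match OR no card transactions in the last 12 months."""
    no_recent_cards = (
        m.months_since_last_card_txn is None or m.months_since_last_card_txn > 12
    )
    return (not m.has_business_match) or no_recent_cards


def incidence(group: list[Merchant]) -> float:
    """Share of merchants in the group with at least one chargeback."""
    return sum(m.had_chargeback for m in group) / len(group)


def relative_lift(flagged: list[Merchant], rest: list[Merchant]) -> float:
    """How much higher the flagged group's incidence is, as a fraction."""
    return incidence(flagged) / incidence(rest) - 1.0


# Made-up toy portfolio, only to show the mechanics.
portfolio = [
    Merchant("m1", False, 2, True),
    Merchant("m2", True, None, True),
    Merchant("m3", True, 20, False),
    Merchant("m4", False, None, False),
    Merchant("m5", True, 1, True),
    Merchant("m6", True, 3, False),
    Merchant("m7", True, 0, False),
    Merchant("m8", True, 5, False),
]
flagged = [m for m in portfolio if is_flagged(m)]
rest = [m for m in portfolio if not is_flagged(m)]
print(len(flagged), len(rest), relative_lift(flagged, rest))
```

A lift of 0.4 to 0.5 from `relative_lift` would correspond to the "40-50% higher" phrasing used in the text.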

Not So Friendly Fraud: The Growing Chargeback Problem

Enigma — Tue, 23 Apr 2024 00:00:00 GMT

Friendly fraud or chargeback fraud is when a cardholder disputes a purchase on their transaction statement, despite the transaction being legitimate. Friendly fraud can occur in a multitude of ways – from a consumer who reports a package missing only to find it a few days later (and fails to report that the first package was delivered) to a customer who regrets a high price-tag purchase and disputes the charge to get a return.

Friendly fraud is quite common – 1 in 4 customers openly admitted to engaging in friendly fraud in their chargebacks – and rates are rising: a 2023 survey of retailers reported a 19% year-over-year rise in the practice.

Enigma chatted with Ashley Isenberg, leading payments and fintech advisor and Finix alum, about how friendly fraud affects payment processors, why rates of friendly fraud might be rising, and how Enigma’s data might be able to solve the problem.

This interview has been edited for length and clarity.

A Friendly Q&A

Can you give us a few examples of friendly fraud?

Friendly fraud is when somebody intends to make a transaction and then later makes the decision to charge back that purchase. Think about someone signing up for a subscription they decide they don’t want and instead of reviewing the membership policy and asking for a refund, they just call the number on the back of their card to dispute the charge. Or if somebody overspends that month, their solution is to go back to their ecommerce purchase – something that they did not buy in person, something that was delivered – and charge those transactions back. There's also niche cases where someone may have paid for something that's a must-pay industry on behalf of somebody else, and now they're charging it back.

Step by step, what happens after a consumer initiates a chargeback? What is the series of events that follows?

The cardholder or the consumer will call the number on the back of their card or will log into their bank online and they will get a cue to say “I want to dispute this charge.” When a consumer is going through their bank’s site, or calling, the bank will ask them a series of questions like “Did you lose your card?”, “Did someone else take your card and spend this?”, or “Did you not receive the product?” The consumer will identify one of those reasons, declaring product not received or poor service, and then the card issuer is going to start the request.

What's challenging about chargebacks is that while the issuer is starting the request, the acquirer has no visibility into that transaction. Acquiring transactions and issuing transactions happen on two sides of the networks, and those two sides don't communicate. By the time the merchant gets a notice that says this chargeback is being disputed, they already have to provide evidence.

So the evidence is extremely important: how you collected the transaction, timestamp, signed agreements, proof of delivery, a picture of that product delivered. The merchant has an opportunity to send that evidence back, and the issuer will review it and say either “we don’t agree with the cardholder, there is no transfer of liability to the merchant” or “we agree with the cardholder and we’re going to transfer the liability to the merchant.”

At that point, the funds are typically already with the processors and out of the merchant's account. If they are found liable, the funds just won't process back. If they are found not liable for the transaction – that it was a good transaction – then the funds will be moved back into their account.

Friendly fraud sounds like it could just be a problem that merchants have to solve with their consumers. Why are processors and payments companies so concerned about this?

One reason that this matters is that processors’ merchants in aggregate need to be below a certain chargeback threshold in order for that processor to be sponsored with all the networks. In most cases, it's under a percent. If a processor has multiple merchants that are hitting those thresholds, the processor in aggregate could itself also start to get flagged by the networks and have restrictions, audit requirements, and monitoring programs put on it.

The second reason is that liability rolls uphill. Say the merchants on the processor don't have the capital to cover the chargebacks: they're on the verge of bankruptcy, or they're a small business that just made payroll last week and doesn't think it will have the money to cover that chargeback, or they're truly fraudulent and they close their account and walk away. That liability rolls over to the processor: it doesn't go to the network.

Risk sits with the merchant first. If the merchant is unable to cover that liability, then it goes to the processor. If the processor is unable to cover that liability, it falls to the bank. The networks don't take on any of that risk.
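
The order of liability described above can be sketched as a simple waterfall. This is a minimal illustration, not how any processor actually books losses: the function name, the dollar figures, and the idea of a single "available funds" number per party are all assumptions made for the example.

```python
def allocate_chargeback_loss(loss, merchant_funds, processor_funds):
    """Split a chargeback loss across merchant, processor, and bank.

    Each party covers what it can and the remainder rolls uphill.
    The card networks do not appear because they take on none of the risk.
    All amounts are hypothetical dollars.
    """
    merchant_share = min(loss, merchant_funds)
    remaining = loss - merchant_share
    processor_share = min(remaining, processor_funds)
    bank_share = remaining - processor_share
    return {
        "merchant": merchant_share,
        "processor": processor_share,
        "bank": bank_share,
    }


# A merchant that can only cover part of the loss pushes the rest uphill.
example = allocate_chargeback_loss(
    loss=50_000, merchant_funds=10_000, processor_funds=100_000
)
print(example)
```

In this made-up case the merchant covers 10,000, the processor absorbs the remaining 40,000, and nothing reaches the bank.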

Can you give an example of this?

Furniture stores are historically very high risk verticals. The reason is because they sell large ticket items that customers pay for in advance and that, in some cases, may still be getting built to deliver later.

There have been a few cases in the last two years that I've heard of where a company has come to a processor and said, “We're a furniture company, we do custom furniture.” That company then charges people for a bunch of orders, closes their account, and walks away. In some cases, these are losses upwards of 50 million.

Processors have as much, if not more responsibility, to educate their merchants as merchants have to have a good relationship with their customers and clear communication on refunds. A merchant’s job might be to sell widgets online, for example, but our job as a processor is to help them.

Why are the rates of and dollar amounts of chargeback fraud rising?

Payments continues to digitalize. People who historically used cash are moving on to digital forms of payment. Even during a downturn, you don't see payments decrease – that digital conversion is still growing.

The second piece is that we just went through a huge period of inflation and are going through an adjustment in the market. The majority of the US probably can't afford half of the stuff that they need to buy. I'm seeing chargebacks in insurance through the roof, for example. People are charging back essentials and I think that's just because people can't afford the cost of living.

How have chargeback fraud and the ways of handling it changed over time?

It was harder to issue a chargeback before as a consumer. Banks have made that easily accessible, and banks and issuers serve the cardholder: they don't serve the merchant. You can literally just log into your banking portal and decide “I'm gonna charge that back.” Historically, you had to call or fax in evidence to the bank or issuer to even start that chargeback process. So I think actually, by streamlining that process within the issuer, we've made it more convenient. So in that way, I don't think progress has been positive.

If we're going to make chargebacks easily accessible for consumers, then we need to change the rules for merchants. But nothing's changed on the merchant side to make it easier to argue against a fraudulent case except for the fact that they can now upload documentation instead of faxing it in. And now the funds are automatically drafted from merchants’ accounts and held in an FBO account.

There are processors offering merchants automated chargeback handling as a value proposition but … it’s essentially a bloated insurance policy.

Some payment processors charge higher fees to merchants that have a lot of chargebacks or remove services from them altogether. But there are cons to refusing too many merchants for fear of chargebacks. What is the solution there?

You have to load balance the risk. If you're going to take a customer that has higher chargebacks, that you know is riskier, that's going to take some additional work, you have to have a lower risk business to offset it.
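
To make the offsetting idea concrete, here is a small sketch of a blended chargeback rate across a portfolio. The merchants, the counts, and the 1% ceiling are hypothetical (the interview only says the threshold is usually "under a percent"), and the rate is assumed to be chargebacks divided by transaction count.

```python
AGGREGATE_THRESHOLD = 0.01  # assumed 1% ceiling, for illustration only


def chargeback_rate(merchant):
    """Chargebacks as a share of transactions for one merchant."""
    return merchant["chargebacks"] / merchant["transactions"]


def blended_chargeback_rate(merchants):
    """Portfolio-wide rate: total chargebacks over total transactions."""
    total_transactions = sum(m["transactions"] for m in merchants)
    total_chargebacks = sum(m["chargebacks"] for m in merchants)
    return total_chargebacks / total_transactions


# Hypothetical portfolio: one high-risk merchant offset by a low-risk one.
portfolio = [
    {"name": "custom furniture", "transactions": 10_000, "chargebacks": 150},
    {"name": "grocery", "transactions": 90_000, "chargebacks": 270},
]

furniture_rate = chargeback_rate(portfolio[0])   # 1.5%, above the ceiling alone
rate = blended_chargeback_rate(portfolio)        # 420 / 100,000
print(f"furniture: {furniture_rate:.2%}, blended: {rate:.2%}")
```

On its own the furniture merchant sits above the assumed ceiling, but the larger low-risk book pulls the blended rate well under it.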

It is very specific processor by processor on how they're going to make those decisions. However, it’s more challenging for merchants and processors working in higher risk verticals. High risk typically means more chargebacks, more reputational risk, and thus difficulties for higher risk verticals to get support from processors. There’s going to have to be processors that eventually come out and figure out how to balance that risk or build tools to prevent those chargebacks.

Does it matter to processors if merchants are at a chargeback rate closer to .6% or .7% vs. .8%? Is there a marginal difference between these merchants versus merchants with slightly higher chargeback rates?

It depends on a processor’s contract with a network and it depends on the network. You might see a merchant has a .6% chargeback rate, but when you break it down by individual network, their rates are much higher with Visa than they are with Amex, for example. Each individual network is going to take action on chargebacks. So if a merchant can stay at that .6% across networks, then they're managing it. But if they're seeing 90% of that .6% hitting one network, they're probably going to get flagged by that network, which is going to cause them to get shut down. Even if it’s just one network that flags them, that merchant would get shut off from processing as a whole.
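
A worked example of the point above: a merchant at .6% overall can still be well over the line on one network. The transaction counts, the even split of volume between the two networks, and the 1% flag level are assumptions chosen to make the arithmetic easy to follow.

```python
NETWORK_FLAG_RATE = 0.01  # hypothetical per-network flag level

# Hypothetical merchant: 10,000 transactions and 60 chargebacks (.6% overall),
# with 90% of those chargebacks landing on a single network.
transactions = {"Visa": 5_000, "Amex": 5_000}
chargebacks = {"Visa": 54, "Amex": 6}

overall_rate = sum(chargebacks.values()) / sum(transactions.values())
per_network = {
    network: chargebacks[network] / transactions[network]
    for network in transactions
}
flagged = [
    network for network, rate in per_network.items()
    if rate >= NETWORK_FLAG_RATE
]

print(f"overall: {overall_rate:.2%}")
for network, rate in per_network.items():
    print(f"{network}: {rate:.2%}")
print("flagged by:", flagged)
```

The overall rate looks manageable, but the per-network view shows Visa at 1.08% against Amex at 0.12%, which is the situation that gets a merchant flagged.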

What is the role of data and providers like Enigma in helping processors manage chargeback risk?

One of the biggest issues we see is that during the onboarding process, there's really no way to validate prior processing history. So unless somebody has been shut down for chargebacks, and didn't have money in their account, they're not getting matched as a high risk for chargebacks. For example, a prospect might sign up for an account at a payment processor and say they process 25 million when they really process 5 million, in order to get discounted pricing. A payment processor wouldn’t be able to build the right risk rules for that type of exposure.

Enigma is really valuable because, with historical data from a third party, you're going to be able to build an educated risk algorithm.

How Enigma’s Data Can Help Solve Friendly Fraud

Enigma’s team wanted to find a way to manage this sort of growing merchant risk, working under our belief that one of the best ways to manage merchant risk is to assess it up front.

We reviewed three merchants that were known to be highly risky for their processor or wholesale ISO: two insurance companies and one concert venue overselling tickets.

While Enigma data does not have chargebacks, we do have access to refunds. In partnership with our customers, we have empirical evidence that refunds are correlated with chargebacks.

When we rank the refund-to-revenue ratios of the three merchants above against all merchants, all three fall in the worst 6% for refunds as a proportion of total revenue.

Using this percentile rank on the merchants you’re considering onboarding can help provide a predictive signal about those that may be likely to over-index on friendly fraud – and that you may not want to work with. Moreover, you’d be able to compare reported data of prospects and customers to true data from Enigma, from processing volumes to transaction sizes.
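
As a rough sketch of how such a percentile rank could be computed, the snippet below ranks one prospect's refund-to-revenue ratio against a population of ratios. The population, the prospect's figures, and the function names are hypothetical, and this is not a description of how Enigma calculates its own rankings.

```python
def refund_to_revenue_ratio(refunds, revenue):
    """Refund dollars as a share of revenue dollars."""
    return refunds / revenue


def percentile_rank(value, population):
    """Percent of the population with a ratio at or below `value`."""
    at_or_below = sum(1 for ratio in population if ratio <= value)
    return 100 * at_or_below / len(population)


# Hypothetical population of 100 merchants with ratios from 0.1% to 10%.
population = [i / 1000 for i in range(1, 101)]

prospect_ratio = refund_to_revenue_ratio(refunds=9_550, revenue=100_000)
rank = percentile_rank(prospect_ratio, population)

# "Worst 6%" means at least 94% of merchants have a lower or equal ratio.
in_worst_six_percent = rank >= 94
print(f"ratio {prospect_ratio:.2%}, percentile {rank:.0f}, flagged: {in_worst_six_percent}")
```

A prospect landing in that tail would be a candidate for closer review before onboarding; where exactly to draw the cutoff is a policy decision.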

Enigma can provide near-real-time data to help you make better risk and underwriting decisions.

Onboarding & KYB Product Updates: Q1 2024

Eric Land — Tue, 09 Apr 2024 00:00:00 GMT

Supporting the Jump in New Business Formation

Annual business formations jumped up to over 5 million after the pandemic and have continued to rise since: 2023 was another record year of business formations, up 56% from the pre-pandemic baseline.

As we start the new year with these emerging businesses, financial service providers have a growing opportunity and a growing challenge. Onboarding more businesses, more quickly – amidst changing legislation – becomes a top priority.

In turn, Enigma is committed to improving our KYB and Onboarding product – launched late last year – to help FIs do just that. In Q1, we focused on making onboarding and integration of our KYB product easier via a Quickstart Guide on our Console as well as Tasks, an easily configured set of policy rules that meet your specific compliance requirements.

Q1 Onboarding & KYB Updates

Launch of “KYB Quickstart”

New customers can now log in to the Enigma Console and use the Quickstart Guide to familiarize themselves with our KYB product. This makes onboarding and integration a more seamless experience for customers.

Standardized Registration Status

We are now providing a standardized status on a business’s corporate registration across all jurisdictions. Now, customers can easily see if a business is active and in good standing in states where it does business, instead of dealing with hundreds of different statuses.
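
To illustrate what standardizing statuses means in practice, here is a minimal sketch that collapses a few raw registration statuses into a small fixed set. The raw strings, the three buckets, and the mapping itself are invented for the example and do not reflect Enigma's actual status values.

```python
# Hypothetical mapping from jurisdiction-specific wording to a standard status.
STANDARD_STATUS = {
    "active": "active",
    "in good standing": "active",
    "active/good standing": "active",
    "dissolved": "inactive",
    "revoked": "inactive",
    "forfeited": "inactive",
}


def standardize_status(raw_status):
    """Normalize case and whitespace, then look up the standard status."""
    key = " ".join(raw_status.lower().split())
    return STANDARD_STATUS.get(key, "unknown")


for raw in ["  In Good  Standing ", "Forfeited", "Merged Out"]:
    print(repr(raw), "->", standardize_status(raw))
```

Anything not in the mapping falls through to "unknown" rather than being guessed, so unfamiliar statuses surface for review.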

Launch of “Tasks”

We launched Tasks, a new feature that allows you to run standard KYB compliance checks to auto-approve more businesses with Enigma. You can now easily configure a set of policy rules that meet your specific compliance requirements using Tasks. This will also make it easier for developers to integrate with our platform.

Increasing Ease of Address Verification

We added a new task to make it easier for customers to perform address verification. Verifying the legal address and operating locations of a business can impact various decisions during the due diligence process. Compliance, underwriting, and other financial decisions often necessitate a comprehensive understanding of a business's physical presence.

Increasing Ease of Name Verification

We added a new task to make it easier for customers to perform business name verification. An essential aspect of the due diligence process involves identifying and verifying both the legal name and any DBAs associated with a business. This is crucial because a business might interact with you using its DBA, rather than its legal name. When seeking vital information about a business – such as its corporate registration – it is typically filed under the entity's legal name.

Latency Improvements

We invested in our data connections with Secretary of State departments to more reliably source corporate registration data. Improvements in our data pipeline ensure Enigma always has the freshest data possible to service customers’ KYB requests.

Legal Entity Type Surfacing

We are now surfacing the legal entity type of the business (corporation, LLC, etc.) according to the business’s corporate registration from the state in which it is headquartered. This enables customers to verify that the legal entity type on the corporate registration matches what the business filled in on its application for financial services.

New KYB Package, Without Watchlist Screening

We now have a new package, KYB_no_OFAC, that does everything in our pre-existing KYB package, minus the OFAC watchlist screening component. This new package allows us to serve our customers’ unique onboarding needs.

Matching Logic Updates

We have updated our matching logic to account for any differences in formatting, punctuation, or abbreviations between the customer input and the source data we are matching to. This normalization process enables us to provide a higher match rate for our customers.
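
The kind of normalization described here can be sketched in a few lines. This is an illustrative approximation only: the abbreviation table and the punctuation rules are assumptions, and Enigma's actual matching logic is not shown in this post.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "incorporated": "inc",
    "corporation": "corp",
    "company": "co",
    "limited": "ltd",
    "street": "st",
    "avenue": "ave",
}


def normalize(text):
    """Lowercase, strip punctuation, and collapse common abbreviations."""
    text = text.lower()
    text = re.sub(r"[.']", "", text)       # "L.L.C." -> "llc", "Joe's" -> "joes"
    text = re.sub(r"[^\w\s]", " ", text)   # any other punctuation becomes a space
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


customer_input = "Acme Widgets, Incorporated"
source_record = "ACME WIDGETS INC."
print(normalize(customer_input))
print(normalize(customer_input) == normalize(source_record))
```

Both spellings reduce to the same string, so a comparison that would have failed on the raw text now succeeds.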

Want to learn more?

Current customers can reach out to their CS representatives with questions and feedback. If you’re new to Enigma and interested in our Onboarding & KYB products, please get in touch.

Sales & Marketing Product Updates: Q1 2024

Eliza Cooke-Yarborough — Tue, 09 Apr 2024 00:00:00 GMT

For Enigma’s sales and marketing customers, we are always looking for ways we can use the huge range of data in our product to better tailor and segment businesses for targeting and outreach.

Enigma has >40 million websites in our data, and for a while we’ve been thinking about how to better leverage this data, use it to learn even more about the businesses in our product, and ultimately help our customers with better targeting for their sales and marketing campaigns.

So, I am excited to share more about our newly launched website scraping system. We are now able to scrape the content of all the business websites stored in our database. If you think about all the information we as humans can glean about a business by looking at its website, this opens up huge possibilities!

Two areas we decided to focus on initially - as important to our sales and marketing customers - were:

Using website scrapes as a feature in our industry prediction models (NAICS)
Using websites to understand whether a business has ecommerce capabilities.

Industry: NAICS predictions

Where we started: Enigma provides a NAICS code for >90% of marketable businesses in our product, with >90% precision. We provide a 6 digit NAICS code for ~60% of all marketable businesses. We wanted to use the website scrape data to help push our NAICS code coverage of marketable businesses even closer to 100%, and to ensure that even more of these were 6 digit NAICS, without sacrificing our 90%+ industry precision.

We already use machine learning models to predict NAICS codes, and while the business name and other attributes are important features in this model, we realized that the content of a business’ website is far more valuable (it is the first place I would go to try and decide which industry a business is operating in!).

As a human, if I look at a website like apple.com, it is clear to me this business is a technology company, engaged in manufacturing and selling phones, computers, watches and many other products and services (Apple TV, Apple Care etc.). An example NAICS code for this business is NAICS Code: 334210 - Telephone Apparatus Manufacturing, along with other NAICS codes related to the other product and business lines at the company.

If I try to build a simple machine learning model to predict the industry of a business named Apple, with website “apple.com”, it is not unreasonable for a model that has not been trained on this example, and has no context for what Apple does in the real world, to guess that this business might be in NAICS Code: 111331 - Apple Orchards.

Today, if I ask ChatGPT which NAICS code a business called “Apple” with website “apple.com” is in, it immediately assigns NAICS Code: 334210 - Telephone Apparatus Manufacturing - with the caveat that the business also operates in other NAICS codes (for computer manufacturing, software publishing etc.).

However, I don’t necessarily need to go to ChatGPT to get this result. In fact, doing so would be unnecessarily expensive and slow on a per-record basis. Our data scientists found they were able to achieve results on par with LLMs (models like ChatGPT) by first using an LLM to predict the NAICS code, based on a small set of website scrapes. Then they were able to use this set as training data, feeding this into a pre-trained BERT model, and apply this fine-tuned BERT model to the entire population.
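
The shape of that pipeline (an expensive "teacher" labels a small sample, then a cheaper "student" is trained on those labels and run over everything) can be shown with a toy sketch. To stay self-contained, a keyword rule stands in for the LLM and a word-count classifier stands in for the fine-tuned BERT model; both stand-ins, the sample texts, and the function names are assumptions, not the models our data scientists used.

```python
from collections import Counter, defaultdict


def teacher_label(text):
    """Hypothetical stand-in for the slow, expensive LLM call."""
    return "334210" if "phone" in text.lower() else "111331"


def train_student(labeled_examples):
    """Count how often each word appears under each teacher-assigned label."""
    word_counts = defaultdict(Counter)
    for text, label in labeled_examples:
        word_counts[label].update(text.lower().split())
    return word_counts


def student_predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    return max(
        sorted(word_counts),
        key=lambda label: sum(word_counts[label][word] for word in words),
    )


# Step 1: the teacher labels a small sample of website scrapes.
sample = [
    "buy the new phone and watch",
    "phone repair and accessories",
    "pick your own apples at our orchard",
    "fresh cider from orchard apples",
]
labeled = [(text, teacher_label(text)) for text in sample]

# Step 2: train the cheap student on the teacher's labels.
student = train_student(labeled)

# Step 3: apply the student to the rest of the population.
population = ["orchard tours and apples", "unlocked phone deals"]
predictions = [student_predict(student, text) for text in population]
print(predictions)
```

The design point is cost: the teacher is called only on the small sample, and every other record is handled by the student.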

For our customers - this means an even higher proportion of our businesses have 6 digit NAICS codes (rather than say 2, or 3 digit NAICS), as our website scrapes allow for higher confidence predictions of granular codes. This also means that some businesses for which we see card revenue and have a website, but which did not previously have an industry prediction, now have one - so customers have more leads with accurate revenues, in their target industries.

Ecommerce: Online Payment Capabilities

Another focus area for many of our sales and marketing customers is identifying businesses with ecommerce capabilities (i.e., those that accept online payments for goods or services).

Our existing ecommerce model identified around 700,000 ecommerce businesses in the US, but we know there are many more and wanted to expand this coverage to better serve our customers who want to target these businesses - whether they offer online payment processing, online buy-now-pay-later or point of sale financing, or offer auxiliary services to online retailers such as fulfillment/delivery of online orders.

Again, the first place I would look when trying to establish whether a business has ecommerce capabilities is its website, to figure out if you can buy anything on there! Feeding our scraped website data to our ecommerce model therefore seemed like a good way to identify additional ecommerce businesses: the model can access the content of the website when making its prediction, in the same way you or I can.

With the inclusion of our scraped data, we are now able to identify >1.5M US businesses that accept online payments. This is a huge win for customers who care about online payment processing, consumer financing of online purchases, or fulfillment of online orders, as we can now provide even larger prospecting lists within their ICP.

Want to learn more?

Current customers can reach out to their CS representatives with questions and feedback. If you’re new to Enigma and interested in our KYB products, please get in touch.

Restaurants - Revenue Growth Since 2017

Enigma — Tue, 19 Mar 2024 00:00:00 GMT

Denver experienced a COVID boom, outperforming 2017 restaurant revenues even at the height of COVID-19. In 2023, revenues for the city were up, but nowhere near mid-2020 and late-2022 levels. Denver was something of an outlier during this period - most other large or growing cities saw their restaurant revenues fall to ~40% of their January 2017 index during COVID, rebound over the next year to well above pre-COVID totals, flatten for 2 years, and decline slightly in 2023. Enigma's data can help you learn more about the near-real-time revenues of SMB accounts across the U.S. to help you better target, segment and engage prospects and customers.
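
For readers unfamiliar with indexed revenue, the sketch below shows how a series can be rebased so that January 2017 equals 100. The revenue figures are made up to mirror the typical pattern described above (a drop to about 40, then a rebound above 100) and are not actual city data.

```python
def index_to_base(series, base_period):
    """Rescale a revenue series so the base period equals 100."""
    base_value = series[base_period]
    return {
        period: 100 * value / base_value
        for period, value in series.items()
    }


# Hypothetical monthly restaurant revenue for one city, in dollars.
revenue = {"2017-01": 200_000, "2020-04": 80_000, "2021-04": 240_000}

indexed = index_to_base(revenue, "2017-01")
print(indexed)
```

Indexing this way lets cities of very different sizes be compared on the same chart, since each one starts at 100.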

The Enigma Data Catalog: Every Attribute, Explained

Enigma — Fri, 15 Mar 2024 00:00:00 GMT

Enigma provides a single reliable source of data about the business identity, firmographics, and financial health of small and medium businesses. This catalog covers every attribute available across card transaction data, business identity, compliance signals, and public records.

Use this as a reference when building integrations, evaluating data packages, or scoping a new use case.

Card Transaction Data

This group of attributes is derived from a panel of 700 million+ debit and credit cards, aggregated and matched to U.S. businesses using proprietary entity resolution.

Data	Description	Key Fields
Card Revenues	Monthly revenue a business receives from credit and debit card transactions, built from a panel of 700 million+ debit and credit cards.	Average monthly card revenue for the previous month, the previous three months, and the previous twelve months.
Card Revenue Growth	How card revenue is trending at a business over time.	Card revenue growth rate for the last twelve months compared to the previous twelve months; card revenue growth rate for the previous three months (seasonally adjusted and non-seasonally adjusted).
Card Transactions	Monthly number of credit and debit card transactions at a business.	Average monthly number of card transactions for the previous month, the previous three months, and the previous twelve months.
Card Transactions Stability	Distribution of card transactions at a business over time. Shows how many days, weeks, or months saw purchases at a business within a given time period.	Number of days transactions were present, number of weeks transactions were present, number of months transactions were present. Available over one month, three month, and twelve month time periods.
Customer Counts	Average number of daily customers a business has, based on credit and debit transactions.	Average daily count for the previous month, the previous three months, and the previous twelve months.
Card Refunds	Distribution of card refunds issued directly from the business.	Total refunds, average refund transaction size, refund to revenue ratio, refund to revenue growth rate. Available over one month, three month, and twelve month time periods.
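
As a rough guide to how the trailing-average and growth fields relate to a monthly series, here is a small sketch. The revenue numbers are hypothetical, and the growth formula (simple percentage change, with no seasonal adjustment) is an assumption about the definition rather than Enigma's documented calculation.

```python
def trailing_average(monthly_values, months):
    """Average of the most recent `months` entries (oldest first in the list)."""
    return sum(monthly_values[-months:]) / months


def twelve_month_growth(monthly_values):
    """Last twelve months compared to the previous twelve months."""
    last_twelve = sum(monthly_values[-12:])
    previous_twelve = sum(monthly_values[-24:-12])
    return (last_twelve - previous_twelve) / previous_twelve


# Hypothetical business: $10k per month for a year, then $12k per month.
monthly_card_revenue = [10_000] * 12 + [12_000] * 12

average_3m = trailing_average(monthly_card_revenue, 3)
average_12m = trailing_average(monthly_card_revenue, 12)
growth = twelve_month_growth(monthly_card_revenue)
print(average_3m, average_12m, f"{growth:.0%}")
```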

Business Identity and Firmographics

These attributes cover the core identity of a business — who they are, where they operate, and how they're structured.

Data	Description	Key Fields
Industry	Industry classification equivalent to a 2–6 digit NAICS code, signifying the primary purpose of a business.	Industry classification code (integer), industry classification description, industry classification type (string).
Enigma ID	Enigma's unique identifier for the business.	Enigma ID.
Aliases	The various names for the business found in Enigma's data sources. Names are ordered by prevalence.	Alias.
Addresses	List of addresses for the business.	Street, city, state, and postal code.
EIN Number	The EIN numbers associated with the business — the nine-digit employer identification number assigned by the IRS.	EIN.
Phone Numbers	The phone numbers associated with the business.	Phone numbers.
Year Incorporated	Year of the earliest incorporation date found in Enigma's data sources.	Year incorporated.
Websites	List of website URLs associated with the business.	Websites.
Associated People	People associated with the business and their titles.	Name, Titles.
Registered Agents	Registered agents associated with the business. A registered agent is a business or individual designated to receive service of process when a business is involved in a legal matter.	—
Corporate Structure	The legal structure under which the business is incorporated.	Corporate structure.

Business Verification and Compliance

These attributes support KYB workflows, compliance checks, and onboarding decisioning. See Enigma KYB for more on how these signals power verification.

Data	Description	Key Fields
Business Verification	Enigma's confidence that a business exists and is in good standing. Computed using several verification components: data freshness, footprint, and activity.	Verification score, data freshness, source quality, data footprint, business activity.
Match Confidence	Indicates the extent to which a query matched a business record in Enigma's SMB data asset, where 1 represents an exact match.	Match confidence (number from 0 to 1).
Matched Fields	Shows the name, address, and person in Enigma's data asset that were found.	Name, Person, Address. Address includes street address 1, street address 2, city, state, and postal code.
Match	Whether the match confidence for a returned business exceeds the match threshold.	Is matched ("True" or "false").
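
A minimal sketch of how Match Confidence and Match relate: a result counts as matched when its confidence exceeds a threshold. The threshold value, the function name, and the sample results are hypothetical and are not taken from the Enigma API.

```python
MATCH_THRESHOLD = 0.8  # hypothetical cutoff; pick one that fits your policy


def is_matched(match_confidence, threshold=MATCH_THRESHOLD):
    """Return True when the confidence (0 to 1) exceeds the threshold."""
    if not 0 <= match_confidence <= 1:
        raise ValueError("match confidence must be between 0 and 1")
    return match_confidence > threshold


# Hypothetical results: (label, match confidence).
results = [("exact", 1.0), ("close", 0.85), ("weak", 0.4)]
accepted = [label for label, confidence in results if is_matched(confidence)]
print(accepted)
```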

Corporate Registrations

Data	Description	Key Fields
Corporate Registrations	List of a company's corporate registrations, with full details. Can be used to verify a business.	Business name, domestic or foreign, mailing address, registered agent address, registered business address, site address.
Corporate Registrations (basic)	A list of states and corresponding registration dates for state-level corporate registrations found in Enigma's data sources. Does not provide details such as domestic or foreign status or mailing address.	State, Issue date, File number.

Public Records and Risk Signals

Data	Description	Key Fields
Construction Licenses	Data about businesses that currently hold (or previously held) a construction license.	License history flag, license start and end dates, license number, licenses classification, state-specific details.
SBA Loans	Data about loans backed by the U.S. Small Business Administration (SBA).	Total number of loans in each category, total value of loans in each category, loan status, loan approval date, number of months in loan term, approval amount, loan type.
WARN Notifications	Data about layoffs or closings provided in compliance with the Worker Adjustment and Retraining Notification (WARN) Act.	Layoff or closing, number of employees impacted, total headcount before layoff, percent of employees laid off, reason, notice received date, layoff or closing date.
Business Bankruptcies	Publicly filed bankruptcy information for a given company.	Bankruptcy flag, chapter type, filing date, case number.

How to Access This Data

Enigma's data is available via API and through the Enigma Console. Attributes are available individually or bundled into packages depending on your use case.

For go-to-market teams using card transaction data to identify fast-growing prospects, see the GTM guide to prospecting with card transaction data.

For risk and underwriting teams, see the guide to mitigating risk with card transaction data.

Ready to explore what's available for your use case? Talk to the Enigma team.

Mitigating Risk with Card Transaction Data 101

Enigma — Fri, 15 Mar 2024 00:00:00 GMT

Eight in 10 Americans report having at least one credit card, and there were more than 511 million active consumer credit cards in the United States in Q1 2020. The pandemic accelerated movement away from cash: according to McKinsey, by the end of 2020, U.S. consumers used cash for just 28% of transactions, compared to 51% a decade prior.

For risk leaders serving small businesses, this increasing adoption of cards — and the data and intelligence they generate — unlocks new opportunities for mitigating risk and monitoring growth across your small business portfolio.

What Is Card Transaction Data?

"Card transaction data" typically refers to data generated when a credit or debit card is used to purchase goods or services from a business. To protect privacy, individual cardholders are anonymized and transactions are aggregated.

Card transaction data can include more than just consumer credit cards. The data can be derived from all kinds of cards, including debit cards, small business cards, corporate cards, government benefit cards, and charge cards. It can also include digital transactions — also known as "card not present" transactions.

Where Does Card Transaction Data Come From?

Transaction data provided by data companies can come from a variety of sources. Data may come from a bank integration, or be aggregated by a card issuer, a credit card network, or a payment processor.

When working with transaction data, it's crucial to understand what kind of source it comes from. Many sources skew toward certain groups of consumers, geographic areas, or types of transactions. Knowing the size of the sample and any biases in the data source enables you to better understand how to derive trustworthy insights.

Raw transaction data is notoriously difficult to analyze. In its raw form, the data is messy, inconsistent, and sometimes duplicative, requiring organization and cleanup at scale before it's ready to use.

The Entity Resolution Problem

A real-world example: different payment processors refer to the same business — a coffee shop called Bodhi Leaf in Orange, California — as "Bodhi L," "Bodhi Leaf Coffee," "Bodhi Leaf Coffee Traders," "Bodhi Leaf Trading Company," and "Bodhi Leaf Tradi."

Uniting this data into a holistic view of transactions at a business level requires sophisticated algorithms and entity resolution techniques to clean and match the data. Enigma's dataset aggregates and matches raw transactions to more than 10 million U.S. businesses.
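
To give a feel for the problem, here is a deliberately naive sketch that groups the descriptors above by comparing their leading words and allowing truncation. Real entity resolution uses far more signal (addresses, categories, learned models); the rule, the function names, and the extra "Bodhi Tree Yoga" descriptor are assumptions for illustration and not Enigma's method.

```python
def tokens(name):
    """Lowercase a descriptor and split it into words."""
    return name.lower().replace(",", " ").split()


def prefix_compatible(word_a, word_b):
    """True if one word is a truncation of the other ("l" vs "leaf")."""
    return word_a.startswith(word_b) or word_b.startswith(word_a)


def same_business(name_a, name_b, leading_tokens=2):
    """Compare only the first few words, tolerating truncated words."""
    first = tokens(name_a)[:leading_tokens]
    second = tokens(name_b)[:leading_tokens]
    return len(first) == len(second) and all(
        prefix_compatible(a, b) for a, b in zip(first, second)
    )


descriptors = [
    "Bodhi L",
    "Bodhi Leaf Coffee",
    "Bodhi Leaf Coffee Traders",
    "Bodhi Leaf Trading Company",
    "Bodhi Leaf Tradi",
    "Bodhi Tree Yoga",  # hypothetical unrelated business
]
reference = "Bodhi Leaf Coffee Traders"
matched = [d for d in descriptors if same_business(d, reference)]
print(matched)
```

The five real descriptors group together and the unrelated one is left out, but a rule this loose would also merge "Bodhi L" with any "Bodhi Lounge", which is why production systems lean on more than names.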

How Risk Teams Use Card Transaction Data

Historically, card transaction data was used as a bellwether for consumer trends. It's been increasingly recognized that this same data provides valuable insights about the health of a business. Trends in card revenues, transaction volumes, and customer concentrations reveal whether a business is growing or declining. When aggregated by business, this data is often referred to as "merchant transaction data."

Card revenue does not reflect all of a business's revenue, but the accelerating shift away from cash makes it an increasingly reliable signal. Merchant transaction data is especially helpful in industries where a high proportion of transactions are made by card — retail shops, restaurants, and service providers in particular.

Three Ways Risk Teams Are Using This Data

Accelerate the underwriting process. Timely data on business revenue removes friction from underwriting — you can ask for less paperwork and have more confidence in signals of business health built from actual card transactions. This type of data has allowed organizations to increase underwriting approvals without increasing risk.

Set, monitor, and adjust credit limits. When you can monitor portfolios and proactively identify customer accounts eligible for more credit, you can increase credit lines to drive higher spending per account. Being able to identify higher-risk accounts earlier means you can pinpoint when to decrease credit lines and dial back risk.

Streamline the pre-approval process. Get a look at monthly revenues and transactions without asking — and waiting — for bank statements or application forms.

What to Consider When Selecting a Transaction Data Source

When evaluating a card transaction dataset, the right questions help you compare options and understand which dataset best suits your needs.

Latency. How fresh is the data? How frequently is it updated?

Coverage. How many cards are included in the panel? Is it just credit cards or debit cards as well? How many businesses are covered?

Panel bias. What is the scope of the panel? Is it just Visa or just Mastercard? Is it skewed toward certain geographies or income classes?

Permissioning. Some data providers require you to get permission from a business before accessing its transaction trends. Others — like Enigma — have already integrated privacy protection into their system so that you can immediately access data about any business.

The Data: What Enigma's Merchant Transaction Signals Include

Enigma's Merchant Transaction Signals are derived from a panel of more than 750 million anonymized credit and debit cards — across types like general purpose credit cards, consumer and small business debit, small business credit, health savings and flexible spending accounts, gift cards, and more. The data is matched to more than 10 million U.S. businesses and refreshed monthly.

Data	Description
Card Revenues	Monthly revenue a business receives from credit and debit card transactions
Card Revenue Growth	How card revenue is trending over time, with seasonally adjusted and non-seasonally adjusted views
Card Transactions	Monthly number of credit and debit card transactions
Card Transactions Stability	Distribution of transactions over time — how many days, weeks, or months saw purchases
Customer Counts	Average number of daily customers based on card transactions
Card Refunds	Refunds issued to credit or debit cards, including ratio of refunds to total revenue

For a full breakdown of every available field, see the Enigma Data Catalog.

Putting It to Work

Whether the market is challenging or conditions are favorable, monitoring your portfolio to reduce costs and find growth opportunities is a constant imperative. Risk teams that ground decisions in timely, accurate small business intelligence will be best positioned to protect and expand their small business portfolios.

GTM teams looking for a related perspective on using this data for prospecting can read the go-to-market guide to better prospecting with card transaction data.

Ready to see how Enigma's card transaction data can improve your risk decisions? Get in touch.

The Go-to-Market Guide to Better Prospecting with Card Transaction Data

Enigma — Fri, 15 Mar 2024 00:00:00 GMT

For go-to-market leaders engaging small businesses, the increasing adoption of cards, and the data and intelligence they generate, unlocks new opportunities for prospecting and prioritizing your ideal customers.

What Is Card Transaction Data?

Where Does Card Transaction Data Come From?

The Entity Resolution Problem

Uniting this data into a holistic view of transactions at a business level requires sophisticated algorithms and entity resolution techniques to clean and match the data. This is a core part of what Enigma does: our dataset aggregates and matches raw transactions to more than 10 million U.S. businesses.

How GTM Teams Use Card Transaction Data

Historically, card transaction data was used as a bellwether for consumer trends. Recently, it's been recognized that this data can also provide valuable insights about the health of a business. Looking at trends in card revenues, transaction volumes, and customer concentrations can reveal whether a business is growing or declining. When aggregated by business, this data is often referred to as "merchant transaction data."

Card revenue does not reflect all of a business's revenue, but the accelerating shift away from cash makes it an increasingly reliable signal. Merchant transaction data is especially helpful for businesses operating in industries where a high proportion of transactions are made by card — retail shops, restaurants, and service providers in particular.

Four Ways GTM Teams Are Using This Data

Build custom ideal customer profile lists. Instead of manual research, you can use revenue trends to identify fast-growing businesses and prioritize prospecting targets with criteria relevant for your business — industry, revenue, or growth metrics.

Qualify leads. Visibility into your leads' revenue trends helps you improve lead segmentation and scoring for better ROI on campaigns. Signals in this data — say, no transactions present in the past 12 months — can also help you remove closed businesses from your database, so you're not wasting money sending direct mail to an empty storefront.

Improve segmentation. When transaction data is packaged with identity data, monthly intelligence helps you fill in and clean up your database. Better segmentation means higher response rates and less wasted spend on unqualified businesses.

Grow revenue from existing customers. Card transaction data is a valuable tool for identifying cross-sell and upsell opportunities with existing customers. Deep insights about a business's revenue enable you to market the right products at the right time.
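The lead-qualification idea above (treating a business with no transactions in the past 12 months as likely closed) can be sketched in a few lines. The lead records, the `likely_closed` helper, and the 365-day window are assumptions for illustration only; a real workflow would read these dates from whatever transaction dataset you use.

```python
from datetime import date

# Hypothetical lead records: business name -> date of the most recent
# card transaction (None means no card transactions were observed).
last_transaction = {
    "Corner Deli": date(2024, 2, 20),
    "Main St Books": date(2022, 11, 3),
    "Old Mill Cafe": None,
}

def likely_closed(last_seen, as_of, max_gap_days=365):
    """Return True when no transaction falls inside the trailing window."""
    if last_seen is None:
        return True
    return (as_of - last_seen).days > max_gap_days

as_of = date(2024, 3, 15)
active_leads = [name for name, seen in last_transaction.items()
                if not likely_closed(seen, as_of)]
print(active_leads)  # ['Corner Deli']
```

A missing transaction history is only a signal, not proof of closure, so a list built this way is best treated as a candidate set for review rather than an automatic purge.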

What to Consider When Selecting a Transaction Data Source

When evaluating a card transaction dataset, the right questions help you compare options and understand which dataset best suits your needs.

Latency. How fresh is the data? How frequently is it updated?

Coverage. How many cards are included in the panel? Is it just credit cards or debit cards as well? How many businesses are covered?

Panel bias. What is the scope of the panel? Is it just Visa or just Mastercard? Is it skewed toward certain geographies or income classes?

The Data: What Enigma's Merchant Transaction Signals Include

Enigma's card transaction data is derived from a panel of more than 750 million anonymized credit and debit cards — across types like general purpose credit cards, consumer and small business debit, small business credit, health savings and flexible spending accounts, gift cards, and more. It's matched to more than 10 million U.S. businesses and refreshed monthly.

Data	Description
Card Revenues	Monthly revenue a business receives from credit and debit card transactions
Card Revenue Growth	How card revenue is trending over time, with seasonally adjusted and non-seasonally adjusted views
Card Transactions	Monthly number of credit and debit card transactions
Card Transactions Stability	Distribution of transactions over time — how many days, weeks, or months saw purchases
Customer Counts	Average number of daily customers based on card transactions
Card Refunds	Refunds issued to credit or debit cards, including ratio of refunds to total revenue

For a full breakdown of every available field, see the Enigma Data Catalog.

Putting It to Work

Whether you're headed into a challenging market or it's clear skies ahead, reducing costs and fueling growth are business imperatives. GTM teams that rely on timely, accurate small business intelligence for marketing and sales decisions will be best positioned to make smart investments — in their budgets and in their time.

Risk teams looking for a related perspective can read the guide to mitigating risk with card transaction data.

Ready to see what Enigma's card transaction data can do for your pipeline? Get in touch.

Restaurants - Businesses % vs. GPV %

Enigma — Tue, 12 Mar 2024 00:00:00 GMT

Fewer than half of restaurants have multiple locations, yet these chains take up an outsized share of Gross Processing Volume (GPV), the total value of transactions that pass through a payments system. Businesses with 10+ locations account for 52% of total restaurant GPV in the U.S., while single-location restaurants account for only 29%.

Enigma's data can help you learn more about the revenue of single-location and chain SMB accounts to help you better target, segment and engage prospects and customers.

Restaurants - Monthly YoY Revenue Growth

Enigma — Tue, 12 Mar 2024 00:00:00 GMT

Enigma's data, drawn from the largest transaction panel in the U.S., shows that after a year of negative national revenue growth, the restaurant industry rebounded in 2021, with growth peaking in April. However, growth slowed again in 2022 and by February 2023 had begun to decline. Enigma's data can help you learn more about the near-real-time revenues of SMB accounts across industries to help you better target, segment and engage prospects and customers.

Restaurants - Revenue by Order Value Bucket

Enigma — Tue, 12 Mar 2024 00:00:00 GMT

Fine dining is finding a foothold despite economic pressures in 2023. While low-cost and mid-cost restaurants are struggling to grow beyond pre-pandemic highs, restaurants with an average ticket of $75-$100 or more than $100 doubled their share of total restaurant revenue from 2017 to 2023.

Enigma's data can help you learn more about the near-real-time revenues of small businesses across different price points to help you better target, segment and engage prospects and customers.

Restaurants - The Infatuation Effect

Enigma — Tue, 12 Mar 2024 00:00:00 GMT

Call it The Infatuation Effect: While restaurants in cities across the US experience middling or negative growth, restaurants featured in The Infatuation’s Top 25 Restaurants lists outpaced them. This also holds true across price points: only high-end restaurants in San Francisco and Los Angeles beat out The Infatuation's Top 25.

Enigma's data can help you learn more about the near-real-time revenues of businesses across different price points to help you better target, segment and engage prospects and customers.

Restaurants - What's Cool to Drink

Enigma — Tue, 12 Mar 2024 00:00:00 GMT

Bubble tea leads the drinks pack. Average annual revenue growth of bubble tea shops outpaces wine bars and juice shops. Leading peers in revenue growth are i-Tea, Boba Love, Tea Top, Da Boba, and Tea Time.

Enigma's data can help you learn more about the near-real-time revenues of small businesses across industries and subindustries to help you better target, segment and engage prospects and customers.

Introducing Enigma Customer and Transaction Screening

Enigma — Tue, 05 Mar 2024 00:00:00 GMT

27 million people are currently victims of human trafficking worldwide. They are being forced into labor, sexual exploitation, and other forms of modern slavery. Human traffickers make an estimated $150 billion every year.

Fentanyl overdose is the leading cause of death for Americans 18-45. The drug is 50 times stronger than heroin and 100 times stronger than morphine. The illicit drug trade is a $650 billion business annually, greater than the entire GDP of Sweden.

Where does all that money go? Money laundering is the world’s third-largest business. An estimated $4 trillion is laundered every year. The bulk of that comes from nefarious activities and organized crime.

To combat this, every financial institution is constantly combing through a vast matrix of data for markers of financial crime. Financial institutions filed over 251 million suspicious activity reports to U.S. regulators last year alone.

Sanctions are a critical measure towards combating financial crimes and serving our national security interests. Sanctions compliance is a greater challenge than ever before thanks to the rapidly evolving geopolitical stage. Enigma set out to make sanctions compliance a sure bet for highly regulated institutions in the most cost-effective, reliable and transparent manner possible.

Why We Built Enigma Customer and Transaction Screening

We have spent years closely collaborating with our customers to learn about the challenges they face in screening massive amounts of customers and transactions. Major problems included high operational costs associated with many false positive matches, throughput challenges due to the vast scale of customers and transactions, a lack of control over system configuration and performance, and shortcomings in the auditability and explainability of results to regulators.

Five years ago, we were approached by one of our leading clients, a top-ten US bank, and were asked to help them overhaul their sanctions compliance program. Leveraging our entity resolution expertise, we developed a customer and transaction screening product that replaced the bank’s legacy screening solutions, both well-known vendor solutions in the space, vastly improving alert quality while assuring greater throughput, scalability and reliability. Upon go-live with our screening API, the bank realized a >75% reduction in alert volumes and millions of dollars per year in savings in operational overhead alone.

Our mission is to bring our state-of-the-art screening API to the broader ecosystem of clients and partners who are facing ever greater regulatory scrutiny in the fight against financial crimes and our nation’s and allies’ adversaries.

How Enigma Customer and Transaction Screening Can Help You

Enigma's screening service is built for regulated institutions that have a massive quantity of customers and transactions that need to be screened. Cloud-based and API-enabled, our service seamlessly integrates with the watchlists and case manager of your choice. Enigma Customer and Transaction Screening currently screens more than 1 billion real-time requests per month (customers, wires, ACH, P2P), effortlessly auto-scaling as processing demands fluctuate.

We provide customer screening both at the time of onboarding and on an ongoing basis for both persons and business entities, with options to fine tune screening of each according to your organization’s risk-appetite. We also batch monitor millions of customer records each night for any changes in sanctions status.

With fewer false positives and higher throughput, our transaction screening helps you put fewer transactions on hold due to compliance concerns. Enigma integrates with any type of transaction: wires, ACH/TCH, Zelle, etc. We have developed specialized free text screening to handle message body and transaction descriptions, while leveraging our entity screening service to assess transaction counterparties. The combination of the two ensures the highest recall and precision possible in identifying suspicious transactions.

How do we differ from other solutions out there? Enigma Customer and Transaction Screening:

Prioritizes Precision: Enigma’s tried and true matching methods reduce traditional sanctions alert volumes (where >99% are false positives) by at least 80% to help you cut through the noise and prioritize real alerts impacting your business.
Covers Multiple Watchlists: You choose which list sources you want to screen. Enigma can provide curated lists relevant to your business requirements or can integrate with any other list providers. You can also add or remove specific custom lists and keywords at any time.
Scales with High Throughput: Our cloud-based infrastructure autoscales, ensuring the ability to process large volumes of transactions and customers with low latency. We consistently surpass 1,500 requests per second throughput, and can go even faster.
Empowers Configurability & Control: Directly configure thresholds, adjust scoring weights, and create suppression and escalation rules anytime, with robust access controls to prevent undesired changes. No need to work with an account manager to make changes to your scoring model.
Allows for Auditability: Hit-level explanations are clear and intuitive. Enigma employs well-documented matching and scoring algorithms that are tunable and transparent to the end user. There is no black box: you can look up historical screening decisions and watchlist entity details anytime and understand the results.
Provides a Sandbox Environment: Sanctions and model risk teams can study the effects of adjusting system parameters or introducing new rules. Simply make the adjustments and run a batch test to see the impact. Privileged users can evaluate specific screening requests and decisions on demand, anytime they want.
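To make the idea of configurable thresholds and scoring weights concrete, here is a generic sketch of weighted attribute matching with a hit-level breakdown. It is not Enigma's matching algorithm or API: the weights, the threshold, the attribute names, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen to keep the example self-contained.

```python
from difflib import SequenceMatcher

# Hypothetical configuration: per-attribute weights and an alert threshold.
WEIGHTS = {"name": 0.7, "country": 0.3}
ALERT_THRESHOLD = 0.85

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(customer, entry, weights=WEIGHTS):
    """Weighted average of per-attribute similarities, plus the breakdown."""
    parts = {k: similarity(customer[k], entry[k]) for k in weights}
    total = sum(weights[k] * parts[k] for k in weights) / sum(weights.values())
    return total, parts

customer = {"name": "Acme Trading LLC", "country": "US"}
entry = {"name": "ACME Trading L.L.C.", "country": "US"}  # made-up list entry
total, parts = score(customer, entry)
decision = "ALERT" if total >= ALERT_THRESHOLD else "clear"
print(decision, round(total, 3), parts)
```

Returning the per-attribute breakdown alongside the total is what makes a decision explainable after the fact: a reviewer can see which attribute drove the score, and a sandbox run can show how changing a weight or the threshold shifts alert volumes.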

How to Learn More

We offer registered Enigma users access to an evaluation version of our screening console, where you can submit customer screening requests against OFAC lists. You will see firsthand the flexibility with which you can include varying degrees of identifiable information to increase the accuracy of the hits, as well as how you can configure relevance weights for each attribute screened, among other system parameters that affect scoring and alerting. The evaluation console is currently limited to customer screening only -- transaction screening is not yet open for public evaluation but soon will be.

We also invite you to learn more about our sanctions API for more programmatic interaction with our screening endpoint. From there, we can guide you through the steps necessary to integrate with any watchlists of your choosing, including custom lists, as well as any case manager that suits your preference. Because our API is list and case manager agnostic, it allows our clients to tailor the screening endpoint to satisfy their needs, budget and volume demands.

To get complimentary access to our evaluation console and API, please register for free.

Onboarding & KYB Product Updates: Q4 2023

Eric Land — Fri, 26 Jan 2024 00:00:00 GMT

To 2023, and Beyond

2023 was a year marked by both rising payment fraud (more than double that of 2021) and increased regulatory scrutiny for banks and other FIs. As compliance rules changed and regulatory fines rose, we released Know Your Business (KYB) solutions in Q4 2023 to address the challenges our customers were facing: Enigma Identity and Enigma KYB, which help our customers instantly verify more businesses and cut their KYB costs.

And stay tuned as we continue to further FIs’ ability to onboard customers and combat fraud in 2024.

Q4 2023 Updates

Enigma Launches Identity and KYB products

Enigma launched new Identity and KYB products in October 2023. With our KYB & Onboarding products, customers saw 1.5X higher instant verification rates compared to other providers and savings of up to 80% in onboarding costs. We have over 100M business entities in our database and our match rates in Q4 2023 reached up to 90%.

We have already onboarded several new customers and workflow partners to this new KYB product offering. On top of supporting end-users who use our products to directly onboard their customers, we also inked partnership deals with orchestration platforms like Alloy, Oscilar, Taktile, and Trulioo to make integrating our data into your KYB process easier.

As part of the launch, we created a KYB endpoint built on top of a data asset featuring Secretary of State registrations as its core building block. This endpoint can also be used to call other data attributes for broader identity purposes, but we built the endpoint to maximize registration coverage and accuracy.

Enigma Identity vs. Enigma KYB Explained

In Q4, we launched both Enigma Identity and Enigma KYB. To help you figure out if either is a fit for your needs, we wanted to give you a quick primer on the varying use cases and features of the two products.

What is Enigma Identity?

You’d use Enigma Identity if your end goal is simply to verify the identity of a business before you engage in a transaction with them. Enigma Identity helps you answer two questions: (a) “Is this business real and legitimate?” and (b) “Can I be confident in doing business with this business?”

You can use Enigma Identity to:

Make a decision on whether to engage in a business transaction
Make a decision on whether to offer a business trade credit
Make a decision on whether to use a supplier

The base data that you’d need to verify to quickly make decisions – and the data attributes we provide you with – include a business’ name, address, website, and whether or not they engage in a risky industry or activity.

What is Enigma KYB?

You’d use Enigma KYB, meanwhile, if your needs for verifying a business fall under KYB or AML compliance – e.g. if you’re a bank or other financial institution.

You can use Enigma KYB to:

Gain confidence in the businesses you’re onboarding
Meet your KYB and AML compliance goals
Protect against reputational risk

Scam alert? Targt.net and Cserv5.com are websites that exist, but if you run them through Enigma KYB you will find that these businesses have virtually no other identifying information, such as a verified name/address or corporate registrations.

Behind the Enigma Identity and KYB Curtain

To power Enigma Identity and Enigma KYB, we also developed new data attributes. In Q4 alone, we launched:

Name Verification of legal name as well as “doing business as” (DBA): An essential aspect of the due diligence process involves identifying and verifying both the legal name and any DBAs associated with a business. This is crucial because a business might interact with you using its DBA rather than its legal name. When seeking vital information about a business, such as its corporate registration, it is typically filed under the entity's legal name.
Address Verification of legal address as well as operating address: Verifying the legal address and operating locations of a business can impact various decisions during the due diligence process. Compliance, underwriting, and other financial decisions often necessitate a comprehensive understanding of a business's physical presence.
Website identification: Websites play a crucial role not only in confirming the legitimacy of a business but also as valuable sources of information regarding the business and its industry.
Identify industry and risky activities: Know if the businesses you’re onboarding operate or sell in risky industries such as cannabis, multi-level marketing, or firearms.
Corporate registration filings with the Secretary of State (only in Enigma KYB): Know if the businesses you’re onboarding have a registration filing with the Secretary of State that is active and in good standing. Registration information can help determine whether businesses are legitimate, while complying with KYB regulations.
OFAC watchlist screening for people and businesses (only in Enigma KYB): Mitigate against the financial and legal risks of associating with sanctioned entities and ensure compliance.

And that’s only the start. With high coverage of basic firmographic and registration data, any customer can now get to “knowing” the businesses they work with. Over the coming months, we’ll be reaching for the “beyond”, by which we mean holistic merchant risk, fraud, and more, by exploring how our proprietary transactions data can provide unique insight into chargebacks, corporate hierarchies, operating status, and more.

Want to learn more?

Current customers can reach out to their CS representatives with questions and feedback. If you’re new to Enigma and interested in our KYB products, please get in touch.

How FinCEN’s New BOI Rule Might Affect FIs

Enigma — Tue, 02 Jan 2024 00:00:00 GMT

As of January 1, 2024, businesses must report all beneficial ownership information (BOI) to the U.S. government, according to the latest FinCEN mandate. This new regulation falls under the 2021 bipartisan Corporate Transparency Act (CTA), which aims to make it harder for bad actors to hide gains through hidden ownership structures and shell companies. But there is still some uncertainty about what new BOI reporting standards – and a new BOI database – will mean for the financial institutions (FIs) that serve reporting businesses.

Currently, FIs rely on customer-reported BOI under the Customer Due Diligence (CDD) Final Rule — but the CTA database potentially offers a new avenue for collecting information. How FIs access and incorporate this data into their pre-established Bank Secrecy Act / Anti-Money Laundering (BSA/AML) compliance processes, though, still hasn’t been fully answered by FinCEN.

Enigma chatted with Ballard Spahr’s Peter Hardy, a former federal prosecutor, and current national thought leader on the subject of money laundering, anti-money laundering, and criminal tax law. Hardy discussed how FIs can use BOI obtained under the CTA to conduct enhanced due diligence, what accessing the new BOI database might look like for FIs, and the remaining questions facing FIs — such as how to align their duties under the CTA and CDD rules and how to adjust their compliance process to new BOI reporting standards.

This interview has been edited for length and clarity.

How will FinCEN’s new Beneficial Ownership Information regulations – outlined in the Corporate Transparency Act – affect FIs? How will these new regulations align or clash with standards outlined in the Customer Due Diligence Final Rule?

Although FinCEN still needs to issue proposed regulations aligning the CTA with the CDD Rule, the finalized BOI access regulations certainly shed some light on how FIs will be able to access and use BOI obtained under the CTA. First, and unlike the previously proposed BOI access rule, FIs will not be confined under the final access rule to requesting and using CTA BOI only for “pure” CDD Rule compliance. Instead, FIs will be able to access CTA BOI more broadly, such as for the purposes of maintaining their BSA/AML compliance program; compliance with sanctions screening; potential filing of Suspicious Activity Reports (SARs); and conducting enhanced due diligence. This is an important revision, which attempts to address prior criticisms from FIs and other stakeholders that broader access to BOI is necessary to both effectuate the goals of the CTA and for FIs to comply more effectively with the BSA in general.

What might access to a BOI database mean for FIs? Can FIs access this information?

The final access rule has streamlined access to the BOI database. This is certainly true for federal law enforcement, which may access and query the BOI database directly, but also for FIs, which likewise will have direct access to BOI, but in a more limited fashion than the government. FinCEN still needs to publish the proposed forms for requesting access to the BOI database, so the details of how exactly this will work remain unclear.

What are some of the remaining questions and challenges affecting FIs when it comes to collecting BOI?

The forthcoming proposed CDD alignment regulations will need to address several important remaining questions and potential challenges facing FIs.

First, they should state explicitly that FIs are not required to access the BOI database – particularly because FinCEN’s forthcoming BOI reporting form presumably will allow reporting companies to not provide key information.

Second, they should provide a clear and practical mechanism for FIs to address situations in which BOI collected under the CDD Rule does not match BOI obtained through the CTA – particularly because FinCEN has indicated that it will not verify the accuracy of BOI collected under the CTA.

Third, and assuming that the proposed regulations change the current exceptions to CDD Rule reporting (because exceptions to reporting under the CTA and the CDD Rule are currently different), they should explain clearly how FIs can effectively adjust their current CDD Rule reporting systems, which have been in place for years, and provide sufficient time to do so.

Fourth, they should include a safe harbor from liability for FIs that use BOI obtained from the CTA database.

Fifth, they should explain clearly how FIs can obtain customer consent to access CTA BOI.

Finally, they should state explicitly that FIs may rely on BOI obtained from the CTA database, just like FIs may rely upon BOI obtained from customers under the current CDD Rule.

Enigma aims to help FIs with these ongoing KYB and onboarding challenges through Enigma KYB. We enrich traditional KYB data sources like Secretary of State filings with foreign filings, operating names, operating addresses, and websites, to provide a complete view of a business. This means we can better match businesses based on DBA names and other common application inputs. Because of this, in tests we found Enigma was able to instantly verify 70% of business applicants, compared to 40.8% by the leading competitor. Contact us to learn more.

A/B Test Results Show That Enigma Data Is Helping Banks Grow 1.4x More Efficiently

Enigma — Fri, 15 Dec 2023 00:00:00 GMT

After the collapse of First Republic and Silicon Valley Bank earlier this year, banks have had an increased focus on growing deposit relationships and volumes. Enigma is helping banks on this front, providing data to improve their prospecting efforts and win new customers.

Specifically, Enigma helps banks target the businesses that are most likely to convert at a given point in time. Enigma does this by providing prospects that fit a bank’s ideal customer profile (ICP) and visibility into which businesses may be experiencing life cycle events that make them more likely to want to change banking relationships (e.g. a business that just opened a new store).

For example, one bank wanted to reach businesses that met the following criteria:

Retail-focused with ecommerce capabilities
Opened a new store in the last few months or had recently crossed $1M in revenue
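Criteria like these translate naturally into a filter over prospect records. The sketch below is one possible reading of the example: the `Prospect` fields, the three-month window for "the last few months," and the definition of "recently crossed $1M" (prior 12 months below the bar, trailing 12 months at or above it) are all assumptions for illustration, as are the sample businesses.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prospect:
    name: str
    industry: str
    has_ecommerce: bool
    months_since_new_store: Optional[int]  # None = no new store observed
    trailing_12m_revenue: float
    prior_12m_revenue: float

def fits_icp(p, new_store_window=3, revenue_bar=1_000_000):
    """Retail with ecommerce, plus a recent new store or a recent $1M crossing."""
    opened_recently = (p.months_since_new_store is not None
                       and p.months_since_new_store <= new_store_window)
    crossed_bar = p.prior_12m_revenue < revenue_bar <= p.trailing_12m_revenue
    return (p.industry == "retail" and p.has_ecommerce
            and (opened_recently or crossed_bar))

# Made-up prospects to exercise each branch of the rule.
prospects = [
    Prospect("Bright Threads", "retail", True, 2, 640_000, 580_000),
    Prospect("Harbor Goods", "retail", True, None, 1_050_000, 910_000),
    Prospect("Fixit Hardware", "retail", False, 1, 2_300_000, 2_100_000),
    Prospect("Luna Bistro", "restaurant", True, 1, 1_200_000, 900_000),
]
targets = [p.name for p in prospects if fits_icp(p)]
print(targets)  # ['Bright Threads', 'Harbor Goods']
```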

While the specifics of the targeting criteria change from bank to bank, Enigma data is helping banks more efficiently reach their desired targets.

Below, we answer the question of just how much our data is working its magic by diving into real results.

Methodology: With the help of marketing agencies, we asked bank and payments provider partners to A/B test the impact of using Enigma data in their sales and marketing campaigns. The test group was the set of prospects they generated using Enigma data. The control group was a different sample of prospects without the benefit of Enigma data, generated from the more traditional business data sources. They then measured the number of successful sign-ups (conversions) per dollar spent for both samples.

Results

The marketing campaigns were all omnichannel, covering both direct mail and online advertising. Most of the campaigns also featured some direct sales outreach to top prospects.

This A/B testing was conducted over the past 18 months, spanning several different quarterly growth campaigns.

On average, banks leveraging Enigma data saw a nearly 1.4x increase in conversion (new accounts created) compared to a control group that used traditional data sources.

Gross lift measures the uptick in total accounts created using Enigma leads versus leads from other sources.

For example, if we have the following conversion rates (% of accounts created out of total leads that were sent mail):

0.30% conversion rate when using Enigma
0.20% conversion rate when using other sources

This would imply:

0.30%/0.20% = 1.50 gross lift
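The same arithmetic can be written as a small helper that starts from raw counts rather than rates. The lead and conversion counts below are hypothetical, chosen only so that they reproduce the 0.30% and 0.20% rates in the example.

```python
def gross_lift(test_conversions, test_leads, control_conversions, control_leads):
    """Ratio of the test group's conversion rate to the control group's."""
    test_rate = test_conversions / test_leads
    control_rate = control_conversions / control_leads
    return test_rate / control_rate

# Hypothetical counts: 30 of 10,000 leads (0.30%) vs. 20 of 10,000 (0.20%).
lift = gross_lift(30, 10_000, 20, 10_000)
print(round(lift, 2))  # 1.5
```

Working from counts also makes it easier to compare groups of different sizes, since each rate is normalized by its own number of leads before the ratio is taken.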

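If you are a bank focused on efficient growth, these results indicate that you could achieve the same results with less spend. If you are focused on growing as fast as possible, they imply that you could grow your customer base 1-2x faster using the same sales and marketing spend.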

As we continue to receive A/B test results, we will update this blog series.

Read about similar outcomes in a recent case study, or reach out to our team to hear firsthand how our data is helping payments companies grow merchant accounts.

A/B Test Results Show That Enigma Is Helping Payments Providers Grow 3x More Efficiently

Enigma — Tue, 12 Dec 2023 00:00:00 GMT

Over the last 3 years, many payments companies and merchant services divisions of banks have turned to Enigma to help improve customer prospecting efforts.

It’s an obvious fit. Payments providers have a specific ideal customer profile (ICP) where they are positioned to win, and Enigma can uniquely identify prospects that fit that definition. A few examples of payments providers' ICP requests that Enigma can help segment by include:

Is a sit down restaurant but not a cafe or bakery
Has a Gross Processing Volume between $500k and $2M
Is processing with Square or Toast

Sales and marketing campaigns drive higher ROI when a target prospect list more precisely matches your ICP. If there is a looser fit, you and your customers waste money targeting prospects that aren't a good fit.

We often get asked by potential partners: I get the value of your product in concept, but how much does your data actually boost results?

So, we set out to answer this question quantitatively. And we're quite impressed with the results.

Methodology

With the help of marketing agencies, we asked four bank and payments provider partners to A/B test the impact of using Enigma data in sales and marketing campaigns. The test group was the set of prospects they generated using Enigma data. The control group was a different sample of prospects without the benefit of Enigma data, generated from the more traditional business data sources. After defining test and control groups, these providers then measured the number of successful sign-ups (conversions) per dollar spent for both groups.

Results

The marketing campaigns were all omnichannel, covering both direct mail and online advertising. Most of the campaigns also featured some direct sales outreach to top prospects.

This A/B testing was conducted over the past 18 months, spanning several different quarterly growth campaigns.

On average, payments providers saw a 3x increase in conversion using Enigma data compared to the control of using traditional sources.

Gross lift measures the uptick in merchant accounts created using Enigma leads versus leads from other sources.

For example, if we have the following conversion rates (% of accounts created out of total leads that were sent mail):

0.35% conversion rate when using Enigma
0.14% conversion rate when using other sources

This would imply:

0.35%/0.14% = 2.50 gross lift

If you are a payment processor focused on efficient growth, these results indicate that you could achieve the same results with roughly one-half to one-third of the spend. If you are focused on growing as fast as possible, the results imply that you could grow 2-3x faster using the same sales and marketing spend.

As we continue to receive A/B test results, we will update this blog series.

Read about similar outcomes in a recent case study, or reach out to our team to hear firsthand how our data is helping payments companies grow merchant accounts.

Controlling for Bias: How We're Strengthening Our Card Revenue Estimates and Making Them Even More Accurate

Enigma — Fri, 01 Dec 2023 00:00:00 GMT

When it comes to targeting business customers in sales and marketing campaigns, revenue is one of the most important segmenting dimensions. Knowing a company’s revenue, and how it’s trending over time, is a powerful indicator of whether the business will be a fit for your product/service — and is an important input in understanding a customer’s lifetime value.

But accurate revenue data is incredibly hard to procure, especially at the time of prospecting. In many situations, the industry defaults to the use of modeled revenue, which is largely inaccurate.

At Enigma, we take a different approach by providing our customers with the most accurate revenue data out there. Our estimates of companies’ card revenues are derived from actual transaction data from a panel of over 750 million active credit and debit cards, covering ~40% of credit card transactions in the United States.

Because our panel doesn’t cover 100% of transactions, we must apply a projection factor to estimate the total revenue, transaction counts, and customer counts for a given business.

We initially launched our product using a standard projection factor (a multiplier of 2.86*) in order to provide our customers with revenue data as quickly as possible:

Card revenue Enigma sees x 2.86 = Total card revenue for a given business location

It’s always been our goal to create more sophisticated projection factors that account for biases in our panel. Now, we’re excited to announce our ability to adjust these projection factors to account for over- and underrepresented populations within the panel — and to calculate total card revenue more confidently than ever before.

Read on for the details of what’s changing, as well as what this means for Enigma customers.

What is panel bias?

Every panel has bias, and it is nearly impossible to create a selection that is perfectly representative of the total population. What’s important when building a data product is to recognize that bias exists, and control for it.

For example, imagine if Enigma’s card panel had 50% of people in Town A, but only 33% of people in Town B. If Enigma applied a standard projection factor to transactions seen in the two towns, then the revenues of businesses in Town A may be systematically overestimated, and the revenues of businesses in Town B may be systematically underestimated.

The ideal outcome in this simple, hypothetical scenario would be to use a factor of 2 for Town A (1 ÷ 50%) and a factor of 3 for Town B (roughly 1 ÷ 33%).
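
A minimal sketch of the two-town example, assuming the projection factor is simply the inverse of panel coverage and that each business's customers mirror the town's coverage. This is a simplification for illustration, not Enigma's production model, and the observed revenue figure is made up.

```python
def projection_factor(panel_coverage: float) -> float:
    """Multiplier that scales observed panel revenue up to an estimated total."""
    if not 0 < panel_coverage <= 1:
        raise ValueError("panel_coverage must be in (0, 1]")
    return 1 / panel_coverage

factor_a = projection_factor(0.50)   # Town A: half of residents are in the panel
factor_b = projection_factor(1 / 3)  # Town B: about a third of residents

observed = 10_000  # hypothetical card revenue seen in the panel for one business
estimate_a = observed * factor_a     # about 20,000
estimate_b = observed * factor_b     # about 30,000

# Applying the single standard factor of 2.86 instead would overstate the
# Town A business and understate the Town B business.
standard_estimate = observed * 2.86
```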

Strengthening our revenue calculation model and controlling bias

We no longer apply a single projection factor to all businesses, and now account for the following biases when determining the best projection factor to apply to transactions we see in our panel:

Geographic bias
Card type bias
Income bias
Size of purchases bias

We partnered with one of the largest US payment processors to evaluate our card revenue estimates against theirs. We found that our error rate decreased to 33%, an 8% improvement. We also found that, when ranking Business A against Business B by card processing volume, our estimates agreed with theirs nearly 75% of the time.

Below is a description of each type of bias that we are now adjusting for.

Geographic Bias

We now control for geographies where our panel holds a larger or smaller share of cards than it does elsewhere. Examples include:

LA is under-represented in our panel, so businesses in LA on average have a larger projection factor (3.4)
Alaska is over-represented in our panel, so businesses in Alaska on average have a smaller factor (2.2)

Card type bias

We see a larger percentage of consumer credit cards in the US than we see debit cards, EBT, HSA, or FSA cards. Therefore, business locations that heavily rely on these card types are now assigned a higher projection factor on average.

For example, grocery stores, which tend to see more EBT and debit card usage, now have an average projection factor of 3.2 (compared to a 2.9 average). Electronics stores, where credit card usage is higher, now have an average projection factor of 2.6.

Income bias

Lower income groups are underrepresented in our panel, in part due to their reliance on debit cards. So, on average, we’re applying a higher projection factor to businesses in zip codes with lower per capita income, and vice versa. Examples include:

Lower income areas:

Business locations in Columbus, GA (31903) have an average projection factor of 3.3, where the average household income is ~$29k
Business locations in Detroit, MI (48208) have an average projection factor of 3.4, where the average household income is ~$28k

Higher income areas:

Business locations in Arlington, VA (22207) have an average projection factor of 2.5, where the average household income is $249K
Business locations in St. Louis Park, MN (55424) have an average projection factor of 2.5, where the average household income is $249K

Size of purchases bias

We’ve seen that our coverage of transactions differs depending on the size of transactions: we are more likely to cover larger transactions. This could be related to the aforementioned debit bias. As a result, a business location that has a higher proportion of larger transactions will now get a lower projection factor, and vice versa.

Business locations with average transaction size < $20 have an average projection factor of 3.4
This compares to business locations with an average transaction size >$500, which have a projection factor of 2.4

So, what does this mean for Enigma customers?

First and foremost, it means that the card revenue estimates you are receiving from Enigma have become more accurate: we believe they are about 8% more accurate than before. The same is true for transaction counts and customer counts.

These improvements are effective as of 12/1/23, and will automatically flow into your next data delivery or API calls. You do not need to do anything different to start receiving these improvements.

If there are businesses that you’ve been tracking over time, you may see some meaningful changes to the card revenue estimates. These changes are expected, and represent an improvement in the accuracy. You will not see any changes to growth rates based on these projection factor improvements, because the projection factors have been applied to all historical months in our time series.
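
To see why growth rates are unaffected, here is a small sketch with made-up monthly figures. It assumes a single projection factor per business location, applied to every month in the series; 2.86 is the original standard factor and 3.4 is one of the example factors quoted above.

```python
def growth_rates(series):
    """Month-over-month growth for a list of monthly revenue figures."""
    return [(curr - prev) / prev for prev, curr in zip(series, series[1:])]

observed = [10_000, 11_000, 12_100]  # hypothetical panel revenue by month

old_estimate = [x * 2.86 for x in observed]  # original standard factor
new_estimate = [x * 3.4 for x in observed]   # example location-specific factor

old_growth = growth_rates(old_estimate)
new_growth = growth_rates(new_estimate)
# The revenue levels differ, but the factor cancels out of each ratio,
# so both series show the same 10% month-over-month growth.
```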

You can also expect card revenue accuracy to continue to improve from here, given that we started with a conservative model that we will continue to expand upon.

Stay tuned for more updates and product improvements as we continue to deliver on our mission of providing the most accurate revenues for private businesses across the US.

*This projection factor is derived by dividing the total credit and debit card transaction volume, according to the Fed, by the total transaction volume that Enigma sees.

Using Enigma KYB for Business Identity Verification

Enigma — Tue, 03 Oct 2023 00:00:00 GMT

The Business Identity Verification Problem

Financial institutions (FIs) attempting to comply with Know Your Business (KYB) requirements and verify the identity of the businesses they partner with are faced with numerous challenges today.

First off, KYB rules are relatively new – first established in 2016 in the Customer Due Diligence (CDD) Rule and changing even as recently as the past few years. Secondly, developing a KYB solution in-house can be quite costly and time-consuming. Understanding the relevant regulations and rules, adopting them into compliance standards and policies, and building technology to support an ever-growing customer base can all require a large investment and significant overhead. Finally, the external KYB solutions available to FIs today are often costly and are built on slow processes like manual verification.

In turn, FIs have been looking for a solution that can adapt to the latest requirements, take the manual lift out of KYB, and reduce costs. Today, we’re announcing the launch of Enigma KYB, a solution to help you verify businesses and reduce onboarding costs via instant verification of Secretary of State (SoS) filings, risky activities classification, and OFAC screening.

How Enigma KYB Differs from Traditional Solutions

Traditional Providers’ Business Identity Verification

Traditional KYB providers rely on:

“Auto-Approvals”: Traditional providers supplement data gaps with manual verification that can take hours and days, leading to customer drop offs
Unclear pricing: Traditional providers tier verification based on various features needed, making the costs of KYB confusing and hard-to-predict
High costs: Many traditional providers bake in the costs of manual verification into simple verifications, leading to higher costs for you

Enigma’s Business Identity Verification

Enigma’s KYB solution offers:

Market-leading instant verification: We can automate more than 70% of your business identification process in under 3 seconds
Leading industry coverage of business registration filings: We instantly verify 1.5X more businesses than peers
Simple, low costs: We offer one set price across each business verified. These prices save you 50% on overhead if you have existing KYB providers and 80% if you don’t currently use an external data provider at all
Flexibility: We can help you integrate Enigma KYB into your existing data process or help you build a new KYB process
Speed: Streamline KYB from a multi-day process to near instant auto-approvals

How We Do Business Verification Differently

Wondering how we’re able to instantly approve more businesses than other providers while cutting overhead costs?

Traditional providers rely on perfectly inputted SoS filings in order to auto approve – then move to manual approval when filings aren’t complete. Enigma instead enriches SoS filings information with DBA names, websites, and operating locations of a business to be able to surface the correct SoS filing without perfect inputs.

Because this process lets us instantly automate more than 70% of your business verification, as it pertains to registration filings, you can onboard more clients for less, rather than waiting for time-consuming manual approvals.
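
Enigma's actual matching logic is not described in this post. Purely to illustrate how enrichment lets an imperfect input still surface the correct filing, the sketch below compares a submitted name against both the legal name and the known DBA names of each candidate filing, using Python's standard `difflib`. The record layout, sample data, and the 0.85 threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical enriched filing records; not real data or Enigma's schema.
filings = [
    {"filing_id": "A1", "legal_name": "JRM Hospitality Group LLC",
     "dba_names": ["Joe's Pizza"]},
    {"filing_id": "B2", "legal_name": "Northside Plumbing Inc",
     "dba_names": []},
]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def best_filing(submitted_name: str, threshold: float = 0.85):
    """Return the filing whose legal or DBA name best matches, or None."""
    best, best_score = None, 0.0
    for filing in filings:
        names = [filing["legal_name"], *filing["dba_names"]]
        score = max(similarity(submitted_name, n) for n in names)
        if score > best_score:
            best, best_score = filing, score
    return best if best_score >= threshold else None

match = best_filing("joes pizza")            # found via the DBA name
no_match = best_filing("Acme Widgets Corp")  # would go to manual review
```

Without the DBA name attached to filing A1, the applicant's "joes pizza" would have nothing close to match against the legal name alone.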

Enigma KYB’s Features

Enigma KYB offers:

Secretary of State (SoS) business registrations: Instantly verify businesses’ names, addresses, websites, and officer data across Secretary of State filings to approve more businesses, more quickly. Statuses are updated bi-weekly, making it easy to flag businesses that are no longer in good standing.
High-Risk Activity Classification: Discover whether businesses conduct high-risk activities – and make better decisions with that knowledge.
OFAC Watchlist Screening: Screen businesses and owners across watchlists that are refreshed on a weekly basis so you know who you’re doing business with. Enigma’s screening engine is used by a top 10 U.S. bank, has been audited by regulators, and is proven to have low false positives without an increase in false negatives.
Identity across all business firmographics: Retrieve Identity attributes, from operating names and addresses, legal names and addresses, industry classification, and industry-leading website coverage for performing lighter-weight due diligence.

Enigma KYB’s Results

In advance of the launch of Enigma KYB, we worked with a variety of financial institutions across payments, online banking, and KYB orchestration to test our efficacy compared to traditional providers.

We compared these FIs’ traditional registration fill rates (i.e. the share of businesses they were able to instantly verify without any manual steps) with their registration fill rates after using Enigma KYB. Decreased manual verification is critical for FIs looking to reduce overhead and process costs, as well as for stopping potential customers from dropping off during long verification wait times.

We increased registration fill rates by approximately 50%. With Enigma, these FIs increased their instant verification to 72.4% on average, across segments. We were particularly successful in the payments industry – one medical-focused payments provider, for example, doubled its registration fill rate with Enigma.

WANT TO LEARN MORE? CHECK OUT ENIGMA KYB OR WATCH A RECAP OF OUR WEBINAR, “HOW TO OPTIMIZE YOUR KYB PROCESS: BUILD IN-HOUSE, SINGLE PARTNER, OR WATERFALL.”

How Sales & Marketing Teams Can Target & Acquire the Right SMB Customers

Enigma — Thu, 17 Aug 2023 00:00:00 GMT

Businesses across the board are constantly looking for new ways to win over customers.

But proving relevance — and more importantly, need — for a product or service (especially amid economic uncertainty or in an over-saturated market) often requires a herculean effort. And with that, identifying and targeting the right prospects becomes increasingly crucial.

For sales and marketing teams, the customer acquisition challenge is magnified across the small business sector, often due to the limited data available on SMBs, making outreach efforts increasingly difficult for those trying to score small business clients. For example, not knowing whether a company earns $10K in revenue versus $10M in revenue could lead to incorrectly targeted, impersonal sales and marketing efforts — and ultimately result in a loss of business.

Having a more complete understanding of prospects, though, can change the playing field for companies looking to capitalize. Below, we dig into how sales and marketing teams can use Enigma data to level-up their efforts and more effectively target and acquire customers.

For a full deep-dive on this topic with our product experts, check out this webinar.

How sales and marketing teams can leverage Enigma data to win over prospects

Marketing to SMBs across industries is expensive, difficult, and time-consuming — and many organizations have fallen short of securing the right prospects to help grow their top line. As conversion rates continue to plummet across industries, marketing tactics need to expand beyond a one-size-fits-all approach and become more personalized than ever before.

Enigma has intelligence on every SMB across the United States, and part of what makes the dataset so useful is the inclusion of third-party data — like real-world revenue numbers not based on models — that, when paired with other data points, paints a comprehensive view of an SMB.

This means that for any US-based SMB accepting card revenues, Enigma can share transaction-level intelligence including, but not limited to:

Month-over-month revenue, which updates every month
Payments systems (Square, Toast, Stripe, etc.) being used by businesses
Information on whether a shop operates via e-commerce or offline
Location data
And much more

With more granular data, companies can target SMBs more effectively and win on the most ideal prospects within a given market. For example, trigger- and event-based marketing becomes more possible as sales and marketing teams have a more complete picture of SMBs.

Imagine a small business’s online card revenue hits $1M for the first time. For some of our credit card customers, this is the difference between an SMB being eligible for a typical consumer card versus being eligible for a charge card. With Enigma data, that distinction can be made and action can be taken once the milestone is hit.

In another scenario, a customer can benefit from knowing when an SMB in their target market has seen consecutive monthly growth. With this knowledge, a company can send out relevant incentives that otherwise may not be applicable to those with inconsistent growth patterns — and instead of misdirected, generalized marketing efforts, businesses receive hyper-personalized and relevant offerings that match their achievements.
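
Both triggers can be sketched as simple checks over a monthly series. The figures below are hypothetical, and the series is assumed to hold trailing-twelve-month online card revenue, one value per month; the post does not specify how the $1M milestone is measured.

```python
def first_month_at_or_above(series, threshold):
    """Index of the first value that reaches the threshold, or None."""
    for i, value in enumerate(series):
        if value >= threshold:
            return i
    return None

def trailing_growth_streak(series):
    """Count consecutive month-over-month increases ending at the latest month."""
    streak = 0
    # Walk backwards through (previous, current) pairs from the latest month.
    for prev, curr in zip(reversed(series[:-1]), reversed(series[1:])):
        if curr > prev:
            streak += 1
        else:
            break
    return streak

# Hypothetical trailing-12-month online card revenue, one value per month.
revenue = [820_000, 910_000, 880_000, 960_000, 1_040_000]

milestone_month = first_month_at_or_above(revenue, 1_000_000)
growth_streak = trailing_growth_streak(revenue)
```

A campaign could then fire when `milestone_month` points at the latest month, or when `growth_streak` reaches whatever number of months the marketer chooses.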

Success story: How Enigma powered FMCG outcomes with real-world revenue data

FMCG, a Deluxe company, works with a number of financial institutions, including many regional and super-regional banks. The company had previously employed data providers to help 4 of their banking clients increase account creation across 3 major divisions:

Merchant services
Business lending
Business banking

Unsatisfied with initial conversion outcomes (attributed to inaccurate revenue data sources), FMCG looked to the use of our real-world revenue data to get the job done. The firm believed that by using Enigma data, it could more precisely target and segment businesses based on revenues and processing volumes, and thus improve outbound marketing efforts given the more personalized approach.

As a result, all banks involved saw more than 150% growth in new accounts created across their merchant services divisions. In one case, a super-regional bank saw a 295% lift. (Read the full case study here).

This is just one specific example of how Enigma has enabled sales and marketing success — but we’ve got plenty more to share. Read about how our data has helped salon software leader Phorest achieve similar outcomes, or reach out to our team to hear firsthand how our data is positively impacting clients across the board.

How to Build Your Ideal Customer Onboarding Process

Enigma — Tue, 15 Aug 2023 00:00:00 GMT

Customer onboarding can be tricky with the added hurdle of meeting Know Your Business (KYB) regulations. KYB refers to financial institutions' legal duty to verify a business, verify that business’s managers and owners, and monitor and track risk of that business over time. KYB can be a costly and time-consuming process with elaborate verification requirements. Moreover, the legislation is relatively new – KYB requirements were only established in 2016 – and is constantly being updated with new regulations.

In this blog, we’ll talk through the options that financial organizations have to meet KYB compliance regulations when onboarding customers, as well as how to choose the best process for your organization.

Customer Onboarding Solutions

Financial institutions (FIs) have three different options in meeting their KYB compliance and customer onboarding goals:

Handling the process fully in-house
Working with a single outsourced service and data provider
Waterfalling multiple data providers and orchestration platforms

Fully In-House

The What:

Many smaller FIs choose to take their KYB processes in house, instead of relying on a data provider or providers. Internal teams build processes and technology to approve or deny businesses based on KYB standards.

The How:

Invest: An FI uses their engineering team to build an auto-approval infrastructure in-house or builds a team to manually approve businesses. Most invest in a combination of the two – an auto-approval infrastructure and a manual team for when that fails.
Verify Businesses: Internal teams and/or data infrastructure pull the data needed to verify businesses – name, addresses, registrations status, and other filing details from Secretary of State (SoS) filings, at the minimum, and risky activities for a more sophisticated KYB process. Internal teams and/or data infrastructure also run businesses and their beneficial owners against the Office of Foreign Assets Control (OFAC) list to confirm they aren’t engaging with parties subject to sanctions.
Verify UBOs: Internal teams and/or data infrastructure pull the data needed to verify Ultimate Beneficial Owners (UBOs) – any person with a 25% stake in the business. The Treasury’s FinCEN anti-money-laundering unit has promised to establish a database of UBOs. For now, however, UBO info is sometimes present on the SoS filing and sometimes is not. Internal teams and/or data infrastructure also run UBOs against the Office of Foreign Assets Control (OFAC) list to confirm they aren’t engaging with parties subject to sanctions.
Monitor Businesses Over Time: The CDD Rule mandates that KYB is a continuous process, where FIs check and recheck information about the businesses and UBOs they work with. In-house KYB requires that FIs invest in updating their customers’ information and rechecking business legitimacy over time.

The Pros:

Control: You know your team and your business better than a third-party provider could. Building KYB in-house allows you to leverage this knowledge across every step of the process, and deeply customize a solution for your needs.
Good for simple KYB processes: If you have a simple product with a limited customer base and only need to pull from one or two data sources to meet your KYB needs, then building in-house could make sense. You can avoid a high initial cost from setting up a partnership with a provider, as well as build a process that works for you relatively easily.

The Cons:

Lack of expertise: Often organizations building KYB processes in-house may rely on talent with less specialized KYB knowledge. You may be building KYB infrastructure with a team of engineers lacking a compliance background, for example.
Limited auto-approval coverage: FIs typically develop less sophisticated technology to pull data, match data, and approve businesses with that data than a data provider who specializes in KYB auto-approvals and built their engineering team with that product functionality in mind. In turn, an FI’s internal system will likely auto-approve fewer businesses – and thus may lead to lost customers who use another, faster provider instead.
Costly: Whether you dedicate an in-house engineering team to create an infrastructure for approvals (and data integration), rely on a large staff for manual approvals, or some combination of the two, there is a large overhead for internal KYB.
Time-consuming: Alongside financial resources, an in-house KYB process takes more of an FI’s own time to build and run.

Outsourced service and data provider

The What:

Some FIs work with one outsourced service and data partner to either supplement a preexisting, in-house KYB process or to help build their business onboarding processes from the ground up.

The How:

Invest: An FI partners with a single service and data provider, usually paying the provider a setup fee as well as annual fees to license that provider’s data to continually verify businesses over time. An FI’s single provider typically handles auto-approvals – so an FI doesn’t have to build an infrastructure in-house. FIs also invest resources into manual approval for businesses that can’t be auto-verified. Some partner with a single provider who provides a manual verification service on top of auto-approvals; other FIs handle manual approvals in-house.
Verify Businesses: The service and data provider auto-approves certain businesses and flags others for manual review. Businesses without an SoS filing; that have a mismatched address, name, or person; that conduct activities in high-risk industries like cannabis or adult entertainment, for example; or that are matched to the OFAC list are all flagged. Businesses that aren’t auto-approved are either manually approved in-house at the FI, or the provider offers a manual verification service for the FI (at an additional cost).
Verify UBOs: The service and data provider pulls the data needed to verify UBOs from SoS filings when present and runs UBOs against the OFAC list.
Monitor Businesses Over Time: In partnership, the FI and provider work together to update customer information and confirm business and UBO legitimacy over time. To do this, your vendor will check the status of SoS registrations periodically, re-screen for risky activities periodically, and re-screen against the OFAC list periodically.
Establish Trust in External Provider: If FIs are concerned about moving beyond their own front door with their KYB solution, they may enact processes to monitor their data partner’s accuracy such as taking a small sample of auto-approved businesses every month to ensure that they’re correctly auto-approved.
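
The flagging step described above can be sketched as a set of rules that either auto-approve a business or return the reasons it needs manual review. The field names are hypothetical, and the high-risk list contains only the two examples named in the text; a real provider's rules would be more extensive.

```python
from dataclasses import dataclass

# Only the two example industries named in the text.
HIGH_RISK_INDUSTRIES = {"cannabis", "adult entertainment"}

@dataclass
class VerificationResult:
    # Hypothetical fields summarizing a provider's checks for one applicant.
    has_sos_filing: bool
    name_matches: bool
    address_matches: bool
    person_matches: bool
    industry: str
    ofac_hit: bool

def review_reasons(r: VerificationResult) -> list[str]:
    """Return reasons for manual review; an empty list means auto-approve."""
    reasons = []
    if not r.has_sos_filing:
        reasons.append("no SoS filing")
    if not (r.name_matches and r.address_matches and r.person_matches):
        reasons.append("mismatched name, address, or person")
    if r.industry in HIGH_RISK_INDUSTRIES:
        reasons.append("high-risk industry")
    if r.ofac_hit:
        reasons.append("OFAC match")
    return reasons

clean = VerificationResult(True, True, True, True, "restaurant", False)
risky = VerificationResult(True, True, False, True, "cannabis", False)
```

Here `clean` would be auto-approved, while `risky` is flagged for both an address mismatch and a high-risk industry.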

The Pros:

The middle path: A single provider partnership can allow you to maintain some internal control while building a customized solution that works for you with just one dedicated partner.
Focus on core competencies: By outsourcing the majority of your compliance needs to a third party, you can focus on your core services and strengths.
A tight partnership: One service and data partnership allows you to work closely with just one team, learn to trust each other, and establish strong communication.
Reduce overhead and costs: With more auto-approvals, you can spend less time onboarding and onboard more clients. Companies using Enigma as their sole KYB provider are estimated to reduce KYB costs by 80%, for example.

The Cons:

The middle path: With a single provider partnership you’re losing out on the control you’d have by solely working in-house as well as the broader scope of data you’d have access to with multiple waterfalled data partners.
Onboarding: With a new partner, you will have to spend time onboarding to a new platform and getting your team acquainted with new tech and new data.

Waterfalled data providers and orchestration platforms

The What:

Some FIs work with multiple data providers via an orchestration platform to meet their KYB compliance goals, “waterfalling” multiple datasets into their auto-approval process. Typically, an FI using multiple data sources does so with the help of a third-party data aggregation platform such as Alloy or Oscilar that integrates multiple data providers’ data into one singular KYB decisioning endpoint. The waterfall of data providers used by these platforms is usually based on both costs and approval times (e.g. latency), assuming the accuracy of all providers’ data is similar.

The How:

Invest: Invest in a third-party data aggregation platform that uses multiple sources for auto-approvals for FIs. Once again, FIs also have to find a solution for manual approval either in-house or with a partner.
Verify Businesses: The data platform auto-approves businesses and flags risky businesses, businesses without SoS matches, and businesses on the OFAC list. To do this, the platform attempts to verify identity using one data provider. If this provider can’t automatically match the business, it is then passed onto the next data provider (and so on and so forth). Typically, a multiple-provider platform will have higher match rates for auto-approvals and more data on risky activities. Businesses that aren’t auto-approved are then manually approved.
Verify UBOs: The data platform auto-approves UBOs, with the help of matched data from multiple data providers. KYB requirements allow FIs to trust self-reported UBO information from businesses, unless they have reason to doubt it. One area where an FI might have doubt, for example, is when an owner name is present on an SoS filing but is different from the owner name on the business application. Additionally, the data platform also screens UBOs on the OFAC list.
Monitor Businesses Over Time: In partnership, the FI and data platform work together to confirm business legitimacy over time. Once again, an FI’s waterfall data platform – with multiple data providers – will check the status of SoS registrations periodically, re-screen for risky activities periodically, and re-screen against the OFAC list periodically.
Establish Trust in External Provider: Similarly to FIs who partner with one provider, an FI waterfalling multiple sources can conduct checks on each of those individual sources monthly to confirm auto-approval legitimacy.
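
The waterfall itself is a simple fall-through loop. In this sketch each provider is modeled as a lookup function that returns a filing ID or `None`; the provider names, sample data, and result format are hypothetical, and a real platform would also run risk and OFAC checks before approving anything.

```python
from typing import Callable, Optional

# Each provider is modeled as a function returning a matched filing ID or None.
Provider = Callable[[str], Optional[str]]

def provider_a(business_name: str) -> Optional[str]:
    known = {"Joe's Pizza LLC": "SOS-001"}  # made-up data
    return known.get(business_name)

def provider_b(business_name: str) -> Optional[str]:
    known = {"Northside Plumbing Inc": "SOS-002"}  # made-up data
    return known.get(business_name)

def waterfall(business_name: str, providers: list[tuple[str, Provider]]) -> dict:
    """Try providers in order and stop at the first one that finds a match."""
    for provider_name, lookup in providers:
        filing_id = lookup(business_name)
        if filing_id is not None:
            return {"status": "matched", "provider": provider_name,
                    "filing_id": filing_id}
    # No provider matched, so the business goes to manual review.
    return {"status": "manual_review", "provider": None, "filing_id": None}

# Order is a business decision, e.g. cheapest or lowest-latency provider first.
ordered = [("provider_a", provider_a), ("provider_b", provider_b)]

first = waterfall("Joe's Pizza LLC", ordered)
second = waterfall("Northside Plumbing Inc", ordered)
unmatched = waterfall("Unknown Ventures", ordered)
```

Because the loop stops at the first match, later providers are only called (and paid for) when earlier ones fail.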

The Pros:

Maximum coverage: The more data sources you incorporate into your KYB process, the more coverage you’ll have for auto-approvals of the businesses you’re working with.
Further reduced overhead and costs: Like partnering with a single provider, data waterfalling allows you to auto-approve more businesses – and thus onboard more customers, more quickly. Waterfalling, however, multiplies these auto-approvals and savings. Enigma, for example, cuts costs an extra 50% for FIs already using a provider.
Adaptable: Using a system built to incorporate multiple providers and accommodate changes to KYB legislation will help you adjust your KYB process as needed over time. You can easily incorporate new data sources as your needs or laws change.
Goes beyond KYB: An upfront KYB check can also help with fraud checks, risk checks, and underwriting. In the long term, it could be beneficial to set up infrastructure that enables the integration of different data sources to tackle different data needs.

The Cons:

Multiple parties to work with: When waterfalling multiple data sources, you’ll need to work to establish communication and trust with multiple providers. You’ll potentially lose control and customization as you incorporate more providers.
Overkill: If you have relatively simple KYB needs that pull from only 1-2 data sources, you may not need a tag team of multiple data sources to meet your needs.

How to Choose the Right Customer Onboarding Solution For You

While sifting through the pros and cons of each of these solutions may feel overwhelming, the key to choosing the right solution for you is simply thinking through both your goals and your needs.

Want absolute control over KYB? It might make sense to keep your process completely internal. Want to have a close partner to guide you through a messy legislative environment? It might make sense to partner with just one service and data provider for all your KYB needs. Want to maximize auto-approvals with the highest business coverage you can find? It might make sense to waterfall multiple providers into your process.

If partnering with one data provider or waterfalling multiple providers seems like the right fit, Enigma KYB can help you jump-start or streamline your current KYB process.

Curious to learn how you can instantly approve more businesses, while reducing your KYB costs? Learn more.

What is KYB and How Do You Meet KYB Requirements?

Enigma — Tue, 08 Aug 2023 00:00:00 GMT

What is Know Your Business (KYB) and its History?

What is KYB?

Know Your Business (known as KYB) refers to a financial institution's duty to verify a company’s and company owner’s identity before doing business with them. The goal of identity verification is to avoid working with any companies involved in money laundering, fraud, or other financial crimes. KYB regulations are similar to Know Your Customer (KYC) regulations, but instead ask financial institutions to verify the legitimacy of both businesses and their Ultimate Beneficial Owners (UBOs), rather than a named, single person.

In this guide to KYB, we’ll look at the history of KYB legislation – and where we’ve arrived today, who has a duty to comply, and how KYB requirements can be met.

KYB History and KYB Today

KYB requirements are rooted in U.S. anti-money laundering laws, which have been evolving since 1970, when the Bank Secrecy Act became the first anti-money laundering law in the U.S. The Bank Secrecy Act set the foundation for financial institutions’ duties to help the government find and stop money laundering.

In response to a major money laundering scheme involving an international bank, Congress enacted the Annunzio-Wylie Anti-Money Laundering Act (the “Act”) in 1992, which, among other things, established customer verification and recordkeeping requirements specific to wire transfers and authorized the U.S. Treasury to require financial institutions to file suspicious activity reports (“SARs”). Over the next half dozen years, Congress and regulators passed a number of laws and rules to further strengthen U.S. anti-money laundering rules.

In 2001, in response to the 9/11 attacks, the government passed the Patriot Act, which established modern KYC compliance rules, and assists in helping businesses avoid terrorist financing.

In 2016, further Know Your Business rules were established, in part a response to the Panama Papers to make it harder for criminals to hide criminal activities in shell companies. FinCEN established the Customer Due Diligence (CDD) Rule in 2016 to identify a customer’s ultimate beneficial owner or owners and not their nominees or ‘‘straw men.” The Rule now mandates that financial institutions doing business with other businesses must verify the identity of these businesses and anyone with ultimate beneficial ownership (e.g. more than a 25% stake in the business) and at least one executive officer (CEO, CFO, managing director, etc.).

As such, foundational KYB legislation is relatively new and continued legislation has been changing the KYB landscape – even as recently as the past few years. In 2022, FinCEN established a registry requiring certain domestic and foreign corporations, limited liability companies (LLCs), and similar entities to file a report with the federal government identifying the entity’s ultimate beneficial owner, to come into effect January 2024.

The obligations to know your customer extend beyond financial institutions. For instance, the INFORM Consumers Act, passed this year, requires e-commerce platforms to verify the identity of high-volume third-party sellers on their platform.

The rapid evolution of legislation can make it difficult for financial institutions to keep up with their compliance requirements.

Challenges Amid A Changing KYB Landscape

In addition to rapidly changing legislation, there are numerous other challenges that can make KYB compliance difficult. If you choose to do KYB in-house, you’ll face challenges like:

Long manual processing times
Lost customers due to long processing times
Lack of comprehensive data on all customers

If you choose to work with traditional KYB vendors, difficulties could include:

High costs of implementation and data
Low auto-approval rates due to lack of data coverage

Wondering if you have to comply amid this changing legislation and challenging environment? If you’re a financial institution, the short answer is yes.

Who Needs to Conduct Know Your Business (KYB) Processes and How Can They Meet Requirements?

Who Has to Comply

Institutions that have to comply with KYB regulations include:

Banks
Securities brokers
Mutual funds
Fintechs
Futures commission merchants
Payment providers
Marketplaces
Other financial institutions

There is good news, however. Know Your Business compliance can provide a competitive advantage to institutions willing to invest in business verification data. You need to know information about a business’s industry, legitimacy, revenues, and owners so you can onboard with confidence, tailor marketing to clients that matter, and retain customers’ business more easily.

How Do You Comply With Know Your Business (KYB) Requirements?

If you’re a financial institution working with other businesses it’s critical to meet the CDD Rule’s mandate to verify a business, verify that business’s managers and owners, and monitor and track risk of that business over time. The first step in meeting KYB requirements is collecting data on a business, including:

A business’s name and aliases
A business’s addresses
A business’s proof of active registration
Whether or not it conducts business in a high-risk activity
Whether it belongs on a sanctions list

Simultaneously, you must collect data to help you understand the legitimacy of the business’s owners – anyone with a greater than 25% stake in the business – and one executive officer. You’re looking for:

The owners' and executive officer’s names, dates of birth, addresses and social security numbers/tax identification numbers
Any individuals on crime or sanctions lists
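
The data points above can be sketched as a small record type plus a completeness check. This is a minimal illustration, not a compliance implementation: the class names, field names, and the OWNERSHIP_THRESHOLD constant are hypothetical, and the threshold simply mirrors the "greater than 25% stake" wording used in this guide.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: mirrors the "greater than 25% stake" wording in this guide.
OWNERSHIP_THRESHOLD = 0.25


@dataclass
class Person:
    name: str
    ownership_share: float = 0.0       # fraction of the business owned, 0.0 to 1.0
    is_executive_officer: bool = False
    date_of_birth: str = ""
    address: str = ""
    tax_id: str = ""                   # SSN or TIN
    on_crime_or_sanctions_list: bool = False


@dataclass
class BusinessRecord:
    name: str
    aliases: List[str] = field(default_factory=list)
    addresses: List[str] = field(default_factory=list)
    has_active_registration: bool = False
    high_risk_activity: bool = False
    on_sanctions_list: bool = False
    people: List[Person] = field(default_factory=list)


def people_to_verify(business):
    """Owners above the threshold, plus the first listed executive officer."""
    selected = [p for p in business.people
                if p.ownership_share > OWNERSHIP_THRESHOLD]
    officers = [p for p in business.people if p.is_executive_officer]
    if officers and officers[0] not in selected:
        selected.append(officers[0])
    return selected


def kyb_issues(business):
    """Return human-readable gaps or red flags that need manual review."""
    issues = []
    if not business.addresses:
        issues.append("missing business address")
    if not business.has_active_registration:
        issues.append("no proof of active registration")
    if business.high_risk_activity:
        issues.append("high-risk activity")
    if business.on_sanctions_list:
        issues.append("business on sanctions list")
    selected = people_to_verify(business)
    if not any(p.is_executive_officer for p in selected):
        issues.append("no executive officer identified")
    for p in selected:
        if not (p.date_of_birth and p.address and p.tax_id):
            issues.append(f"incomplete identity data for {p.name}")
        if p.on_crime_or_sanctions_list:
            issues.append(f"{p.name} on crime or sanctions list")
    return issues
```

An empty list from kyb_issues means nothing in this toy check blocks auto-approval; any entry would route the application to a reviewer.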

How Enigma Can Help You With Know Your Business Requirements

Know Your Business – the process of confirming the identity and legitimacy of a business and its owners – has many challenges, from constantly changing legislation to the high financial and time costs of verifying businesses in-house.

Enigma KYB can help you jump-start or streamline your current identity verification or know your business process – while better understanding the businesses you’re onboarding. Enigma verifies businesses instantly: our state-of-the-art matching algorithm helps us match and verify more businesses, enabling 1.5x higher auto-approval rates than other leading providers.

Curious to learn how you can instantly approve more businesses, while reducing your KYB costs? Contact us or read our blog on building out KYB processes and business onboarding.

Dev/Stage/Prod is the Wrong Pattern for Data Pipelines

Enigma — Wed, 02 Aug 2023 00:00:00 GMT

Sometime in the mythical past, we software developers learned that testing in production was a bad idea. We developed a pattern of testing and deployment based on dev/stage/prod environments (or some variant of these concepts):

dev to integrate our changes
stage to sanity check that it all worked together
prod to make it available to customers

The primary benefit of maintaining distinct dev/stage/prod environments is isolation. We can test the impact of our changes in one environment without affecting the one our customers use.

However, for data pipelines of more than moderate complexity, achieving isolation via dev/stage/prod environments leads to predictable problems that we’ll explore below. Fortunately, there is a different pattern that we call a pipeline sandbox that solves these problems.

First, let’s see where the dev/stage/prod pattern breaks down for data pipelines. The key characteristic that makes a data pipeline different from other applications is the time and expense it takes to test. If the pipeline takes multiple hours and thousands of dollars to run, it may take multiple hours to uncover an issue with a change. The place where that change manifests itself may be very far downstream from where we made the change. Other problems include:

When multiple changes are tested at the same time, it is often difficult to attribute effects we see in the data to a particular change. Diagnosing the specific cause of data issues becomes so difficult that it can paralyze teams.
While the data pipeline is running, it effectively freezes out other developers from applying their changes. Changing logic while the pipeline is running also risks contaminating a measurement your teammate is planning.
Data state is difficult to manage in dev/stage/prod environments. The data sets in a pipeline may run into the hundreds of terabytes. Synchronizing data across environments is often necessary for accurate tests on a non-prod environment. Establishing the correct data state at scale is both time-consuming and logically complex.

An alternative to dev/stage/prod is a pipeline sandbox. A pipeline sandbox is a fully functional and fully isolated version of the data pipeline whose logical state corresponds to a git branch.

For a developer working in a pipeline sandbox, the system behaves exactly as if they were developing on the production pipeline except:

Changes applied on their git branch are only applied to their sandbox
The data state is a writable snapshot of a specific version of the prod pipeline

This gives the developer confidence that:

The results they observe on their sandbox are consistent with prod
Any unexpected results are caused by a change they introduced

Assuming we’ve built an efficient data infrastructure, each developer can create a pipeline sandbox in minutes. This relieves developers of the concern that they are interfering with each other's work during development. After a developer validates their change on their pipeline sandbox they can merge their branch into main and apply to the production pipeline.

Pipeline sandboxes also reduce costs. Developers no longer copy the upstream data for every data state that they create; instead they simply generate a parallel version of the data state from a certain point in time that they can overwrite with their changes.
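
The copy-on-write idea can be illustrated with a toy in-memory model. This is a sketch only, assuming datasets are keyed by name in a dictionary; the Sandbox class and the dataset names are hypothetical and stand in for what a data versioning layer would do at scale.

```python
class Sandbox:
    """Toy copy-on-write view over a snapshot of the prod data state."""

    def __init__(self, prod_snapshot):
        # Shallow copy: pins which dataset versions this sandbox sees
        # without duplicating the underlying data.
        self._base = dict(prod_snapshot)
        # Only datasets written inside the sandbox are stored here.
        self._overlay = {}

    def read(self, name):
        # Prefer the sandbox's own version; fall back to the snapshot.
        if name in self._overlay:
            return self._overlay[name]
        return self._base[name]

    def write(self, name, rows):
        # Writes never touch the snapshot or prod.
        self._overlay[name] = rows

    def changed(self):
        # Names of datasets this sandbox has overwritten.
        return sorted(self._overlay)


# Hypothetical prod state: dataset name -> immutable rows.
prod = {"raw_businesses": ("acme", "globex"), "revenue": (100, 250)}

sandbox = Sandbox(prod)
sandbox.write("revenue", (100, 300))   # the change under test
```

Reads of untouched datasets fall through to the snapshot, so the only storage a sandbox adds is for the datasets it actually rewrites, and sandbox.changed() tells the developer exactly which outputs differ from prod.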

When changes tested in different sandboxes run together, we may still discover issues because of unexpected interactions among the changes. But this is both a more tractable problem and less likely to occur because we’ve already tested the individual changes in isolation. We also still have the sandbox available to help with our investigation.

A pipeline sandbox generally consists of the following system components:

A git branch that contains the complete logical state of the pipeline
A DAG in the workflow orchestrator that translates this logical state into executable instructions (often these are run in a separate distributed compute environment)
A writable snapshot of the production data

While creating pipeline sandboxes may sound daunting, great tools have become available in recent years to make this easier. Two of our favorites are:

LakeFS provides a git-like abstraction over a data lake and makes it easy to efficiently create data branches, which are copy-on-write versions of a pipeline’s data state.
Dagster is a next-generation workflow orchestration tool that offers both isolated DAGs and data-centric orchestration semantics.

The biggest challenge in creating pipeline sandboxes is consolidating the logical state of the pipeline in a single git repo. Frameworks like dbt that provide consolidated logical state are loved by their users. If, like us, you’re not using a framework that consolidates logical state, it’s more challenging to consolidate pipeline state, but still entirely feasible (we’ll cover this in the next blog post).

Although they’re not the right pattern for data pipelines, dev/stage/prod environments still have their place in our data architecture toolkit. Components of the data infrastructure that we manage – such as workflow orchestration environments, metastores, etc. – benefit from dev/stage/prod environments to safely manage upgrades. The distinction is that these are infrastructure components rather than components that define the semantics of our data pipeline.

Empowering developers with pipeline sandboxes required significant investment – particularly the task of consolidating our logical state into a single git repo. That investment has paid off in terms of higher team velocity and developer satisfaction. Prior to adopting pipeline sandboxes, roughly half our team’s bandwidth was expended dealing with environment issues. We’re now fully focused on improving our business logic, ML models, and performance optimization.

Why B2B Marketing Data Platforms Fall Short When Targeting SMBs

Enigma — Wed, 31 May 2023 00:00:00 GMT

With so many B2B marketing products out there, why are so many marketing, sales and customer success teams unhappy with the options? We interviewed dozens of teams and found it came down to a single reason: most B2B offerings aren’t made for marketers targeting small-and-medium-sized businesses (SMBs).

SMB marketers have long been bucketed into general B2B marketing, or even consumer marketing. However, given the unique challenges of marketing to SMBs, we believe it deserves its own category: something we’ve described as “B2SMB” sales and marketing.

That’s why Enigma recently launched a Sales and Marketing data platform for teams targeting SMBs. In this post, we’ll dive into three major challenges that have plagued B2SMB sales and marketing - a data challenge, a marketing challenge, and a business challenge - and how recent changes in the SMB ecosystem are enabling sales associates, customer success members, and marketers to meet these challenges.

The data challenge: SMB diversity mixed with a lack of data

SMBs are defined by diversity - in size, in industry, and in structure. Enigma clients targeting small businesses mentioned diversity in:

Financial size and structure: An SMB can have $10,000 or $10,000,000 in revenues, a 1,000x difference in the size of the business and its potential value as a customer.
Industry and sub-industry: SMB diversity is also seen in the variety of industries and sub-industries. Highly granular data is needed to differentiate among businesses with drastically different value and conversion likelihoods. For example, one card processing sales organization was extremely frustrated with marketing data that only got to the granularity of “sit-down restaurant.” Since they had several staff that could speak Thai, their specific target was actually Thai restaurants.

SMB marketers are faced with a dilemma: they have a strong need for data to help sort diverse SMBs, yet there is sparse intelligence on small businesses. Additionally, these teams can’t take a traditional enterprise approach (e.g., assign a rep to each individual lead and customize every deal), because there are so many small businesses.

Data Development: Data to enable differentiation and intelligence on SMBs

A variety of tech trends have converged to make data on SMBs more dynamic and easier to collect. This data, in turn, has enabled financial services companies to segment the highly diverse SMB market. Data sources include:

SMB-owned websites and third-party reviews: With more web information on SMBs, it’s become easier to size and classify SMBs using information on an SMB’s website itself, as well as third-party information like the number of reviews.
Open banking: Open banking, pioneered by the likes of Finicity and Plaid, lets businesses opt in to sharing banking and financial information by entering their bank account login and password. This, in turn, enables financial services companies to collect detailed information on each business.
Credit card growth: To supplement permissioned open banking data, financial institutions can use credit card revenues and growth as a proxy for size for many industries. When combined with available web information, credit card data can help sales and marketing teams understand the size, financials, and contacts of the SMBs they are targeting.

Enigma Sales and Marketing sits at the intersection of web data, credit card data, and hundreds of additional private and public sources. Card data, in particular, makes understanding SMB size and health easier: McKinsey found that 82% of Americans used some form of digital payment in 2021, up significantly from 72% in 2016. As consumers use these payment types, businesses adapt to changing preferences. Industries that previously didn’t accept credit or debit card transactions – like construction or HVAC – now see increasing card penetration, while the share of cards is increasing within card-dominant industries.

The marketing challenge - Spending on SMB-first marketing is costly and reliant on major digital ad players

SMB marketing costs are rising, driven by a traditionally heavy reliance on major digital ad players. A Profitwell study found, for example, that customer acquisition costs (CAC) increased 60-75% for B2C and B2B businesses from 2014 to 2019.

As costs rise, marketers are paradoxically being asked to grow while spending less. In an August 2021 CMO survey, 59% and 45% of marketers reported growing pressure from CEOs and CFOs, respectively, to prove results. With ever-decreasing ROI from the major digital ad platforms, marketers are searching for alternative means of acquiring leads.

Marketing Development: Direct sourcing for SMB leads

To spend less for better results, marketers need to invest in direct sourcing of SMB leads. This includes both purchasing data and investing in data-driven tools to improve ROI, CAC, and program efficacy. Data that includes contact information, near-real-time revenues, buying intent, and industry granularity can help marketers get ahead.

With this data, sales and marketing teams no longer need to be solely driven by referrals or web leads through high digital ad spend. Instead, they can optimize tactics and outreach.

The business challenge - SMBs have been overlooked as a category and as a diverse set of verticals

Small-and-medium-sized businesses (SMBs) – defined as independent businesses having fewer than 500 employees – represent a significant market, with approximately 33.2 million small businesses in the United States accounting for 99.9% of all businesses in the country. However, the World Bank and the International Finance Corporation (IFC) estimate that there is about a $5.2 trillion global finance gap between what SMBs need and the amount they are actually able to get through traditional loans.

Business Development: Explosion in financial services that are SMB-specific

The success of SMB-focused financial services has proven that seemingly small sections of the SMB market can represent giant market opportunities. Square, for example, has become a behemoth in card processing and payment facilitation due to its relentless focus on the smallest of merchants, i.e. so-called “micro-merchants” earning less than a million a year. By making it free and painless to set up a card-accepting terminal with nothing more than a mobile phone, Square became a $10B+ company.

Within SMBs, vertical SaaS has shown that industry-by-industry focus can build highly successful companies.

According to Fractal, more vertical SaaS companies went public in Q1 to Q3 of 2021 than in any prior year. From Toast in restaurants to Phorest in hair salons, many of these companies are targeting specific SMB verticals to serve unique needs from a seemingly small portion of the SMB economy.

In part, the growth of these SMB-focused platforms is due to growth in data surrounding SMBs that comes from companies like Enigma. However, the open banking ecosystem, as well as data from card processing, has also helped enable merchant cash advances.

A data tool built for you

The number of successful financial services businesses serving SMBs has increased overall demand for partnerships. The expense of off-the-shelf SMB marketing has pushed financial services firms to search for new solutions. The explosion of available data has made a solution for sales and marketing teams targeting SMBs possible.

Enigma’s newest product – Sales and Marketing – is a comprehensive data platform to help you target, acquire, and up-sell SMBs in your ideal customer profile.

Learn more here

Four Things We Learned from Author and Former-SBA-Administrator Karen Mills About SMBs

Enigma — Tue, 30 May 2023 00:00:00 GMT

Enigma gathered top female financial services executives from companies like Citizens Bank, Comerica Bank, and Amazon for an intimate round table discussion with author, HBS Senior Fellow, and Obama-Era SBA Administrator Karen Mills on the opportunities and challenges in the small-and-medium-sized business (SMB) economy. Despite the fact that “over half of the people who work in the US economy own or work for a small business,” Mills asserted that she “often had to pound the table in the West Wing to make sure the voice of small business was heard.”

We wanted to share four of the things we learned during the discussion from Mills about the challenges – and opportunities – in the SMB economy.

1. Why SMBs are under-appreciated and underserved

SMBs face a lending gap where they are less likely to receive traditional bank loans compared to larger businesses. Mills suggests the reason for this is twofold: information opacity surrounding SMBs and heterogeneity of SMBs. Information opacity, explained Mills, refers to the inability of lenders to access accurate and complete information on SMBs such as full-year revenue statements or operating status. Heterogeneity, meanwhile, is the idea that small businesses are diverse – in revenue, industry, and operating model – making it hard to have one lending model that fits their various needs.

Mills suggested that these two problems can be addressed when alternative data and technology intersect with small business lending for a solution she calls “The Small Business Utopia.” Under this Utopia, more financial institutions would have access to data like Enigma’s – near-real-time revenues, event-based triggers, payment processing information, etc. – that would form a more accurate picture of the financial health of the business borrower and lead to better and faster credit decisions.

“I've been very pleased to be connected to Enigma because they were early as a data aggregator,” said Mills. “People are using [this data] in various ways which turn out to be predictive.”

2. How the SMB landscape has changed since Mills first published her book in 2018

In 2018, Mills released the first version of her book, Fintech, Small Business & The American Dream: How Technology is Transforming Lending and Shaping a New Era of Small Business Opportunity. She is currently writing the second edition, reflecting on what will change the future of small business lending, including:

Technological advancements: AI, automation, and machine learning
Business health data: Revenues, fraud, technographics, etc. from providers like Enigma
Innovation entrepreneurs: Fintech innovation and disruption of traditional players

The new edition is scheduled for release in early 2024.

3. Which businesses are falling through the cracks and how can we reach them

According to Mills, the problems facing SMBs are most amplified for the smallest of small businesses who are early in their development and need technical assistance across loans, payments and more. “The very smallest and the hardest to reach entrepreneurs are the ones that are falling through the gaps,” said Mills.

Mills thought data and technology could be one solution for banks to make these sorts of loans more quickly. However, Mills warned data alone is not enough: local relationships and community banks in combination with tech and data will be key.

4. How automation is affecting which businesses are getting loans

Mills was particularly interested in how automation – where trained computers rather than humans make loans – could help or hurt small businesses in lending decisions. One concern Mills pointed out was that training a computer on existing data could lead to an algorithm that reflects the biases in the current lending environment.

She referred to an important recent study looking at Paycheck Protection Program (PPP) loans – to which Enigma contributed data – spearheaded by NYU Professor Sabrina Howell. In the study, process automation was found to “reduce the racial disparities in credit access through enabling smaller loans, broadening banks’ geographic reach, and removing human biases from decision-making.” PPP loans, Mills pointed out, don’t have a credit screen, and when they were given out through automation rather than manually, distribution became more equitable.

Ultimately, Mills has mixed thoughts on automation: “I end up with these two sides of the equation. One possible scenario is that ‘AI will be bad and automation will be bad because it will be built on past biases.' But also ‘that automation can take away bias,' as we saw in the recent research on PPP," said Mills. “Those are the two conflicting scenarios out there.”

INTERESTED IN LEARNING MORE ABOUT THE DATA LANDSCAPE SURROUNDING SMBS? CHECK OUT ENIGMA’S REPORT.

How Alternative Data Can Improve Credit Risk Decisioning

Enigma — Tue, 30 May 2023 00:00:00 GMT

“If your data is more accurate, especially on financial metrics, you'll have better underwriting decisions, fewer delinquencies and loss ratios and be able to approve more,” said Charles Zhu, VP of Product at Enigma. Zhu joined Ziv Shabat, VP of Analytics at Noble in a conversation on the key challenges that credit teams face today when underwriting small-and-medium-sized businesses (SMBs). They discussed why accurate data is hard to find, how non-proprietary data can aid accuracy, and more.

Why accurate data is important and why it’s so hard to find

Zhu said having good data on financial metrics was critical for many onboarding steps for credit risk teams, from improving underwriting accuracy to decreasing delinquency. But he added that good data can be hard to find.

“A lot of folks are hesitant to share their data, there's not a lot of data on small businesses. It’s a much sparser world of data as compared to, say, the consumer side of things,” said Zhu.

Zhu pointed out that many credit risk decision-makers are only given three months of data when an SMB applies. This data doesn’t show how that SMB changes over time and, additionally, could paint an inaccurate portrait of the SMB’s performance in other seasons. A fireworks stand, for example, is likely to have higher revenues in the months surrounding July 4th than in the fall, when no seasonal holiday is boosting demand.
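
To make the seasonality point concrete, here is a small arithmetic sketch. The monthly revenue figures are invented for illustration (they are not Enigma data), and annualize is a hypothetical helper that naively scales a short window up to a full year.

```python
# Hypothetical monthly card revenue for a seasonal business, Jan..Dec, in dollars.
monthly_revenue = [0, 0, 0, 0, 2_000, 30_000, 45_000, 1_000, 0, 0, 0, 2_000]


def annualize(window):
    """Naively project a full year from a short window of monthly figures."""
    return sum(window) * 12 / len(window)


actual_year = sum(monthly_revenue)               # what the business really made
summer_view = annualize(monthly_revenue[4:7])    # lender sees May through July
fall_view = annualize(monthly_revenue[8:11])     # lender sees Sep through Nov

print(actual_year, summer_view, fall_view)
```

With these made-up numbers, the summer window projects 308,000 against an actual 80,000, while the fall window projects zero, so the same business looks either far too healthy or dead depending on when it applies.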

Even if you decide you want to onboard a new client, the data permissions and integration process can lead to drop off.

“The second you ask for someone's bank account login or some kind of QuickBooks login – especially for a small business – we see typically a huge drop off in SMB interest in the lending product right there,” said Zhu. “50% to even 90% of folks in an onboarding flow will just drop off when you ask for information like that.”

Zhu also addressed the fact that even after a company is onboarded, outdated data can make portfolio management trickier. You may have had accurate data when the company applied, but when you’ve been working with them for 1 to 2 years during portfolio management and monitoring, revenue and growth may have changed. This makes it difficult for you to adjust credit lines.

How non-proprietary data can aid accuracy

Third party or non-proprietary data, explained Zhu, can serve as an orthogonal signal – i.e. data that shows a different part of the picture than what businesses opt-in to showing you – and complement preexisting internal data sources. This is important, added Shabat, so that you can see beyond your data or a too-narrow common consensus built from a limited set of traditional big data sources or your internal sources.

“If you're built only on proprietary data, you have a sort of a tunnel vision of the world,” said Shabat. “You only see what you see, and you don't know what you don't know…seeking out non-proprietary data really expands that [knowledge].”

Shabat further explained this idea via an example: a customer may have never missed any payments to you, but has missed payments elsewhere. Having data about that business’s behavior elsewhere could serve as a bellwether of that customer’s future behavior with you, or prompt you to have a conversation with them for more confidence.

“It's important to diversify the kind of [data] sources that you're using, balancing between both your information – which is accurate and best fits your business – as well as the external indicators that help you grow your business and keep it healthy from macro economic events that you're not aware of,” said Shabat.

Zhu added that this sort of investment across a wider range of data sources can also help with fraud prevention.

“If you're only using proprietary data or really what the business is self reporting, we're seeing fraudsters get increasingly sophisticated, and showing you that they're very healthy,” added Zhu. “But the next day, all of a sudden, you're out half a million dollar loan.”

Data landscape today

Zhu and Shabat then dove into today’s data landscape – what data is out there for lenders and how can they access it?

Traditional data resources include:

Personal Information from businesses/owners themselves
Credit Bureau data
Small Business Financial Exchange data

Alternative data sources include:

Permissioned data
Pre-permissioned data

Permissioned data, explained Shabat, is where you as a lender need some form of permission or action from your customer to receive that information. Plaid, Netsuite, Codat and Rutter operate on a permissioned data model.

Enigma, however, provides pre-permissioned data with the hope of transforming the lending landscape so healthy businesses can get access to credit. No data provider can currently capture total revenues and cash flow in a truly pre-permissioned way, but Enigma can help merchant cash advance lenders with card revenue transaction data that covers nearly every card-accepting business in the country. Before a business is onboarded to a card processor, we can give a better sense of what an appropriate merchant cash advance (MCA) would be.

“Pre-permissioned data also brings the value of objectivity,” said Shabat. “You're not asking your client or your customer for that information. Once you ask for permission you are giving some power to your customer to show you what kind of image they want to show.”

If they have several bank accounts, for example, they can connect the one with the most cash or greatest cash flow and hide bad debts.

Challenges implementing new data sources

While alternative data can help lenders expand their SMB universe and answer more questions about current and future clients, lenders may worry about finding ways to add this data into pre-existing workflows to derive value from it.

“It’s critical to find a partner who can provide some kind of service, some kind of data science to help operationalize and implement the data for you,” said Zhu. “Every customer's portfolio looks different, every financial product is different, and you want some kind of ability to customize that data and develop some kind of score that's appropriate for you.”

Additionally, lenders may worry about finding value with non-proprietary data. All the data in the world isn’t useful if it’s overwhelming or too hard to understand. Zhu says the answer to this challenge is transparent data partners.

“Ideally, you understand why the score is working, and the underlying variables that are driving the score,” said Zhu. “Being able to map some of these model scores and model variables onto heuristics and rules becomes really important, because that enables organizations to understand why this is actually working.”

Opportunities in alternative data

While lenders may find challenges in the alternative data space, there are also opportunities. First off, alternative data can be used across your organization including in growth initiatives and marketing campaigns.

“You don't just have to use [pre-permission data] for risks, but you can actually use it for sales and marketing,” said Zhu. “We see a ton of applications like pre-qualifying the healthiest businesses or giving special offers for the healthiest businesses.”

Alternative data can also help you to better define your ideal customer profile (ICP). More information beefs up your knowledge of economic factors outside your immediate team, sector, or organization.

“When targeting high performing industries and verticals, I think a lot of lenders will sometimes look across all kinds of businesses,” said Zhu. “But with pre-permissioned data, you can start targeting the verticals that are performing in the stagflationary economy the best.”

This article is based on a webinar presented by Enigma

Introducing Enigma Sales and Marketing

Enigma — Wed, 10 May 2023 00:00:00 GMT

Enigma launched its first data platform for sales and marketing teams a year ago. Our customers eagerly adopted it, using the product across pre-qualification, prospecting, and lead prioritization use cases, resulting in increased conversion rates and new revenue. But we knew our data could still do more to help marketing and sales teams.

We spent months closely collaborating with our customers to investigate the data challenges faced by growth teams trying to engage small and medium businesses (SMBs). We’re now launching Enigma Sales and Marketing with new features and data attributes to improve the entire SMB customer acquisition and growth process.

How Enigma’s Sales and Marketing data platform can help you

Enigma Sales and Marketing provides accurate, timely data to help teams improve SMB targeting, acquisition, and retention.

Sales and Marketing helps you define your ideal customer profile (ICP) and target that segment, through unique data attributes like monthly revenues, revenue growth, granular industry, payment technologies used, and more. When you target the right businesses, you reduce CAC and wasted spend.

In inbound marketing, Sales and Marketing helps you automatically qualify, onboard, and score SMB leads with a real-world view on size and value. This saves time and removes friction for your best prospects, driving revenue growth. Early customers saw a 7x improvement in accuracy in predicting conversions on inbound leads.

Sales and Marketing also helps outbound efforts with features that help you identify, get alerted about, and maximize conversions with your top accounts from a database of 30 million SMBs. With data, you stop wasting resources on businesses that aren’t a fit or are unlikely to convert. One of our customers saw a 2-3X lift in new account conversion from outbound efforts with Enigma.

When managing accounts, Sales and Marketing helps you to prioritize and grow your most important existing accounts with a real-world understanding of a customer’s financials. You can identify which accounts are likely to grow or churn with card revenue, growth, share of wallet, and other data. Additionally, you can allocate your relationship managers to the most important accounts and drive more revenue through cross-sell and upsell.

Enigma’s Sales and Marketing data platform top features

Real world revenues and growth data: Built from our proprietary panel covering more than 40% of U.S. debit and credit cards, Enigma provides card revenues, revenue growth, average transaction size, refunds, and other transaction-level intelligence.
Contact data: Titles, emails, LinkedIn profiles, phones, and addresses for five contacts per business to help you reach the SMB decision-maker you need.
Trigger events: Maximize marketing campaigns’ conversion rates with the ability to build lists and segments based on key financial events in a business’s lifecycle.
Payment processing technographics: Enigma helps you learn what payment technologies the SMBs you care about are using, both for in-person POS and online.
Operating status: Easily identify businesses confirmed to be actively processing sales and earning revenue to optimize your marketing programs.

Want to learn more? Download a sample data pull, tailored to industries that matter to you, or watch our webinar, "How to Elevate Your SMB Programs with Enigma's Sales and Marketing Platform".

How Data Can Help Sales, Customer Success, and Marketing Teams Manage SMB Accounts

Enigma — Tue, 09 May 2023 00:00:00 GMT

Since the beginning of the year, the Enigma team interviewed over 50 sales and marketing leaders at financial services, merchant services, and verticalized SaaS companies. In these engagements, we explored their workflows and challenges, and asked what their ideal solutions might look like. It became clear that there is a big gap between desired capabilities for using data in sales and marketing and what the market currently offers. This served as inspiration for our subsequent product roadmap.

In Part One we discussed inbound and outbound marketing workflows and challenges. In this piece we’ll focus on the challenges that sales associates and marketers face in small and medium-sized business (SMB) account management, as well as some solutions that can help.

SMB Account Management

For most financial services sales and marketing teams interviewed by Enigma, account management – the practice of growing an existing customer base with upsells, cross-sells, and renewals – followed a similar flow:

Customer success (CS) or sales representatives met with SMB clients quarterly or monthly
CS teams collected usage data and anecdotal information from these meetings
CS teams created a customer health score
CS teams designed interventions to increase health of customers
CS teams renewed contracts, up-sold, or cross-sold

However, across each step of this journey, CS, Sales and Marketing teams reported challenges. Even with regular calls, Customer Success and Sales Leaders mentioned customers that churned with little notice. These leaders wanted earlier warning that a customer might be churning, reasons why clients’ budgets were decreasing, and a way to identify which clients were actually slowing down and which weren’t.

When creating a customer health score, CS representatives often had to rely more on the anecdotes of the customers themselves or a gut feeling, rather than data-driven evidence of these customers’ performances.

When attempting to cross-sell or up-sell, sales teams needed more data on expansion opportunities. For many payments and payment services companies, getting accurate revenues and share of wallet is critical. Figuring out whether they remain the dominant payment service or card processor is an early signal of their competitiveness in the market.

Enigma’s team worked to understand potential solutions like:

Third party objective view of SMBs’ performances: A look into the entire SMB landscape with more detail on full-economic performance across industries and revenue bands
Understanding individual SMBs’ revenues over time: An objective look at health and performance over time to arm sales and customer success teams with the data needed to up-sell and cross-sell

The Role of Data

Drawing on what we learned over the last year of interviews, Enigma started to build a sales and marketing data platform for financial services, merchant services, and vertical SaaS. This platform provides a third-party, objective view of the size, growth, and financial health of all businesses, along with near-real-time intelligence on contacts and technographics.

We started to zero in on the data features that would supercharge sales and marketing teams’ efforts across specific use cases. For every use case, ML-powered analytics were key: building lead scores for prioritizing leads, building churn scores for detecting churn risk, and finding lookalike populations.

We will give more information when we unveil our full solution set and how to use it. In the meantime, we suggest you check out Part 1 for information on the challenges and solutions facing inbound and outbound marketers targeting SMBs today.

How Data Can Serve Inbound and Outbound Marketers Targeting SMBs

Enigma — Mon, 08 May 2023 00:00:00 GMT

Enigma’s sales and marketing customers have consistently been frustrated with the state of available data on small and medium businesses (SMBs). To dive deeper into why, the Enigma team interviewed over 50 sales and marketing leaders at financial services, merchant services, and verticalized SaaS companies.

In these engagements, we explored marketing and sales teams’ workflows and challenges, and asked what their ideal solutions might look like.

One thing that became clear for us – and the impetus behind several ensuing product releases – was the big gap between desired capabilities for using data in sales and marketing and what the market currently offers.

In this post, we’ll dive into the challenges inherent in inbound and outbound marketing to SMBs and the data solutions needed to solve them. In Part Two of our series, we’ll look at the challenges sales and customer success associates face in account management.

Let’s get started!

Inbound lead prioritization

Most of the marketing teams from smaller companies we spoke with saw themselves as an inbound-first marketing team. For these companies, a typical customer acquisition flowed like this:

Visit: User arrives on the company's website.
Signup: User signs up with an email and potentially additional information about themself (business name, size, etc.) and becomes an inbound lead.
Lead prioritization and resource allocation: Company tracks engagement on the inbound lead through actions like clicks on content. More advanced marketing teams create a lead score - when the user gets a certain lead score, they are routed to sales and different kinds of engagement depending on score.

However, across each of these steps, marketing and sales teams faced common challenges. During signups, companies struggled with both junk signups – non-operating businesses, spam, etc. – and an inability to separate great signups from the chaff. Given the rampant increase in fraud, especially in financial services, some junk signups end up being actively malicious. Other valuable leads sign up with only a personal email address, and marketing teams fail to realize they have come from a desired account or may be a good prospect.

One Enigma customer on a financial services marketing team pointed out that over 90% of sign-ups end up taking no further action or are ineligible for workflows. Another said, “We get a firehose of leads from web sign-ups and the biggest morale killer for my team is to call after hundreds of leads that just aren’t eligible for our financial products. You just kind of give up.”

After signups, when marketing teams were trying to prioritize leads they also struggled with finding an objective truth about the information inputted by inbound leads. Information inputted by companies themselves wasn’t necessarily true. Third-party data sources that marketing leaders tried out in the past, meanwhile, had low coverage into their leads and didn’t help with conversion. Without enough data or the correct data, marketing teams had difficulties creating a lead score, sometimes relying more on gut instinct.

Enigma worked with these marketing and sales teams to think through potential solutions. Our engagements showed that there was a lot of low-hanging fruit that marketing teams could take advantage of to internally optimize their workflows. Sales and marketing teams were particularly excited by external data solutions such as:

Size of individual merchants: Having a sense of a merchant’s size was considered key to scanning through leads quickly and prioritizing the best ones.
Accurate revenues, transaction sizes, and payment processing stats for individual merchants: Many leads drop from sales and marketing flows if asked to provide banking or financial information too early on. Pre-permissioned data – that doesn’t require an immediate ask of bank connections or three months of bank statements – could help marketing teams keep valuable leads that might otherwise drop off.
Holistic lead scoring: When sales and marketing teams rushed to develop lead scores – without holistic data – prospects weren’t prioritized properly or didn’t convert. With both an objective, third-party view into a merchant’s value as well as internal data on engagement, sales and marketing teams are able to better allocate resources based on data, not gut.

Outbound lead prioritization

Few of the sales and marketing teams we spoke to had as mature an outbound motion as they did an inbound one. Teams with a mature outbound motion were larger or had a very specific target in mind, like an industry vertical or a revenue band. Outbound marketing was also more popular among vertical SaaS companies.

Outbound marketing – defined as creating an account list and specifically going after these named accounts with direct mail, email, events, tailored outbounds, etc. – typically followed the workflow below:

Define Ideal Customer Profile (ICP) universe. ICP mostly defined by industry, revenue size, etc.
Retrieve ICP business list: Get list of businesses in ICP universe or filter list of all businesses down to ICP
Get contact info: Find ways to contact businesses to prospect against such as addresses for direct mail marketing and emails for email marketing
Search for intent: More advanced marketing teams sought to use trigger data to narrow their targeting to businesses with high intent. For example, highly seasonal businesses look for working capital to build inventory before the high season.
Launch campaign

Once again, challenges faced by teams were relatively consistent.

When defining an ICP universe, marketing teams are tasked with asking strategic questions - Who should we be targeting? Who do we want to target? What do we need to build to expand our target market? Early on, it’s easy to be anecdotal, but as customer bases grow many teams wanted to incorporate data into their decision making. Without data, these teams were worried they were missing changing or new affinities from their potential customer bases.

Said one customer, “We think we have an idea of who would want to use us, but at any level deeper than a few sentences, it’s pretty anecdotal.”

Then, when retrieving ICP businesses, marketing teams struggled to decide which criteria to prioritize and where to get the information to make these decisions. How could they get an overview of the full SMB universe to cherry-pick leads that met their ICP?

After teams found the businesses they wanted to target, they were then tasked with getting contact info. Contact accuracy was considered key for outbound marketers, but many found outdated addresses or names. Enigma’s team was surprised to learn how many times front-line sales and marketing reps were asked to call businesses that had shut down already, for example.

Marketing and sales teams also wanted more information to signal intent – data that could help them discover whether and when an SMB might be looking for the services they were offering. Teams wanted data for “triggers” and “buyer intent” – key events in a business or a business owner’s lifecycle that indicated when a business might be most open to a new service.

When launching their sales and marketing campaigns, financial services and merchant services providers mentioned the need for transaction-level intelligence, from POS systems used to accurate revenue growth rates in order to create specialized offers. They wanted to give the best prospects pre-qualification and special offers.

Said one payment processor sales representative, “If I can find prospects who are paying too much on their current POS system and are processing over 1M a year, and give them a $5,000 cash bonus to switch over to me, I can win them over half of the time.”

Once again, Enigma and these marketing teams discussed potential solutions like:

Objective truth into the entire SMB landscape to define ICP: A third party view of the entire customer with data on industries, sub-industries, sizes, etc.
Answers to how to choose the best businesses to outbound: Teams wanted something similar to Facebook’s and LinkedIn’s lookalike populations but that they could directly download and use for their outbounding.
Contact info to reach SMBs: Accurate contacts that are verified and trusted across all channels - people, titles, emails, phone, address.

The Role of Data

For outbound prospecting, contact accuracy across all mediums is a colossal pain point. Beyond that, figuring out the best population to prospect – due to a recent event in the business’s lifecycle or similarities to the current customer base – is critical. For both inbound and outbound prospecting, we share our merchant services customers’ zeal for understanding prospects’ current payment or POS systems.

We’re excited to share our insights with you and are asking you to keep an eye out for May 10, 2023, when we showcase the Enigma SMB sales and marketing data platform built for financial services, merchant services, and vertical SaaS.

Want to learn more? In Part Two we discuss the challenges that financial services marketers and sales teams face after customers are acquired in account management.

Getting a Clear View of Revenues at Seasonal Businesses

Enigma — Thu, 30 Mar 2023 00:00:00 GMT

Small business lenders are constantly searching for the right balance between gathering critical information from loan applicants and streamlining the application process. Lenders want to see the historical financials of a business, but as more financial information is requested, more applicants will drop out of the process.

When underwriting seasonal businesses, this challenge becomes even more acute. Most lenders, especially alternative lenders, ask for three months of bank statements. But with only three months of history, it’s impossible to understand the full impacts of seasonality on a business.

The result? Lenders face a high level of uncertainty and risk when underwriting these kinds of businesses.

To better offset this added risk, many lenders are looking to alternative data to supplement 3-month bank statements from applicants. The right alternative data can help by providing a more accurate view of annual revenue and highlighting any seasonality in a business’s revenues.

What are the risks of using 3-month bank statements?

One problem with using 3-month bank statements is that they can be highly misleading for businesses with seasonal shifts in revenue.

Most lenders rely on annual revenues, so they end up multiplying the revenues present in 3-month bank statements by 4 to estimate a business’s annual revenue. This projection is distorted by seasonality, which means a lender can severely over- or underestimate the business’s revenue.
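To make the arithmetic concrete, here is a minimal Python sketch of the multiply-by-4 projection. The monthly figures are made up for illustration (they are not taken from Enigma's data), and the function names are our own.

```python
# Hypothetical monthly card revenues (Jan..Dec) for a summer-peaking business.
# These numbers are illustrative assumptions, not real data.
monthly = [1_000, 1_000, 1_000, 2_000, 4_000, 20_000,
           40_000, 5_000, 2_000, 1_000, 1_000, 2_000]


def naive_annualized(three_month_revenue: float) -> float:
    """Project annual revenue by multiplying a 3-month total by 4."""
    return three_month_revenue * 4


def pct_error(estimate: float, actual: float) -> float:
    """Signed percentage error of an estimate relative to the actual value."""
    return (estimate - actual) / actual * 100


actual_annual = sum(monthly)                            # 80,000
winter_estimate = naive_annualized(sum(monthly[0:3]))   # Jan-Mar statements
summer_estimate = naive_annualized(sum(monthly[5:8]))   # Jun-Aug statements

print(f"actual annual revenue: {actual_annual:,}")
print(f"winter application:    {winter_estimate:,.0f} "
      f"({pct_error(winter_estimate, actual_annual):+.0f}%)")
print(f"summer application:    {summer_estimate:,.0f} "
      f"({pct_error(summer_estimate, actual_annual):+.0f}%)")
```

With these assumed numbers, the same business looks 85% smaller than it really is when it applies in the winter and 225% larger when it applies right after its peak.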

Why do lenders request 3-month bank statements?

So why have 3-month bank statements become the norm for many lending decisions? Lenders face a trade-off: if they ask for more data at the time of onboarding, they can make better lending decisions… but friction increases, resulting in fewer applications. The less data that’s requested at the time of onboarding, the better the applicant experience.

Lenders have found that asking for 3-month bank statements versus longer-term bank statements significantly increases the size of their application funnel. Additionally, some ISOs (Independent Service Organizations) provide 3-month bank statements by default as part of a data packet on each small business lead, while other ISOs tend to guide lenders to use 3-month bank statements to increase the funnel.

What are the costs of underestimated or overestimated business revenues?

If applicant revenues are underestimated, SMB lenders miss out when they either decline healthy businesses that would have been able to make payments, or offer low loan sizes or lines of credit.

On the other hand, when revenues are overestimated, lenders end up taking on losses, approving businesses that may not be able to make payments, or providing loans or lines of credit that are too large for the business to repay.

Fireworks and tax accountants: real-world examples

To illustrate this problem, let’s consider three real seasonal businesses taken from Enigma’s database, which includes the current and historical revenues of more than 16 million businesses.

Our first example is a fireworks retailer that mainly sees revenue around the 4th of July every year. If this business applies in the winter months, it will look as though it has no revenue. As a result, its application will be declined despite sizable and growing annual revenue.

On the other hand, if this business had applied right after July, its annual revenue would be severely over-estimated, and the credit line or loan size given would also be overestimated and risky.

Similarly, accounting (CPA) firms primarily see revenue spikes during tax season. Below is a graph of two CPA firms. If either of these businesses applies in the fall, it will appear to have very low or no revenue, and its application will be declined or given a low credit line. If it applies right after tax season, its annual revenue will be severely overestimated, and the credit line offered may be too large.

As you can see in the table below, a lender relying on 3-month statements to underwrite any of the three businesses above would face a range of error from -100% all the way up to +400%.

How to supplement 3-month bank statements to de-risk lending (while protecting your funnel)

One solution to this problem is using alternative data to augment the view of a business’s revenues. Alternative data can help de-risk the usage of 3-month bank statements by providing a more accurate view of annual revenue. It can also help lenders better understand the seasonality of a business’s revenues.

At Enigma, we provide the full history of card transactions at a business going back to January 2017. This means lenders can get a clear view of seasonality and identify highly seasonal businesses based on past trends. Enigma’s data also enables lenders to project more accurate, seasonally-adjusted annual revenues from the 3-month bank statements they receive from applicants. Instead of just multiplying those statements by 4, lenders are now able to apply the appropriate projection based on historical revenues and seasonality trends.
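One simple way to picture a seasonally adjusted projection is to divide the 3-month total by the share of annual revenue those same months represented in a prior year. The Python sketch below is an illustration under that assumption only; it is not a description of Enigma's actual methodology, and the function names and figures are hypothetical.

```python
def seasonal_share(history: list[float], months: list[int]) -> float:
    """Fraction of a prior year's revenue that fell in the given months (0 = Jan)."""
    total = sum(history)
    if total <= 0:
        raise ValueError("historical revenue must be positive")
    return sum(history[m] for m in months) / total


def seasonally_adjusted_annual(observed: float, history: list[float],
                               months: list[int]) -> float:
    """Scale a 3-month total by the historical share of those months."""
    share = seasonal_share(history, months)
    if share == 0:
        raise ValueError("no historical revenue in the observed months")
    return observed / share


# Hypothetical prior-year monthly card revenues (Jan..Dec), illustrative only.
prior_year = [1_000, 1_000, 1_000, 2_000, 4_000, 20_000,
              40_000, 5_000, 2_000, 1_000, 1_000, 2_000]

# Assumed Jun-Aug total from this year's bank statements.
observed = 71_500
window = [5, 6, 7]

naive = observed * 4
adjusted = seasonally_adjusted_annual(observed, prior_year, window)

print(f"naive x4 projection:          {naive:,.0f}")
print(f"seasonally adjusted estimate: {adjusted:,.0f}")
```

In this example the peak months carried 81.25% of the prior year's revenue, so the adjusted estimate is 88,000 rather than the 286,000 that multiplying by 4 would suggest. A real implementation would likely average several years of history and handle businesses with too little history to estimate a share.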

Enigma’s data is pre-permissioned, which means there is no need to request permissions and create friction for applicants. It also eliminates the need for stipulations that reduce the acceptance rates of loans.

This results in the ability to safely underwrite more healthy businesses, reduce risk by identifying businesses that aren’t a fit, and reject businesses that aren’t a fit earlier in the funnel, saving costs.

Interested in learning more or getting a sample of Enigma’s data? Get in touch.

2023 SMB Data Landscape Report

Scott Steinberg — Sun, 26 Mar 2023 00:00:00 GMT

Intro

Our customers often ask us about the landscape of companies providing data about small businesses. Unfortunately, there aren’t any magic quadrants or industry reports that we can send them. The small business data space doesn’t get a lot of outside attention despite the fact that it is growing and innovating rapidly.

Enigma is revolutionizing the way businesses access critical information. As a company that has been working with data about U.S. businesses since 2011, we’ve acquired a deep knowledge of this landscape.

We hope that this report will provide B2SMB companies with a resource they can use to understand:

Which companies specialize in data about small businesses?
What kinds of data does each company provide?
Which companies are a good fit for which types of data problems? (seen in the more detailed report)

Methodology

Who made the list

Our first landscape report features 26 companies. In order to make the list, a company must meet the following criteria:

Providers of data: Provide B2SMB customers with data about small businesses in the U.S.
Broad U.S. SMB coverage: Data must be relevant for >50% of U.S. SMBs.
There are many companies that specialize in data about businesses in specific industries (for example law firms or medical practices). We did not include any of those businesses in this report.
Small business focus: The companies must explicitly focus on data about SMBs or cover all business types but have strong coverage of small businesses for B2SMB companies.
For example, Demandbase and 6sense are great B2B data companies. We did not include them in this report because they focus more on enterprise and mid-market businesses.
Adoption: Company must have some demonstrated adoption in market.

It’s also worth noting that this report focuses on companies who are providing access to data, not the creators of the original data points. For example, in the case of open banking companies, Plaid and Finicity are providing access and structure to banks’ data. The banks and SMBs are the data creators. Plaid and Finicity are the data providers.

Grouping like companies together

We categorized each company based on:

Their core differentiation in terms of the data or service they provide, and areas of specialization
What use cases they help customers with
Primary sources of their data
The permission rights for using their data

You can access a detailed view of each company's categorization in our accompanying interactive table.

For the landscape visualization, we decided to group companies based on their core differentiation in terms of the data or service they provide. Within each of these groups, we formed subgroups based on company specializations. This primary grouping view provides the most relevant information about what each company specializes in.

In the interactive table, we’ve sliced and diced the landscape according to all of the dimensions. Request access here.

What do you think?

This is our first SMB Data landscape report. Given the amount of innovation and growth in the industry, we plan to update the report regularly.

Did we leave someone out or mischaracterize a company? Let us know - we’d love to hear from you.

Q&A: Sarah Burrows of Modern Sprout

Enigma — Wed, 15 Mar 2023 00:00:00 GMT

How are today’s small businesses navigating growth across channels and what role can access to capital play?

The Enigma Blog sat down with entrepreneur Sarah Burrows, co-founder of Modern Sprout, to discuss D2C vs. wholesale, the advantages of PO financing, and how to navigate periods of rapid growth.

Enigma Blog (EB): How do your products reach your customers? What are the different sales channels that you rely on?

Sarah Burrows (SB): Now there's such an emphasis on omnipresence. And I think one of our greatest strengths is channel diversification. So we are now focusing more of our energy into our DTC platforms, while still having a foundation of brand awareness with credible retail, which includes everyone from Target to Nordstrom.

Omnipresence is becoming increasingly more and more important for a brand, and fortunately channel diversification is one of our greatest strengths. Our products reach customers by being on-shelf at key retailers, ranging from thoughtfully curated specialty boutiques to Whole Foods and Target. Online, we work with strategic ecommerce partners including Uncommon Goods and Food52, and have recently been investing energy into our D2C platform.

EB: How do you think about the split between wholesale and ecommerce retail?

SB: For the majority of Modern Sprout’s lifespan, we’ve been focused on wholesale…because keeping up with the demand was all we had time for. However, within the last two to three years the pandemic steered our focus towards our online platforms, as we experienced a major increase in traffic and sales on our website. In the midst of such a rapid growth phase, you’re constantly reacting and it can be a challenge to be proactive. We’re still primarily wholesale, but ecommerce is growing exponentially for us and we recently relaunched a new website in the fall of ‘22. It’s exciting, because there’s a lot of opportunity in that channel – a lot of low hanging fruit.

EB: With a newer focus on the digital side, what has your journey been in terms of selecting tools for ecommerce? How did you make those decisions?

SB: We were able to bring in a seasoned pro to lead our ecommerce department, which has been a game-changer. It’s easy to become completely overwhelmed by choices and determining what’s the right fit for our business, especially given our size and our growth trajectory.

We didn’t want to buy a tool that had way more functionality than we could use and we came across that a lot. But we also needed software that could grow quickly with us. Our previous site was a hybrid between WordPress and Shopify, and now we are exclusively on Shopify.

There are so many levers that we have yet to pull. But we've started: we upgraded from MailChimp to Klaviyo for email marketing, and we're in the process of integrating platforms like Gorgias to help with our customer service support and Bazaarvoice to help with review acquisition strategy.

We went through rapid growth in 2021 and everything changed for us. It enabled us to bring on new leadership positions and also empowered us to commit to new software platforms to support our growing business.

EB: What was your experience like getting capital for Modern Sprout?

SB: We started Modern Sprout on Kickstarter and crowdfunded $80k. After that we invested a modest amount of our money and bootstrapped it – we ran profitably and reinvested those profits in growth. We did a small friends and family raise that has since been paid out to almost everyone. When we started working directly with suppliers on full shipping containers we went out and found an amazing partner that specialized in PO [Purchase Order] financing. PO financing is incredible for a business like ours because it gives us access to capital as soon as we receive the order, long before the order ships and we are paid for it.

EB: What’s one challenge you faced when your business experienced rapid growth?

SB: It was hard for us to create a common persona of the Modern Sprout customer. The appeal of our product is so wide-ranging, which is amazing, but it can also make it challenging to decide where to focus.

When we started the brand, we were designing for ourselves and our needs: urban dwellers, who didn't have a lot of outdoor space but wanted to sustain an indoor garden for cooking. That was the catalyst for all of these ideas around how to make gardening more accessible for more people.

We learned a lot through our retail partners. We had a really hip store from Brooklyn that called saying a specific product, our original garden jar, just flew off the shelf and they needed more. The exact same day, a boutique in Fargo, North Dakota called and said the exact same thing. Just a few weeks later, we had an inquiry from Lowe's about our hydroponic planters and literally the same day, I had the same product inquiry from Goop for a very high-end retail outpost in Malibu. Those were defining moments for us, and a validation of why the products we were bringing to market were so needed.

EB: What’s something you think people get wrong about entrepreneurship?

SB: I don't have a metric to back this, but it feels like there are more and more people becoming entrepreneurs. Certainly, there are more tools to enable people to launch a company with fewer resources than ever before. However, a lot of people are scared to take that jump. And I think it's correlated to how people evaluate risks. In the modern world, we are led to believe that the resources are scarce and that we need a lot of things to feel safe, content, and fulfilled.

I really think that that hinders innovation. People are afraid to take the leap and temporarily live without the things that they think they need. Someone once said to me, “but I won't be able to contribute to a 401(k).” You’ve got to accept that it’s just a totally different game.

We took this risk. We had support so we always knew we would be fed and have shelter; we were fortunate in that. So, it was really just about learning how to enjoy the ride and knowing that if you don't hit the success threshold that you're aiming for, you'll most likely be able to pick yourself up, dust yourself off, and start again, but this time with a lot more insight.

EB: What’s something you think people get wrong about sustainability?

SB: I feel very passionate about sustainability. There's a lot of greenwashing out there. One of our biggest objectives is to grow the business in a way that is honest and open. We have to be accountable for every aspect of our business. We can control what’s happening under our roof, but the responsibility goes beyond that – we have to prioritize finding vendors that align with our values and, once we find them, challenging them to, like us, continue to innovate better ways to do business for the planet.

Enigma is proud to support small businesses like Modern Sprout. For more insights on the small business economy, explore our State of the SMB Economy report.

4 Ways Financial Institutions Can Leverage Data-Driven Marketing in 2023

Enigma — Wed, 01 Mar 2023 00:00:00 GMT

What does innovation really mean for marketers at financial institutions in 2023?

“It’s a question that keeps me up at night,” says Elissa Rodd, who leads the Product Strategy and Innovation group for omnichannel marketing services company FMCG by Deluxe.

“But I tend to look at it as what it's not,” she adds. “Innovation, to me, is not standing still, right? If you are standing still, you are not innovating. It means you're constantly moving forward. You are building new or elevated products.”

She does think Deluxe, which specializes in data-driven campaigns for financial institutions, has the best data lakes in the industry, but “I’ll never really let myself believe that,” she says. “Because if I do, then I'm going to stop innovating. I'm going to stop moving the business forward. I'm constantly learning more about what's out there.”

In a January 2023 webinar, Elissa joined Enigma’s VP of Marketing Madeline Ross to break down her four top tips for leveraging data in your financial institution’s marketing campaigns — in today’s volatile market and tomorrow, no matter what the future holds.

1. Evaluate the data

Data-driven marketing isn’t a one-note strategy. It’s a multi-step process that begins with evaluation.

✅ Cover the bases

Data “coverage” — essentially, a set of data that’s large and inclusive enough to constitute an accurate sample — is key.

There’s “an added component with coverage, though,” says Elissa. “Sometimes I'll evaluate a data source and see businesses I've never seen before. That's super exciting. But we always validate what we see.”

For example, she might see data suggesting businesses that appear new, “but they're actually bankrupt or out of business,” she adds. “That data is not useful to me.”

✅ Keep it fresh

Incremental or insufficient coverage (as in the example above) is a sign that data is old or “stale,” Elissa notes.

In order to validate how accurate and “fresh” a given data set truly is, the team at Deluxe needs to compare it to the data it knows to be true.

That’s why it’s critical to keep “truth files” of information known to be real, via direct observation and/or measurement, like empirical evidence.
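One way to picture a truth-file check is the minimal sketch below. It is not Deluxe's actual process: the field names (`business_id`, `is_active`), the sample rows, and the two metrics are assumptions made up for illustration. It compares a vendor's records to a small truth file and reports how much of the truth file the vendor covers and how often the vendor agrees with it.

```python
def truth_file_agreement(vendor_rows, truth_rows, key="business_id", field="is_active"):
    """Compare a vendor data set against a truth file on a single field."""
    # Index the truth file by its key so each vendor row can be looked up.
    truth = {row[key]: row[field] for row in truth_rows}
    # Only vendor rows that also appear in the truth file can be checked.
    matched = [row for row in vendor_rows if row[key] in truth]
    agree = sum(1 for row in matched if row[field] == truth[row[key]])
    return {
        # Share of truth-file businesses the vendor has at all.
        "coverage": len(matched) / len(truth) if truth else 0.0,
        # Share of checkable vendor rows whose value matches the truth file.
        "accuracy": agree / len(matched) if matched else 0.0,
    }

# Hypothetical sample data, invented for this sketch.
truth_rows = [
    {"business_id": "a1", "is_active": True},
    {"business_id": "b2", "is_active": False},
    {"business_id": "c3", "is_active": True},
    {"business_id": "d4", "is_active": True},
]
vendor_rows = [
    {"business_id": "a1", "is_active": True},
    {"business_id": "b2", "is_active": True},  # vendor lists a closed business as active
    {"business_id": "c3", "is_active": True},
    {"business_id": "z9", "is_active": True},  # not in the truth file, so it cannot be verified
]

report = truth_file_agreement(vendor_rows, truth_rows)
print(report)
```

A record like `z9` is the "business I've never seen before" case: it may be a genuine new find or a stale entry, and the truth file alone cannot say which.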

✅ Test, test… and test some more

There’s a saying about software testing: If you don’t like testing your product, chances are your customers won’t test it either.

The same might be said for data. Chances are, insights from untested data won’t lead to better business outcomes. So Elissa recommends beta-testing your campaigns and any other projects — live.

“That's where you get the key performance metrics you want,” she says.

There’s a misconception about using third-party data for B2B marketing: that all data is created equal. That’s emphatically false.

“I talk to data partners all day long, and I can tell you: They all tell me they have the best data,” Elissa says. The only way to ensure truth is to test and validate it yourself.

2. Build a solid foundation of data

“Data-driven” is more than a buzzword to Deluxe. Being data-driven means “every single marketing decision is rooted in data and analytics,” says Elissa. “And when I say everything, I mean everything.”

That includes everything from ad design (imagery, copy and calls to action) to scripts for cold calling. Data also informs how you determine your target audience, how you build analytic models and how you interpret the results at the end of each campaign.

“I like to joke that even our data is data-driven,” Elissa says.

The basis for a data-driven marketing strategy begins at home, so to speak: with the data your organization collects from its own customers and audiences.

First-party data is “your bread and butter,” says Elissa. But “there's a fine line between collecting data and annoying your customers,” she adds.

If you're buying a tube of toothpaste online and the retailer asks you to provide your income, marital status, family size and whether you have pets, you might decide not to check out. Elissa does think, however, that companies should take every opportunity to capture one or two data points at a time, slowly building a “rich profile” on each customer.

First-party data “enhances what third-party data can do, and it also can help build really great models and know your customers better,” Elissa advises.

When it comes to third-party data, she emphasizes the importance of testing yet again.

“Talk to lots of data partners, see what they have to offer and test,” says Elissa. “And then based on testing, onboard those that are really working for your unique problems.”

3. Maximize impact despite market volatility

The economy is a rollercoaster, especially as we emerge from the pandemic. How can businesses maximize impact through all the ups and downs?

✅ Stay the course

“Whenever people hear ‘volatility,’ they get scared and they want to turn off marketing programs,” says Elissa. “We saw a lot of that in 2020.”

But she recommends staying the course — even “doubling down” on marketing in difficult times.

“We always like to say, If you're not talking to your customers, someone else is,” she says.

In order to prevent that from happening, organizations benefit from “being out there” so their messaging breaks through.

✅ Practice data hygiene

“Good and healthy data” is essential in a volatile market.

“One of the reasons we love Enigma is it gives us really great health indicators, [such as whether] businesses are active,” Elissa says, noting that “trying to figure out when businesses were going under” is a pain point for Deluxe and other companies in the fintech space.

✅ Create flexible payment schedules

“Deluxe believes in marketing programs in good economies or bad economies,” says Elissa. “So we offer pay-for-performance marketing, which allows us to take on a lot of the risk and our clients to feel better about going out in these volatile times — because they know we're going to get the results they want.”

In practical terms, that can take many forms. In Deluxe’s case, it includes flexible payment schedules with partners.

✅ Look for customers’ ‘triggers’

Elissa defines “trigger marketing” as “the intersection of relevance and timing.” On the consumer side, typical triggers are major life events like getting married, having a baby, buying a house and retiring — which lead to large purchases, big financial decisions and other predictable behaviors. Deluxe found that in a given year, 10% of consumers switch their financial institutions. Two-thirds of that 10% do so in conjunction with a major life event.

“We like to call these ‘hand-raisers,’” says Elissa. “These are the people who are most likely to interact with your brand, to convert to your brand.”

On the business side, she notes that new businesses tend to exhibit predictable behaviors as well, especially in the area of finance. They often need business checking accounts, payment services, merchant services and financial expertise. “Trigger data” can help identify these potential clients and provide timely reasons to reach out to them.

✅ Meet customers where they are

It's fairly easy to find consumers on social media and other digital channels like email. But on the business side, it's a little bit more difficult. However, “we still do that ‘surround-sound’ marketing,” says Elissa — focused on the channels data suggests are most likely to resonate.

Sometimes data can reveal unexpected truths.

“I know that nobody ever wants to hear that direct mail is still king — or queen — but it's true,” she adds. “We have data points that suggest that. But I think it's twofold — the trigger marketing and also just making sure we're reaching people in the channels they're most likely to engage with.”

4. Beware of common B2B marketing data mistakes

Truly data-driven strategies are more precise than traditional market research, but they’re not without pitfalls.

Poor entity resolution

Entity resolution, the work of matching records that refer to the same real-world business, is critical in this space. It is often hard to do with businesses because they operate under so many names.

“I might have a flower shop, and my legal entity name is ‘Elissa Rodd, LLC.’ But my storefront says ‘Elissa's Perfect Petals,’ and my business banking checking account says ‘Elissa Rodd.’”

While we may recognize as humans that those three business names are connected, it’s difficult to make those connections at scale — often, amid tens of millions of rows of data. When it comes to the loan qualification process, for example, it’s common for some of those businesses to fall through the cracks.
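To see why this is hard, here is a minimal sketch that uses only Python's standard library. It is not how Deluxe or Enigma resolve entities; the suffix list and the idea of scoring on names alone are simplifying assumptions. It normalizes business names and scores how similar two of them are.

```python
import re
from difflib import SequenceMatcher

# Assumed, deliberately short list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"llc", "inc", "corp", "co", "ltd"}

def normalize_name(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def name_similarity(a, b):
    """Return a 0-to-1 similarity score between two normalized business names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

legal_vs_bank = name_similarity("Elissa Rodd, LLC", "Elissa Rodd")
legal_vs_storefront = name_similarity("Elissa Rodd, LLC", "Elissa's Perfect Petals")
print(legal_vs_bank, legal_vs_storefront)
```

Name cleanup alone links the legal entity to the bank account name, but the storefront name scores low. Connecting it would take another signal, such as a shared address or phone number, which is part of why matching at scale is difficult.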

If it didn't work before, try and try again

Just because data-driven marketing hasn’t worked for your organization in the past, that doesn’t mean it won’t be effective in the future. Things change — and that can be a big advantage. It could be the market, an algorithm or a shift in how a partner collects data.

“Not all data is created equal,” Elissa says. She warns against assuming that because an approach failed once, it will fail if you try it again.

Big data doesn’t necessarily mean massive data.

“You don't need a massive amount of data,” says Elissa. “We [at Deluxe] need a massive amount of data. Enigma needs a massive amount of data. Our data partners need a massive amount of data.”

But to harness the power of data-driven marketing, your organization doesn’t necessarily need millions of terabytes. Collecting (clean, clear) first-party data from your own customers may be plenty to make a huge impact. The right data is the data that solves your business problems.

This article is based on a webinar presented by Enigma.

How Payment Processors Can Gain Visibility Into Opportunity Size

Enigma — Fri, 17 Feb 2023 00:00:00 GMT

In late 2021, a customer we’ll call Mariel missed her goals by 20%.

Mariel is a VP of Customer Success at a company that provides payment processing and other services to merchant accounts. Her problem? She was seeing a decline in processing volumes across several accounts.

Was an economic slowdown to blame? Were her accounts beginning to switch over to a competitor? It was impossible to say with the visibility she had. Mariel wasn’t able to explain to her boss why the declines were happening, or to suggest a concrete action plan to fix the issue.

Miguel, another customer working at a merchant processor, faced a different problem. Miguel oversaw 10 relationship managers covering 1,000 merchant accounts. Miguel wasn’t confident that his relationship managers were prioritizing the right accounts each quarter.

Miguel needed a way to help his relationship managers prioritize their time on the merchants where they could have the biggest impact: not only the accounts that were at risk of churning, but also those with the highest growth potential.

Though Mariel and Miguel appear to have different problems, they both are in need of the same solution. It’s a concept known as “total opportunity size” (or share of wallet, share of spend, share of business, share of purchases – depending on the context). And figuring it out is key to expanding and prioritizing your existing customer accounts.

Mariel needed to understand the full processing volumes at her customers — the “denominator” of business activity for a particular customer.
Similarly, Miguel’s team could better prioritize accounts if they could understand how their share of business compared to the total potential revenue at each account.

What is total opportunity size?

Total opportunity size is an umbrella term for how big a business opportunity a customer or prospect might represent. Opportunity sizing comes in many flavors.

For many service or software providers, opportunity size is understood as total wallet size — how much a company is willing and able to spend to solve a particular problem.

For merchant or payment processing companies like Mariel’s or Miguel’s, opportunity size usually refers to the total processing volumes or total gross merchandise value of a merchant.

Why is understanding total opportunity size important?

Total opportunity size is critical to understanding how much room there is to grow with a merchant. Clarity here can be powerful for a variety of business decisions:

Prospecting: Understanding the revenues and processing volumes of an account enables Marketing and Sales teams to more effectively segment their target database. Opportunity size can be a key input into lead scoring and Ideal Customer Profile (ICP) targeting, improving ROI on campaigns.
Prioritize the Customer Success team’s time: Based on accounts’ upside potential, you can better prioritize how your Customer Success team allocates their time. Some companies create account scoring models to reflect this.
Inform strategy: From a strategy perspective, understanding total opportunity size with customers gives you clues about your product or services’ relative performance and potential across regions, territories, divisions, or product lines. This can help guide how you should best allocate resources.
Guide product decisions: Opportunity size can also guide your decisions around product investments. Based on where opportunity size is high but your penetration is low across an entire sub-vertical, you might decide to increase spending on innovation and new product lines and features to gain share of wallet.
Benchmark sales performance: For Buy Now Pay Later (BNPL) companies, visibility into total opportunity size can also help benchmark performance for salespeople across territories. If you know upside potential across accounts, you can understand the performance of different territories and assign performance comps appropriately.

Why is it hard to determine total opportunity size?

Total opportunity size can be tricky to calculate because many companies lack the right data.

Total revenue figures for private companies aren’t readily available, or at least haven’t been available historically.

Often, companies will try to gather this information manually. Team members will ask about total processing volumes during sales calls or integration calls, and then enter that number into the CRM. The problem is, this intel may quickly go stale or could be inaccurate to begin with. And that kind of manual process doesn’t scale effectively.

Another common approach is to look at proxies of business size that are easily observable. For example, headcount may be used as a proxy for the size of the revenues or budget a business has. This type of approach can give a directional indication of whether one business on average is expected to have higher processing volumes than another, but it can’t predict the specific processing volumes of a business within any reasonable range of error.

What kind of data can help me understand opportunity size with customers?

The good news is that there are now sources of data that provide visibility into the revenues of any business that accepts credit cards. With Enigma’s Merchant Transaction Signals, you can see the monthly revenues and growth trends of more than 16 million card-accepting merchants.

Credit and debit card transaction data is a merchant’s financial footprint. And with the increasing adoption of credit and debit card payments, this data has become a valuable tool for gaining an accurate understanding of total opportunity size — quite possibly the best available proxy for total revenue and trends for most Main Street businesses.

We partnered with Miguel to see how Enigma’s card transaction data could help him figure out where to prioritize his account managers’ time.

By adding credit and debit card transaction data to his account intelligence, Miguel could clearly see both his company’s share of wallet at each account and the trend in each account’s total processing volumes.

Miguel decided to prioritize the team’s time on two kinds of accounts:

Accounts where his company’s market share was dropping
Accounts where total processing volumes were rapidly growing

In the following months, Miguel’s team was able to see which accounts needed their attention most and spend their time accordingly. Churn dropped and processing volumes grew, enabling the team to hit their goals.
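The two prioritization rules above can be sketched in a few lines. This is an illustration under stated assumptions, not Miguel's actual model: the field names, the sample figures, and the 10% growth threshold are all hypothetical. Share of wallet is computed as the processor's own volume divided by the merchant's total card volume.

```python
def share_of_wallet(our_volume, total_volume):
    """Our processed volume as a fraction of the merchant's total card volume."""
    return our_volume / total_volume if total_volume else 0.0

def prioritize(accounts, growth_threshold=0.10):
    """Flag accounts whose share is dropping or whose total volume is growing fast.

    growth_threshold is an assumed cutoff (10% period over period), not a
    figure from the article.
    """
    flagged = []
    for acct in accounts:
        prev_share = share_of_wallet(acct["our_prev"], acct["total_prev"])
        curr_share = share_of_wallet(acct["our_curr"], acct["total_curr"])
        if acct["total_prev"]:
            total_growth = (acct["total_curr"] - acct["total_prev"]) / acct["total_prev"]
        else:
            total_growth = 0.0
        reasons = []
        if curr_share < prev_share:
            reasons.append("share dropping")
        if total_growth >= growth_threshold:
            reasons.append("volume growing")
        if reasons:
            flagged.append((acct["name"], reasons))
    return flagged

# Hypothetical accounts with volumes for the previous and current period.
accounts = [
    {"name": "Cafe A", "our_prev": 50_000, "total_prev": 100_000,
     "our_curr": 40_000, "total_curr": 100_000},
    {"name": "Shop B", "our_prev": 30_000, "total_prev": 60_000,
     "our_curr": 45_000, "total_curr": 90_000},
    {"name": "Salon C", "our_prev": 20_000, "total_prev": 40_000,
     "our_curr": 20_500, "total_curr": 41_000},
]

priorities = prioritize(accounts)
print(priorities)
```

In this toy data, Cafe A is flagged because its share fell from 50% to 40%, Shop B because its total volume grew 50%, and Salon C is left alone because neither signal fires.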

Understanding total opportunity size can help guide strategy, better allocate customer support and account management resources, and proceed with confidence as you prioritize accounts for expansion.

And at a time when we’re all doing more with less, tools like credit and debit card transaction data can offer intelligence about opportunity size at scale, far outperforming one-off, manual methods.

Learn more about how card transaction data is giving go-to-market teams an edge: get the guide.

B2B Data Enrichment — Tools and Best Practices

Enigma — Thu, 02 Feb 2023 00:00:00 GMT

Your business success starts with your data quality. This is especially true when it comes to data for sales and marketing. Regardless of how good your products and services are, how talented your sales and marketing teams might be, or how much budget you allocate toward customer acquisition, you’ll be less successful if you’re starting with data that is incomplete or outdated.

To put your business and your prospecting efforts on the right path, you need consistent, reliable B2B data enrichment to ensure that you have the best information to target the most qualified leads with speed, clarity, and confidence.

What is B2B Data Enrichment?

You likely have a lot of customer data (or company data). But do you know exactly where the data comes from, or how up-to-date it is? That’s where B2B data enrichment comes in.

Also known as data appending, B2B data enrichment is the practice of using external, third-party data to improve the customer information that you already have. This includes not only firmographic data and data points about companies you’re targeting (such as industry, location, revenue, and contact information), but also personal contact data about employees who can serve as your contacts (such as job title, phone number, email address, department, decision-making ability, and social media accounts).

No matter how conscientious you have been in acquiring and curating your existing data, you can still have entries that are:

Missing critical details about a prospect
Full of irrelevant details you don’t need
Inconsistent across your entire company, depending on where they came from and who entered them
Outdated, based on when they were last used or evaluated

B2B data enrichment fills in these gaps, updating your ideal customer profile data with the accurate, relevant information you need to properly evaluate business prospects and make more informed decisions.

Benefits of B2B Data Enrichment

When you prioritize data enrichment, your business sees many tangible benefits, especially on the lead generation and customer experience front.

More targeted outreach

When you’re working with more complete, detailed, up-to-date information, you can better define your ideal customer profile (ICP) and target prospects with actual buyer intent more directly with your sales outreach and marketing campaigns.

More efficient performance

Save time, effort, and money across your entire enterprise by enriching your data. Your marketing and sales teams don’t have to spend precious hours trying to score dead or incomplete leads, and you don’t have to spend money on outreach for prospects who either don’t fit your targeted profile or aren’t ready to buy.

A more successful sales funnel

The more you know about your prospects, the better you can tailor your approach and create a relevant sales pitch that really connects. By providing a better experience for would-be customers, you can increase your engagement and your conversion rates.

A competitive advantage

You and your competition are likely going after the same business prospects. The company that has the best small business database and works with the most powerful, accurate data sources and insights has a distinct advantage over the competition and a better chance of converting those prospects into customers.

Smarter business decisions

Remove the guesswork from your day-to-day operations. When you base your business on better data, you can make more informed decisions, whether you’re prospecting for new customers or even revisiting the services you’re providing to existing customers.

B2B Data Enrichment Tools & Best Practices

To make the most of your B2B data enrichment initiatives, you should follow these best practices:

Define your goals: As with any business initiative, clearly define up front what you want to achieve, the data you need to enrich, your KPIs for success, and your process for getting there.
Evaluate your current data set: Find out where the gaps in your current data fields are. Maybe you need to fill in missing customer criteria like contact details, update out-of-date information, or just set specific targets for data accuracy within your CRM.
Plan for ongoing enrichment: B2B data enrichment isn’t a one-time project—it should be a routine part of your business that you continuously plan and budget for, the same as you would any other part of your sales and marketing efforts.
Create a repeatable, scalable process: As part of your ongoing data enrichment efforts, the process you choose should be scalable across your entire organization and simple enough that it can be repeated easily at regular intervals.
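As a starting point for the "evaluate your current data set" step, the sketch below measures how complete each field is across a set of CRM records. The record layout and field names (`industry`, `phone`, `revenue`) are hypothetical, chosen only to illustrate the idea.

```python
def field_fill_rates(records, fields):
    """Return the share of records with a non-empty value for each field."""
    total = len(records)
    rates = {}
    for field in fields:
        # Treat missing keys, None, and empty strings as gaps.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        rates[field] = filled / total if total else 0.0
    return rates

# Hypothetical CRM export, invented for this sketch.
crm_records = [
    {"name": "Acme Bakery", "industry": "Food", "phone": "555-0100", "revenue": 250000},
    {"name": "Bolt Cycles", "industry": "Retail", "phone": "", "revenue": None},
    {"name": "Cedar Dental", "industry": "Health", "phone": "555-0199"},
    {"name": "Delta Print", "industry": None, "phone": None, "revenue": None},
]

fill_rates = field_fill_rates(crm_records, ["industry", "phone", "revenue"])
print(fill_rates)
```

Fields with the lowest fill rates are natural first targets for enrichment, and rerunning the same check at regular intervals gives a simple, repeatable measure of progress.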

B2B data enrichment can be done manually. However, because that is a major time and resource commitment, many organizations opt for a variety of enrichment tools and software plug-ins to help streamline the process. For example, some tools can integrate with platforms like LinkedIn and Gmail to easily mobilize your teams’ research and outreach efforts.

There are also a number of data enrichment services that companies can partner with to elevate their data sources even more, often done automatically without any extra effort needed on your end.

As with any sourcing decision, you should evaluate any potential data enrichment tool or service to ensure that it:

Provides the quality data you need
Works with your budget, staff, and workflows
Aligns with both your present and future enrichment goals

B2B Data Enrichment to Boost Your Business

Ultimately, B2B data enrichment is about making sure that your sales and marketing teams are using the best possible data sources when working with a target audience for customer acquisition. For companies that are interested in having the highest-quality data to make their prospecting decisions, Data as a service (DaaS) providers like Enigma are a powerful option.

Using our proprietary blend of online, offline, public, private, and third-party data (powered by machine learning), Enigma provides a complete picture of any small or medium-sized business. We consider everything from government filings to social media posts to assemble the type of actionable data you need to make informed decisions, including business identity, transaction, revenue, profitability, and risk factor data, so you can better identify and target your prospects.

Search and sort by custom lists, find new prospects, and re-evaluate existing prospects—quickly and efficiently, with APIs and generated reports that integrate into your current systems and workflows, so you don’t have to overhaul your processes and personnel.

If you want to learn more about Enigma, our data, and the ways we can help your B2B sales cycle, sign up for your free demo today.

Entrepreneur Paige Graham on Growth, Seasonality & the Best Merchant Tools

Enigma — Tue, 27 Dec 2022 00:00:00 GMT

Paige’s Candle Co. is a homegoods merchant based in New York City and a small business collaborator for us on custom gifts.

We caught up again with founder Paige Graham on how her business is evolving, her favorite tools, and plans for growth. (Catch up on our previous interview if you missed it.)

Enigma Blog (EB): What’s your biggest small business lesson from the past year?

Paige Graham (PG): We’re learning how to be more conservative with major purchases.

During 2020, the market was all over the place, and I didn't know when I would have specific items available. One time, we had a shortage of our best-selling jar size for about four months. So in 2021 I focused on building up inventory exponentially. And for 2022, I’ve been making sure I'm more conservative and not basing my numbers off of panic buying.

But COVID effects are ongoing, so I’ve just been watching my terms with suppliers and trying to be fair for my consumers pricing-wise — and practicing patience.

EB: What are your sales channels? Is there a seasonality for homegoods?

PG: We sell across a few different channels. We do in-person sales at craft markets and trade shows. We sell and ship to customers online through our website. We do brand partnerships. Wholesale is another aspect, where we ship to various locations that resell the product.

In general, the fall and winter seasons are the best time for in-person markets. Brand partnerships spike during the fourth and first quarters, sometimes into second quarter. For wholesale, peak times are early spring and late summer.

EB: What does the next stage of growing your business look like?

PG: My philosophy is to find a niche and blow it up. I get a lot of questions about how to start and operate a candle business. We’re really excited about a new direction coming early 2023: we’ll be launching digital programs to teach people how to start and run a successful candle-making business.

On the retail side, one of my main initiatives for expansion is building more brand partnerships. I love working with larger companies and creating custom candles for larger brands — it’s a lot of fun. I’m also looking to expand Paige’s Candle Co. via online and in-store reseller partnerships.

EB: Where are you investing for growth?

PG: We've invested about 20% more in digital marketing this year, like Facebook ads, Google ads and Instagram promotions. It’s been very fruitful. We’re seeing a great return on investment, so we plan to continue.

EB: What kinds of tools have been valuable for the ecommerce side of your business? How do you determine which vendors you work with?

PG: On the accounting side, I hear QuickBooks can be a bit challenging for some, but it's been amazing for me. I also have payment processing through QuickBooks. Having everything in one place is key for me — and especially for my accountant when tax season approaches.

And then I absolutely love Shopify as my main ecommerce website. There are wonderful marketing tools that are integrated within the platform. For example, I use MailChimp for my newsletter and it’s integrated seamlessly.

As far as finding platforms and tools that will help my business, I do extensive research to see what is best for me. I appreciate free trials for specific programs so I can get a feel of whether or not a tool is a good fit for us.

EB: What’s something you think people get wrong about entrepreneurship?

PG: Hands down, I feel the vast majority of people overlook the patience required. We live in a culture where everyone is used to immediate gratification. When we look on our phones, we have one-minute videos. When we want to buy something, we can get it in a day. Building a business does not work that way. A lot of people expect to form a business and receive amazing results within three months. And you definitely can, but that's more of an outlier. The reality is, it takes a lot of patience. Building a successful business takes time.

Through our data, Enigma is proud to help financial institutions find successful small businesses, like Paige’s Candle Co., and connect them to the capital they need to grow and thrive.

Recession Watch? 9 Insights on the Small Business Economy

Enigma — Fri, 18 Nov 2022 00:00:00 GMT

Amid market volatility and inflation, many economists believe we’re poised for an economic downturn or a recession. For small- to medium-sized businesses (SMBs) and those who lend to them, priorities are shifting.

Enigma partnered with Fintech Nexus to host a virtual panel discussion about how to navigate SMB lending in a down market and why data will help us weather the impending storm. Here, we break down the top takeaways from the discussion and actionable insights for lenders.

Who are U.S. small businesses and what are their capital needs?

First, let’s clarify who we mean by small and medium businesses.

Many of today’s small businesses are very small: roughly 85–90% of small businesses in North America employ fewer than 20 people, says Chris Scislowicz, Managing Director and Head of Lending at Accenture, and of those, 50–55% are sole proprietors. The average revenue for a small business is $60,000.

Small business owners tend to need access to short-term capital like credit cards, lines of credit, and term loans. They’re spending their time focused on running their businesses, so it’s important that we meet them where they are and offer lending options that are fast and convenient, Chris says.

The pandemic accelerated the digitization of small and medium businesses, which could prove to be a silver lining if we’re heading into a downturn that requires more agility and flexibility.

On to the question of a downturn.

Economic outlook: is a recession looming?

While none of the panelists profess to predict the future, they’re monitoring economic indicators closely and drawing upon past experience with the 2008 financial crisis.

Classic signs of a downturn include a drop in payment rates and an uptick in delinquencies. But before these manifest, the first indicators are changes in spending patterns and levels, especially discretionary spending, among small business owners and consumers, according to Lakshmi Narain, Managing Vice President of Apollo, a subsidiary of Capital One.

A few of the panelists’ observations so far:

Consumers keep spending

Despite market turbulence and uncertainty ahead, Mastercard data shows consumers are continuing to spend—but the mix is changing, says Jane Prokop, Mastercard EVP of Small and Medium Enterprises.

Retail sales (excluding automotive) were up in August, about 12% year-on-year and up about 20% compared to August 2019
Online sales grew about 10% compared to 2021
The mix across sectors is dynamic. Sectors that focus on consumer experiences, like restaurants, airlines and lodgings, saw strong double-digit growth last year
Mastercard expects retail sales for the 2022 holiday season to increase about 7% from last year, with a small bias toward in-store sales (8%, versus a projected increase of 4% in ecommerce)

SMB owners continue to spend

Spending by SMB owners seems to be resilient globally, with double-digit year-on-year growth that has been accelerating since August 2022.

Certain categories, like food, warehousing, professional services, and travel and entertainment, have experienced the most significant growth, according to Mastercard data.

Credit risk is normalizing

Lakshmi says he’s seeing credit risk continuing to normalize, although it’s still below pre-pandemic levels. SMBs are still spending, but they’re showing caution given inflation, increasingly price-sensitive customers, and a wariness about passing increased costs of goods on to customers.

Lakshmi sums up lending conditions with this driving analogy: “If you’re a typical lender, you still have one foot on the accelerator and the other foot on the brake, holding your steering wheel tight and watching who you’re giving a ride to. And if you’re a typical borrower, you’re asking yourself, ‘Can I afford that ride, given it’s starting to get expensive and starting to get a bit bumpy along the way?’ But overall, conditions still look good.”

Individual small business performance has declined in 2022

Enigma is also seeing spending increase at the macro level. But when it comes to individual SMB performance, Enigma Chief Product Officer Scott Steinberg says the percentage of small businesses with revenue growth dropped by ten percentage points from the beginning of 2022 to the end of the third quarter, according to a sample of Enigma data on small business monthly card revenue.

At the beginning of 2022, the percentage of small businesses that were growing (adjusted for inflation) was about 60%. By the end of Q3, that number had dipped to below 50%.

And yet: SMB owners are a resilient, opportunistic group

Small business owners continue to show themselves as a resilient and opportunistic group, Lakshmi says. New business launches have kept rising since the pandemic, which acted as a catalyst and brought a 23% year-on-year increase in startups. New businesses emerging today are tapping into opportunities created by the rising cost environment and disruption in the market.

What’s different about this downturn?

Today’s leaders draw upon lessons from not only the financial crisis of 2008 but a lingering global pandemic.

Fortunately for lenders and small businesses, if we’re staring down another economic slump, we’re bolstered by technology tools that can enable more informed, fair and thoughtful decisions than ever before.

And there’s still work to do. Panelists reflected on a few opportunities and trends in particular.

Micro-segmentation

Historical models for analyzing borrowers are out of date, says Chris. Micro-segmentation is a trend we've seen emerge during the pandemic and will be especially relevant as we head towards a potential recession.

Before the pandemic, if a bank was considering lending to a restaurant in a particular geography, it would look at the economics of that area. When COVID hit, lenders needed to know whether a restaurant offered takeout and delivery services. Lenders also needed to know whether businesses were considered an essential service.

The way businesses are evaluated will continue to evolve as data plays a bigger role in underwriting decisions, Chris says.

Greater focus on portfolio monitoring

In the last six months, Scott and the team at Enigma have observed a “sea change shift,” especially among fintech partners, in investments to “beef up” their portfolio monitoring capabilities. They want to ensure those capabilities are where they need to be “if the economy is heading the way we expect it to go,” he explains.

Use all the data

Jane says it’s important for lenders to use all the data they have at their disposal, and that it’s critical to consider things like the relative diversity of an SMB’s customer base.

“They could have terrific revenue from two customers and then that revenue can disappear practically overnight if they lose one of those customers," Jane says. "Understanding the details behind the cash flow is really important, particularly as you go into these kinds of periods when there could be severe disruptions in business.”

Data and decisions with intention

Chris cautions lenders to remember that the availability of data needs to go hand in hand with “procedural and process maturity” and that it’s critical to consider how to leverage that data appropriately. We should be asking, “What's important to make a decision? What are we allowed to use to make a decision? What's morally or ethically appropriate to make a decision based on?”

2023 and beyond: The need for speed (and agility)

Lakshmi notes that the SMB lending industry is better equipped than it was in the “Great Recession” of 2008-2009. He thinks thriving through a recession comes down to two things: how fast you can detect change or insight, and how fast you can act on what you detect.

“Technology has come a long way, both in providing the ability to access real-time data and in the applied science behind detecting patterns, trends and anomalies from using that data,” Lakshmi says.

Generalized credit scores are an incomplete picture of a business’ overall “wellness.” Platforms that incorporate alternative data sources, like Enigma, can increase approval rates dramatically.

As we look to the future, clear visibility into card revenues, revenue growth, transaction volumes and other metrics of financial performance can help lenders help more small businesses grow and thrive.

-------------------------------------------------------

This article is based on a panel discussion presented by Fintech Nexus and Enigma. Watch the replay.

Thriving through a recession comes down to two things: how fast you can detect change or insight, and how fast you can act on what you detect. – Lakshmi Narain, Managing Vice President of Apollo, a subsidiary of Capital One

Reducing Bias in Our Dataset with New Card Types

Enigma — Tue, 15 Nov 2022 00:00:00 GMT

Enigma is on a mission to build a complete picture of U.S. businesses and provide the most accurate data about their revenues. We’re excited to share that we have now integrated a panel of alternative cards into our data that will increase the accuracy of our revenue data for all Enigma customers.

Enigma builds revenue estimations based on actual transactions from a panel of cards. A variety of factors impact the accuracy of our revenue estimates:

The size of the panel
Any bias or skews in the panel make-up
The precision of tagging transactions to a particular merchant

The new addition of alternative card types into our dataset will have a great impact on the first two factors: increasing our total panel size and reducing bias.

Increasing panel size

Prior to the incorporation of these new sources, Enigma’s card panel was already the largest in the U.S., covering over 70% of consumer general purpose credit cards and nearly 50% of debit cards. It’s now even larger.

We’ve grown the total number of active cards in the panel by 7%, from about 700 million to 750 million, with "active" meaning cards with at least one transaction in the past six months. The card spend in the panel has also grown by nearly 5%, from $11.7 trillion to $12.3 trillion.
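The growth percentages above can be checked with a short calculation. This is a minimal sketch that uses only the rounded totals quoted in this post; the function name `pct_change` is our own, and because the inputs are rounded the results can differ slightly from the stated percentages.

```python
def pct_change(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Rounded figures quoted in the post (not the exact underlying totals)
cards_growth = pct_change(700e6, 750e6)       # active cards in the panel
spend_growth = pct_change(11.7e12, 12.3e12)   # card spend in USD

print(f"Active cards: {cards_growth:.1f}%")
print(f"Card spend:   {spend_growth:.1f}%")
```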

As our economy continues to move away from cash — its usage dropped 15% in 2020 with no rebound in 2021, according to McKinsey — this latest expansion of our panel is another step on our continued journey to provide more visibility into business revenue.

Another priority is reducing the bias in our card panel.

Reducing bias in our card panel

Our original card panel comes from the largest credit and debit card issuers, laying a broad foundation for our dataset. But it also skews our data toward general purpose credit cards and debit cards, when payment methods like prepaid cards are becoming more common.

In a recent survey from Discover, 75% of respondents said they had used, bought or received at least one type of prepaid card over the past year.

Further, a panel tipped toward cards through traditional bank relationships brings a demographic skew.

About 19% of the U.S. population is unbanked or underbanked (they have no bank relationship or their needs require bank alternatives), and these rates were higher among adults with lower income or less education, and Black and Hispanic adults, according to the Federal Reserve’s Economic Well-Being of U.S. Households in 2021 report.

This latest data enhancement introduces net-new card types to our dataset, like payroll and government cards and gift cards, bringing fuller coverage of transaction activity from unbanked and underbanked populations. And the introduction of flexible spending and health savings account (FSA/HSA) cards will start providing more insight into revenue trends at doctor and dental offices.

What this means for you

Revenue accuracy will continue to improve over the next few months as we onboard this new data into our panel:

Total lift: All businesses will see more accurate revenue estimates and trends
Healthcare boost: Businesses in the healthcare industry (dentists, doctors, pharmacists, vision stores) will see a strong boost in accuracy, due to the addition of FSA/HSA cards
More inclusive coverage: Merchants that serve lower-income communities will see improved revenue coverage

As we add additional data into our panel, we continue to monitor and adjust our projection factors such that our revenue estimates become more accurate and don’t balloon.

This enhancement aligns with our goal to deliver continuously improving data, seamlessly. For existing customers, there’s no additional cost or action you need to take to begin benefiting from this improvement. Unless you’re on an isolated version, these changes will start showing up in the Enigma data you receive.

Improving Our Research Velocity With lakeFS

Ryan Green — Tue, 01 Nov 2022 00:00:00 GMT

In every software engineering problem I’ve worked on, I’ve noticed a recurring tension between two highly desirable properties: flexibility and robustness. But in each situation, this tension manifests itself in different ways.

At Enigma, our goal is to build a complete set of authoritative profiles of U.S. small businesses. This requires us to integrate hundreds of different data sets and to continually create and refine machine learning–based algorithms in a highly research-driven development process.

For us, flexibility means we continually integrate new data sources and algorithms that we need to rapidly experiment with and validate. Robustness means running a complex data pipeline at scale while spending minimal time on maintenance.

The tension between flexibility and robustness arises when we’re excited about a potential research breakthrough and want to rapidly test out the effects it has on our data asset. We want to quickly deploy the change in an isolated data pipeline and measure the results against what’s currently running in our production environment.

In this post, I’ll discuss the various approaches we tried and why we’re now using data branching to address this tension.

Previous Approach

We initially tried to resolve this by having distinct production and dev pipelines. We deployed code to the dev pipeline from separate git branches and maintained copies of the data in distinct namespaces. This solution delivered a high degree of robustness, but at the cost of the flexibility we needed. The main problems we encountered were:

We needed to keep the data between the dev and prod pipelines in sync. At best, this required us to copy large data files from prod to dev. At worst, it required us to re-compute results on dev unrelated to the experiment we were running.
Data scientists conducting research needed to understand the state of our dev pipeline prior to running experiments and comparing them to prod. As a result, we frequently ran time-consuming experiments only to discover we couldn’t use the results. In practice, data scientists required the help of data engineers to run these experiments. This reduced data engineering teams’ velocity and limited data scientists’ autonomy.
As our team grew, we experienced increased contention in our dev environment when we wanted to run multiple experiments at the same time. Different data scientists would need to wait for the dev environment to “free up” before they could test their changes.

New Approach: Data Branching

Earlier this year, we began to explore lakeFS for data branching as a way to resolve this tension. Data branching overlays a git-like abstraction on top of the physical data. A data branch is a logically isolated view of the data that can be modified and merged into other branches.

Data branching makes it trivial for researchers to create an environment based on the latest production data. With a simple command, a data scientist can create an isolated data branch for their experiment that’s guaranteed to be identical to production except for the specific changes they make. This empowers data scientists to work independently of data engineers.
Data branching resolves issues of environment contention by allowing for the creation of isolated experimental environments (each experiment runs on a different branch).

LakeFS branches solve the isolation challenge in a straightforward way. Today, every developer and researcher creates separate data branches, each of which includes a complete snapshot of the prod data (at no additional storage cost). You can make your change and review its impact on the final data set without fear of interfering with someone else's work or polluting the production data.
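To make the "complete snapshot at no additional storage cost" idea concrete, here is a toy in-memory model of data branching. It is only a sketch of the general pointer-copy technique, not lakeFS's API or implementation; the class `ToyRepo`, its methods, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model: a branch is just a mapping from path to stored object ID."""

    objects: dict[str, bytes] = field(default_factory=dict)            # object ID -> bytes
    branches: dict[str, dict[str, str]] = field(default_factory=dict)  # branch -> {path: object ID}

    def write(self, branch: str, path: str, data: bytes) -> None:
        # Every write stores a new object and repoints the path on one branch only.
        object_id = f"obj-{len(self.objects)}"
        self.objects[object_id] = data
        self.branches.setdefault(branch, {})[path] = object_id

    def create_branch(self, name: str, source: str) -> None:
        # Copy only the pointers; no stored bytes are duplicated.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch][path]]


repo = ToyRepo()
repo.write("main", "profiles.parquet", b"prod profiles")
repo.write("main", "revenue.parquet", b"prod revenue")

# Branching creates a full logical view of main without storing new objects.
repo.create_branch("experiment", source="main")
objects_after_branching = len(repo.objects)

# The experiment changes one file; main keeps pointing at the original object.
repo.write("experiment", "revenue.parquet", b"experimental revenue")
```

In this model the experiment branch sees every production file, but only the file it rewrote takes up new storage, and readers of main never see the change.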

It’s also much easier for us to run parallel pipelines and maintain stable pipelines for customers who want to upgrade at a slower cadence.

Additional Benefits

In addition to making experiments easier, we saw other benefits for our production pipeline:

Branching for validation: Our data pipeline consists of a series of stages. Between each stage, we run validation logic before promoting the results to the next stage. We realized we could replace this hand-crafted promotion logic by running our candidate pipeline on a branch and merging this branch onto main if validation succeeded.
Data set tagging: Another challenge we had was determining which data set versions contributed to our final data asset. Tagging a branch provided clear semantics about the complete set of intermediate data sets that went into the pipeline. This is extremely helpful when diagnosing issues and anomalies on the final data asset.
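The branching-for-validation pattern can be sketched in a few lines. This is an illustration of the idea under simplifying assumptions (the data set is a plain dictionary, and a "merge" simply replaces main), not our pipeline code; every name below is hypothetical.

```python
from typing import Callable

Dataset = dict[str, list[int]]


def run_stage_on_branch(
    main: Dataset,
    stage: Callable[[Dataset], Dataset],
    validate: Callable[[Dataset], bool],
) -> tuple[Dataset, bool]:
    """Run `stage` on an isolated copy of `main`; promote the result only if it validates."""
    candidate = stage({name: list(rows) for name, rows in main.items()})  # the "branch"
    if validate(candidate):
        return candidate, True   # "merge": the candidate becomes the new main
    return main, False           # discard the branch; main is untouched


def add_revenue(data: Dataset) -> Dataset:
    data["revenue"] = [100, 250, 75]
    return data


def add_bad_revenue(data: Dataset) -> Dataset:
    data["revenue"] = [100, -5, 75]  # a negative value that validation should reject
    return data


def revenue_is_non_negative(data: Dataset) -> bool:
    return all(value >= 0 for value in data["revenue"])


main_data: Dataset = {"business_id": [1, 2, 3]}
main_data, promoted_good = run_stage_on_branch(main_data, add_revenue, revenue_is_non_negative)
main_data, promoted_bad = run_stage_on_branch(main_data, add_bad_revenue, revenue_is_non_negative)
```

The second run fails validation, so its output never reaches main and the previously promoted results stay in place.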

Build vs. Buy

After briefly considering implementing this ourselves as a metadata layer using git branches, we decided to partner with lakeFS to provide our data branching solution. There were a few reasons for this:

LakeFS is 100% focused on providing a solution for data branching. We like their focus on solving this one problem extremely well.
We are impressed by the caliber of people at the company, including their leadership and technical talent. My experience is that smart people with high personal integrity and a clear focus are best positioned to solve difficult problems.
They launched an open-source solution two years ago, so the core capability had been battle tested. They were in the process of releasing a cloud hosted version of their product, which meant less maintenance and support for our internal team.

Achievements to Date

Over the past three months, we fully migrated our data pipelines to lakeFS. Overall, it’s been a successful partnership. For the most part, the product has met or exceeded our expectations. Where it hasn’t (mainly response time on certain endpoints) the lakeFS team has been fully engaged in turning around solutions — often in a matter of days. The attentiveness and sense of urgency is refreshing.

Moving to a data branching solution has paid off quickly for us. Within a few days of completing the migration, we had already reduced testing time by 80% on two different projects. And we’re excited to see how data branching increases our product velocity in the coming quarter.

Our Compensation Philosophy: Transparency, Uniformity, Fairness

Stephanie Spiegel — Fri, 28 Oct 2022 00:00:00 GMT

In this post, we’ll share an overview of our compensation philosophy and how we approach performance management by measuring impact.

Before jumping into Enigma’s philosophy, a quick 101 on “total rewards.” When we think about compensation, we generally only think about base salary, but total rewards is much more than that. Total rewards comprises a number of inputs including base salary, bonuses, equity, benefits, retirement savings, and many other non-cash compensation factors, like flexible PTO and hybrid or remote working. It’s all the forms of pay and benefits you receive from a company.

At Enigma, we’re really proud of our total rewards package. If you’re interested in a high-level overview of what we offer, have a look here.

Enigma’s Compensation Philosophy

Now that we understand total rewards, let’s dive into Enigma’s Compensation Philosophy. Here we’re guided by three principles:

Transparency: We share all relevant information on compensation openly with the company.
Uniformity: We use the same framework for compensation and it is applied throughout the company.
Fairness: Every effort is made to compensate people exclusively for the impact they deliver.

With these principles in mind, we’ve built a performance management system centered on individuals’ impact and competencies.

Five Levels of Impact

Our top priority is to develop a successful data product and deliver on our customer commitments. The primary factor in determining compensation is the amount of impact an individual has had on enabling the success of our business.

We have five Levels of Impact. Each level represents a scaling degree of impact an individual has on achieving our company goals, and compensation is directly tied to these levels.

Level 1. Exhibits ownership and executes independently.
Level 2. Exhibits ownership and executes independently within a functional area that drives the success of a top company goal.
Level 3. Uses their mastery in a functional area to drive the success of a top company goal and influence their peers to do the same.
Level 4. Drives an area critical to the success of a top company goal and shapes the culture in their functional area.
Level 5. An essential driving force behind the success of a top company goal and shaper of the company’s culture.

Compensation Ranges

The compensation ranges, including both salary and equity, for each Level of Impact are shared with the company.

We have two matrices of salary ranges: one for those in technical roles (software engineering and data science) and the other for all non-technical roles. Across the tech industry, those in technical roles have higher salary ranges than those in non-technical roles. But despite differences in salary, Enigma has parity when it comes to equity ranges across both technical and non-technical roles.

An individual’s assessed Level of Impact is only shared with the individual and their lead (aka manager). This allows for a balance of transparency while still maintaining confidentiality on the individual level.

Performance Reviews + Competency Frameworks = Impact

There are pros and cons to every performance management system. Some are too open-ended and lack guidance; others are too burdensome to complete. Worst of all is having no system at all. We’ve worked hard to try to hit the “goldilocks” target: right-sized and easy to complete.

Over the years, our performance management process has grown and evolved. We’ve tried a variety of approaches and have been working with our current “Competency Framework” approach for 3+ years. We consistently receive overwhelmingly positive survey feedback that people feel our process is transparent, uniform, and fair. We’re really proud of this, but it takes work.

Every role at Enigma has its own tailored framework that is based on the core competencies of that role. These frameworks are crafted in consultation with the individual and team and are constantly being improved for the next cycle. We want to ensure that the frameworks measure and take into account the work our people are actually doing.

How It All Comes Together

Enigma hosts semi-annual performance reviews, which comprise a self review, stakeholder review, and a manager review. We then host calibrations at the team and leadership level. Calibration is an important part of our process because it helps to ensure everyone has been measured against the same set of standards across the company.

Lastly, we adjust compensation based on the outcome of the calibrations. Compensation adjustments, both salary and equity, work in two ways: individuals who’ve demonstrated impact are eligible for compensation increases within their impact band range and could also be eligible to move to a higher Level of Impact.

Final Thoughts

We’re proud of the transparent, uniform, and fair system we’ve built over the years and continue to invite feedback on it.

Want to work for a company that values impact, transparency around compensation, and thoughtful performance management? Check out our open roles here.

How Restaurant Awards Affect Revenue

Enigma — Tue, 25 Oct 2022 00:00:00 GMT

For software and service providers who work with restaurants, external signals of growth and success, like awards and recognition, may be a route to finding new customers.

In the restaurant world, this might look like being named to a “best of” list — an honor that brings the benefits of exposure. We wanted to know: how does that exposure affect a restaurant’s performance? And is it a reliable signal of a restaurant’s success?

To find out, we explored card transaction data for restaurants that made The New York Times’ 2021 Restaurant List.

Analyzing 2021 Listmakers

We analyzed a sample of 20 of the restaurants from The New York Times’ 2021 Restaurants List (published October 21, 2021) to see what, if any, revenue impact came after appearing on the list.

We found that the range of revenue changes across the sample was broad, and roughly two-thirds of listmakers in our sample saw 2021 revenue growth compared to the same period in 2019.

When we compared the group of listmakers’ revenue changes from Q3 to Q4 2021 to the same period in 2019, we found that 65% of restaurants grew after making the list and 35% declined.

Across the sample, revenue changes ranged from 65.97 percentage points on the upside to -850.92 on the downside.

Next we looked closer at a few establishments that saw positive impacts.

This Charleston-based soul food restaurant saw a revenue spike the month the list was published. Monthly card revenues in the second half of 2021 increased nearly 60% over the same months in 2019.

For this northeastern Italian cuisine spot in Colorado, the change in their Q3/Q4 2021 revenue was over seven percentage points higher than in the same period of 2019.

The Takeaway

Our takeaway: a restaurant’s revenue trends after an award or recognition don’t follow a set pattern – especially in today’s post-pandemic environment.

So while “best of” lists are still a great way to find the best cuisine in a new city, they may not be your most valuable signal when it comes to finding and serving restaurants as small business customers.

A clearer signal? A restaurant's historic card revenue trends. As of publication date, Enigma’s card revenue data on restaurants includes more than 1.4 million U.S. locations that have shown card revenue activity in the past year. Our data is updated monthly, built from swipes of over 700 million anonymized cards, and goes back to 2017.

For more on “best of” lists, restaurants, and exploring revenue impact, watch the insights video.

Ready to explore the data yourself? Try a sample of our U.S. restaurants data on Snowflake.

Methodology

We analyzed a sample of 20 geographically diverse restaurants from The New York Times’ 2021 Restaurants List, which was published in October 2021.

We compared sample listmakers’ monthly card revenues from Q3 to Q4 2019 to the same period in 2021. We omitted 2020 data because of pandemic impacts.
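One reading of this comparison is: take each restaurant's Q3-to-Q4 revenue change in 2021 and subtract its Q3-to-Q4 change in 2019, giving a difference in percentage points. The sketch below illustrates that reading only; the function names are ours and the revenue figures are hypothetical, not taken from the sample.

```python
def quarter_change_pct(q3: float, q4: float) -> float:
    """Percentage change in revenue from Q3 to Q4 of the same year."""
    return (q4 - q3) / q3 * 100.0


def change_vs_baseline_pp(
    q3_2019: float, q4_2019: float, q3_2021: float, q4_2021: float
) -> float:
    """Q3-to-Q4 change in 2021 minus the same change in 2019, in percentage points."""
    return quarter_change_pct(q3_2021, q4_2021) - quarter_change_pct(q3_2019, q4_2019)


# Hypothetical quarterly card revenue (USD) for one restaurant
difference = change_vs_baseline_pp(
    q3_2019=200_000, q4_2019=210_000,   # 2019 baseline: revenue rises from Q3 to Q4
    q3_2021=180_000, q4_2021=207_000,   # 2021: a larger rise after the list came out
)
print(f"Change vs. 2019 baseline: {difference:+.1f} percentage points")
```

A positive result means the restaurant's late-2021 growth outpaced its own 2019 seasonal pattern; a negative result means it fell short.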

Now: Search by URL to View Omnichannel Business Revenue

Andrew Campbell — Tue, 27 Sep 2022 00:00:00 GMT

Capping off a busy season for our product team, Enigma launches an easier way for you to access our data. This is especially relevant for B2B salespeople and underwriters of business loans.

We recently launched omnichannel revenues, a major product enhancement, with initial rollout via API and batch processing. This enhancement is now available in our web application, meaning you can access this data in a simple UI—no technical integration needed.

Two other important changes to note:

We’ve improved the search capabilities of our web application by enabling you to query based on URL only.
We’ve made it more seamless to navigate between business locations and business hierarchies.

Watch the demo video here.

Why you should care

Sales professionals: It’s now a lot easier to incorporate Enigma data into your prospecting research workflows. In just 30 seconds, you can find out a business’s revenue size and growth rate so you can go into a call armed with intel.

Underwriters of business loans: We heard your feedback. You want to know the revenues of an entire business without having to research each location of the business separately. With omnichannel revenue included in our UI, we believe you can save a ton of time and have greater confidence in the risk profile of the business you’re underwriting.

Ready to learn more? Contact us now.

Every Type of Swipe: Launching Omnichannel Card Revenue

Andrew Campbell — Tue, 06 Sep 2022 00:00:00 GMT

This month, we’re launching omnichannel card revenues. This data will help you find, acquire, and manage small business customers, knowing how they perform across channels.

Our coverage keeps getting better

Previously, Enigma reported firmographics and financial data for millions of individual brick-and-mortar business locations. You could confirm the registered name, industry, and even revenue for the establishment based on its address. Here’s an example of revenue at a single business location:

However, many of our customers wanted to understand the revenue of a business across all of its locations and online channels. Here’s what revenue for the business above might look like for all locations:

In May we launched business hierarchies to give customers a comprehensive view of a small business’s financial health across multiple locations. Now, we’ve enriched that view with omnichannel revenues. This additional data ensures that you see complete revenue from card transactions across all channels.

What this means

Access to omnichannel revenues provides a more holistic understanding of business performance. While some Merchant Cash Advance (MCA) providers may give location-specific loans, most lenders need to see the financial health of the entire business. For example, if you want to know the monthly growth rate for an omnichannel business, just looking at brick-and-mortar sales provides an incomplete picture.

Omnichannel data is also crucial for understanding trends across channels. If consumer behavior changes, in-store revenues by themselves may be misleading. Adding online card revenue gives a more complete view that reveals the true growth trend.

Check out the monthly revenue of a jewelry store chain (below), comparing the sum of card transactions at all locations with omnichannel card revenue from 2017 through 2022.

Because many purchases moved online in 2020, revenue growth at brick-and-mortar locations was substantially lower than revenue growth across all channels. Anyone using our 3-month seasonally adjusted growth rate attribute to look at Q4 2020 would previously have seen a decline of 16%. Now we can see that in Q4 2020, across all channels, the business had a seasonally adjusted growth rate of 4%.

If you were prospecting or underwriting this business in 2020, and you only reviewed the in-store revenues, you might have overlooked this jewelry store as a viable customer. But, if you were looking at total omnichannel revenues, you'd have understood that they were performing as well as before.

Omnichannel revenues consistently provide a more complete picture of financial health, as seen with the pharmacy and general store shown below. If you only looked at sales in physical stores, you might believe these businesses are stagnant or declining. Omnichannel revenue shows you that these businesses are thriving.

Brick-and-mortar revenues for the pharmacy declined 3% year-over-year in 2020. However, omnichannel revenue grew 15%. Our 12-month revenue growth attribute now reflects the accurate omnichannel growth rate.

In 2021, the general store’s brick-and-mortar revenues grew around 8% year-over-year. However, omnichannel revenue growth for the same period was 35%, more than 4x greater.
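To see how the same business can shrink in-store and still grow overall, here is a minimal sketch of the year-over-year calculation. The dollar amounts are hypothetical, chosen only so the percentages resemble the pharmacy example above; the growth attributes in our product are not computed by this code.

```python
def yoy_growth_pct(prior_year: float, current_year: float) -> float:
    """Year-over-year revenue growth as a percentage."""
    return (current_year - prior_year) / prior_year * 100.0


# Hypothetical annual card revenue (USD) for one business, by channel
in_store = {2019: 1_000_000, 2020: 970_000}
online = {2019: 200_000, 2020: 410_000}

# Omnichannel revenue is the sum of all channels for each year.
omnichannel = {year: in_store[year] + online[year] for year in in_store}

in_store_growth = yoy_growth_pct(in_store[2019], in_store[2020])
omnichannel_growth = yoy_growth_pct(omnichannel[2019], omnichannel[2020])

print(f"In-store growth:    {in_store_growth:+.1f}%")
print(f"Omnichannel growth: {omnichannel_growth:+.1f}%")
```

Judged on in-store sales alone, this business looks like it is declining; once online revenue is added, it is growing at a double-digit rate.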

What you can do now

Omnichannel revenues provide a more complete picture of card revenues at small businesses. Here are some of the things you can do with this new data:

For marketers:

Omnichannel revenue ensures higher accuracy when identifying businesses that fit your ideal customer profile (ICP).

Capture card revenue from both online and offline sources.
View overall revenue, even if the contribution from one channel decreases.
Avoid marketing to businesses that do not fit your target requirements.

For payment platforms and POS providers:

Omnichannel revenue gives visibility into current and historical card transactions.

View financial trends starting before your tools were introduced.
See growth from channels outside of your platform.
Offer MCA products or promotional rates to qualified businesses sooner.

For credit card and line-of-credit providers:

Omnichannel revenue provides better intelligence to make underwriting decisions.

View financial trends for a business without directly accessing their accounts.
Approve larger loans for qualified applicants.
Monitor existing customers for revenue changes that could impact risk.

How do I get access?

Get in touch with us to see how you can start using omnichannel revenues now.

A Guide to Optimizing Your KYB Process: In-House, Single Provider, or Waterfalled Data

Enigma — Thu, 01 Sep 2022 00:00:00 GMT

Know Your Business (KYB) requirements have only existed since 2016, but they've already proven to be one of the more operationally demanding compliance obligations that financial institutions face. The rules require verifying each business customer's identity, confirming who owns and controls that business, and monitoring risk over time — all while regulations continue to evolve.

The question isn't whether to do KYB. It's how to do it well without letting compliance overhead eat your growth margins.

Financial institutions generally have three options: handle the process fully in-house, work with a single outsourced service and data provider, or waterfall multiple data providers through an orchestration platform. Each has genuine trade-offs, and the right answer depends on your institution's risk profile and business objectives.

Option 1: Fully in-house KYB

Many smaller financial institutions build their KYB processes entirely in-house rather than relying on a specialized provider. Internal KYB teams build the technology infrastructure and hire compliance staff to approve or deny businesses according to company policy.

The process

Invest. An institution uses its engineering team to build a bespoke approval infrastructure, or hires an operations or compliance team to manually review applications. Most end up investing in both: an automated layer and a manual team to handle what the automation can't.

Collect identifying information. As part of the Customer Identification Program (CIP), institutions must collect basic information: business name, address, and Tax ID number (TIN). For legal entity customers, they must also collect information on beneficial owners and at least one control person. A beneficial owner for KYB purposes is anyone with at least a 25% ownership or voting stake — these individuals are sometimes called ultimate beneficial owners (UBOs).

Verify businesses. Internal teams or data infrastructure pull data needed to verify identity: names, addresses, and filing details from Secretary of State (SoS) filings for basic verification, plus risky activity or financial information for more sophisticated programs. The KYB process also requires screening against OFAC watchlists.

Verify beneficial owners. In addition to verifying the entity itself, institutions must verify and screen all beneficial owners and control persons against OFAC watchlists.

Monitor over time. The CDD Rule expects KYB to be a continuous process. In-house KYB requires bespoke methods to update customer information and re-verify or re-screen businesses over time.
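The 25% beneficial-owner threshold from the collection step can be sketched as a simple filter. This is a minimal illustration, not a compliance implementation: the `Owner` record, its field names, and the sample applicants are all hypothetical.

```python
from dataclasses import dataclass

# Threshold described above: at least a 25% ownership or voting stake.
BENEFICIAL_OWNER_THRESHOLD = 0.25


@dataclass(frozen=True)
class Owner:
    """Hypothetical record for a person listed on a business application."""
    name: str
    ownership_share: float  # fraction of the business, 0.0 to 1.0
    is_control_person: bool = False


def people_to_verify(owners):
    """Return everyone who must be verified and screened.

    Includes each owner at or above the threshold, plus any control person.
    """
    return [
        o for o in owners
        if o.ownership_share >= BENEFICIAL_OWNER_THRESHOLD or o.is_control_person
    ]


# Hypothetical application data.
applicants = [
    Owner("Owner A", 0.60),
    Owner("Owner B", 0.25),
    Owner("Owner C", 0.15),
    Owner("Manager D", 0.0, is_control_person=True),
]
selected = [o.name for o in people_to_verify(applicants)]
print(selected)
```

Owner C falls below the threshold and is skipped, while Manager D is included as a control person despite holding no stake.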

Pros

Control. You know your business and risk profile better than any third-party provider. Building in-house lets you deeply customize the process for your needs.
Good for simple programs. If you have a limited risk profile and only need to verify a small number of customers, the cost of setting up an external partnership may not be justified.

Cons

Lack of expertise. Building KYB infrastructure often falls to engineering teams without compliance backgrounds — or compliance teams without data infrastructure experience.
Limited auto-approval coverage. Specialized data providers build their entire engineering capability around matching and approving businesses. An in-house system will typically be slower and less accurate, which can lead to customer drop-off when applicants choose faster alternatives.
Costly. Whether you build internal infrastructure, hire a large manual review staff, or both, the overhead for internal KYB is significant.
Time-consuming. The operational investment required to build and run an in-house program comes at the expense of building core financial products.

As Liam Chennells, CEO of end-to-end KYB platform Detected, noted in a recent Enigma webinar, many institutions aim for 100% automation as a north star, but small, incremental improvements across a KYB process can also make an impact. Internal KYB makes sense for institutions that can develop a semi-automated process that meets their compliance goals. Broader automation, and using KYB as a driver of client growth and retention, may require a partner.

Option 2: Single outsourced service and data provider

Some institutions work with one outsourced service and data partner — either to supplement an existing in-house program or to build their onboarding process from scratch.

The process

Invest. The institution partners with a single provider, paying a setup fee and annual licensing fees to access the provider's data on an ongoing basis. The provider typically handles auto-approvals, removing the need to build that infrastructure in-house. The institution still needs to handle manual approvals for businesses the provider can't auto-verify — either in-house or through an additional manual review service the provider offers.

Verify businesses. The provider auto-approves businesses and flags others for manual review: businesses without an SoS filing, those with mismatched name or address, businesses in high-risk industries like cannabis or adult entertainment, and potential OFAC matches.

Verify UBOs. The provider pulls the data needed to verify UBOs and screens them against the OFAC list.

Monitor over time. The provider and institution work together to update customer information, periodically re-check SoS registration statuses, re-screen for risky activities, and re-screen against the OFAC list.

Establish trust. Institutions can validate their provider's accuracy by periodically sampling a set of auto-approved businesses to confirm the approvals are correct.
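The periodic spot check in the last step could look something like the sketch below. The business IDs, sample size, and reviewer outcomes are made up for illustration; a real program would choose the sample size and acceptance criteria according to its own policy.

```python
import random


def sample_for_audit(auto_approved_ids, sample_size, seed=None):
    """Draw a random sample of auto-approved businesses for manual re-review."""
    rng = random.Random(seed)
    ids = list(auto_approved_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def observed_accuracy(review_results):
    """Share of sampled approvals that a reviewer confirmed as correct.

    `review_results` maps business ID -> True if the approval was confirmed.
    """
    if not review_results:
        raise ValueError("no reviewed businesses")
    return sum(review_results.values()) / len(review_results)


approved = [f"biz-{i}" for i in range(1000)]  # hypothetical IDs
audit_batch = sample_for_audit(approved, 50, seed=7)

# Hypothetical reviewer outcome: pretend one sampled approval was wrong.
results = {biz_id: True for biz_id in audit_batch}
results[audit_batch[0]] = False
print(f"Confirmed approvals: {observed_accuracy(results):.0%}")
```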

Pros

The middle path. A single provider partnership maintains some internal control while delivering a customized solution with one dedicated partner.
Focus on core competencies. Outsourcing compliance data work lets your team focus on your actual product and services.
A tight partnership. Working closely with one team builds communication and trust over time.
Reduce overhead and costs. With more auto-approvals, you onboard more clients with less effort. Companies using Enigma as their sole KYB provider are estimated to reduce KYB costs by 80%.

Cons

The middle path. You give up some of the control that comes with fully in-house KYB, and you get access to a narrower data set than you'd have with multiple waterfalled providers.
Onboarding time. Any new partnership requires time to integrate the platform and train your team on new tools and data.

Option 3: Waterfalled data providers via orchestration platforms

Many institutions work with multiple data providers through a third-party orchestration platform, sequentially passing business applications through multiple data sources until a match is found. Platforms like Alloy or Oscilar integrate multiple providers into one KYB decisioning endpoint. The waterfall sequence is typically designed around cost and approval time (latency), on the assumption that provider accuracy is roughly equivalent.

The process

Invest. The institution invests in the data aggregation platform, which uses multiple sources for auto-approvals. Manual approval still needs a separate solution.

Verify businesses. The platform attempts to verify a business using the first data provider in the sequence. If that provider can't match the business, it passes to the next, and so on. This approach typically produces higher match rates and more data on risky activities than any single provider alone. Businesses that can't be auto-approved go to manual review.

Verify UBOs. The platform auto-approves UBOs using data from multiple providers. KYB rules allow institutions to trust self-reported UBO information unless they have specific reason to doubt it — for example, when an owner name on the application doesn't match the owner name in SoS filings. The platform also screens UBOs against the OFAC list.

Monitor over time. The platform checks SoS registration statuses, re-screens for risky activities, and re-screens against the OFAC list on a periodic basis.

Establish trust. Similar to single-provider monitoring, institutions can conduct monthly checks on individual sources to confirm auto-approval accuracy.
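The waterfall logic described in the verification step can be sketched in a few lines. This is a simplified illustration rather than how any particular orchestration platform works: the `Provider` type, the provider names, and the costs are hypothetical, the lookups are in-memory dictionaries standing in for real API calls, and a first match is treated as an approval without the OFAC or high-risk checks a real program would add.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Provider:
    """Hypothetical data provider in the waterfall."""
    name: str
    cost_per_lookup: float
    lookup: Callable[[str], Optional[dict]]  # returns a match, or None


def run_waterfall(business_name, providers):
    """Try providers from cheapest to most expensive; stop at the first match."""
    tried = []
    for provider in sorted(providers, key=lambda p: p.cost_per_lookup):
        tried.append(provider.name)
        match = provider.lookup(business_name)
        if match is not None:
            return {"status": "auto_approved", "source": provider.name,
                    "match": match, "tried": tried}
    # No provider could match the business: route it to manual review.
    return {"status": "manual_review", "source": None,
            "match": None, "tried": tried}


# Toy in-memory "providers" standing in for real API calls.
cheap = Provider("provider_a", 0.10,
                 {"Acme Bakery LLC": {"sos_status": "active"}}.get)
pricey = Provider("provider_b", 0.40,
                  {"Zenith Tools Inc": {"sos_status": "active"}}.get)

first = run_waterfall("Acme Bakery LLC", [pricey, cheap])
second = run_waterfall("Zenith Tools Inc", [pricey, cheap])
third = run_waterfall("Unknown Co", [pricey, cheap])
print(first["tried"], second["tried"], third["status"])
```

Sorting by cost mirrors the point above that the sequence is usually designed around cost and latency. If provider accuracy differs meaningfully, the sort key is the place to account for it.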

Pros

Maximum coverage. More data sources means more coverage for auto-approvals.
Further reduced overhead and costs. Like a single provider, waterfalling lets you auto-approve more businesses and onboard faster. The effect multiplies across providers. Enigma, for example, cuts costs an extra 50% for institutions already using another provider.
Adaptable. A platform built for multiple providers can accommodate new data sources as legislation or needs change.
Goes beyond KYB. A data waterfall infrastructure that's already set up for KYB can also support fraud checks, risk checks, and underwriting.

Cons

Multiple parties to manage. Waterfalling multiple sources means establishing trust and communication with each provider. You may lose some customization as you incorporate more vendors.
Overkill for simple programs. If you only need one or two data sources to meet your requirements, a full waterfall architecture may be more complexity than you need.

Heidi Hunter, CPO of identity verification data provider IDology, praised the data waterfall approach for "smaller organizations looking to move quickly into the [identity and KYB] space," adding: "layering those capabilities, you are able to get information from many different sources to give you a clearer view [and] drive ROI."

How to choose the right approach

The decision comes down to your institution's risk profile, growth objectives, and operational capacity.

Want absolute control? Keep the process fully internal.
Want one close partner who handles the complexity for you? A single service and data provider is the middle path.
Want maximum auto-approval coverage and flexibility for future growth? Waterfall multiple providers.

Alloy CPO Parilee Wang put it well: "We're getting access to new types of data that can change how you make decisions. There are new vectors of fraud coming daily at this point. There are new technologies that folks can take advantage of. So the one thing I advocate for very strongly is the value of flexibility, and cheap flexibility."

The right solution today may not be the right solution in six months, especially given the pace of change in KYB regulation. In 2022, FinCEN established a beneficial ownership registry requiring certain corporations, LLCs, and similar entities to report their ultimate beneficial owners to the federal government. The INFORM Consumers Act extended identity verification requirements to e-commerce platforms as well. Legislation in this space continues to evolve.

A quick comparison

	In-House	Single Provider	Waterfalled
Control	High	Medium	Lower
Auto-approval coverage	Limited	High	Highest
Cost	High overhead	Up to 80% savings	Additional 50% savings
Time to implement	Long	Medium	Medium
Flexibility	Limited	Medium	High
Best for	Simple, limited-volume programs	Most financial institutions	High-volume, complex programs

If you're working toward a single provider or waterfall approach, see our KYB Requirements Checklist for the full list of data points you need to collect. And for a closer look at what instant approval rates actually look like in practice, read How Enigma KYB Cuts Compliance Costs by Up to 80%.

Thinking about where Enigma fits in your KYB process? Learn more about Enigma KYB or get in touch with the team.

Enigma Revenue Data vs. Reported Revenue: Comparing 4 Brands

Enigma — Thu, 11 Aug 2022 00:00:00 GMT

To find and prioritize your ideal small and medium business (SMB) customers, getting visibility into their financial health is key.

An accurate view of an SMB’s financial health helps you tailor your products and services offering for best fit – whether that’s approving or extending a line of credit, customizing your marketing efforts, or focusing Sales’ attention on your fastest-growing customers.

For certain types of businesses, card revenues – income from card-based transactions – can give you an incredibly accurate portrait of a business’s overall financial health.

The Accuracy Test: Reported vs. Enigma Revenue Data

As we continue to improve the accuracy of our card revenue data, we look for publicly available, reported company revenue data to validate our monthly card revenue data. Reported revenue data tends to be about large, public companies, and our specialty is small businesses. But we apply the same data science techniques to the bigger brands in our dataset, so the accuracy will be on par with the millions of small companies we cover.

We took a look at reported revenues for a handful of different companies you’d recognize and found Enigma’s card revenue data was strongly correlated with reported revenues. Here’s how our data compared.

Online Furniture & Home Decor Retailer

Reported revenue for an online furniture and home decor retailer showed a gradual increase from Q2 2018 to Q1 2020, a dramatic, near-doubling spike in Q2 2020, and then a general tapering off, with a minor spike in Q2 2021. Enigma’s monthly card revenue data closely mirrored this pattern, with a strong correlation of .9940.

High-End Sportswear Retailer

A high-end sportswear retailer sells clothing and gear online and in stores. The company saw recurring revenue spikes in the fourth quarters of 2018, 2019, 2020, and 2021—a pattern reflected in a strong correlation with Enigma’s card revenue data (.9533).

Sporting Goods Retailer

Looking back to the start of 2018, a certain sporting goods retailer saw its biggest revenue dip in Q1 2020, with a quick rebound in Q2 2020 back to Q4 2019 levels. Like the high-end sportswear retailer, the sporting goods retailer also saw fourth-quarter spikes each year. Our card revenues were strongly correlated with reported revenues for this business (.9612).

Fast Casual Mexican Restaurant Chain

Revenues for a fast casual Mexican restaurant chain followed the smoothest upward trend line of the sample bunch, starting at nearly $1.3 billion in Q2 2018 and gradually climbing to $2 billion by the first quarter of 2022. There was a strong .9962 correlation between the restaurant chain's reported revenues and Enigma’s card revenues.
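For readers who want to run a similar comparison on their own data, here is a minimal sketch of the calculation: roll monthly card revenue up to quarters, then compute the Pearson correlation against reported quarterly revenue. The numbers below are made up for illustration and are not Enigma or company data, and this is not necessarily the exact method behind the figures above.

```python
from math import sqrt


def to_quarterly(monthly):
    """Sum consecutive groups of three monthly values into quarterly totals."""
    if len(monthly) % 3 != 0:
        raise ValueError("expected a whole number of quarters")
    return [sum(monthly[i:i + 3]) for i in range(0, len(monthly), 3)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only.
monthly_card_revenue = [10, 11, 12, 12, 13, 14, 20, 22, 24, 18, 17, 16]
reported_quarterly = [40, 47, 80, 62]

card_quarterly = to_quarterly(monthly_card_revenue)
r = pearson(card_quarterly, reported_quarterly)
print(f"correlation: {r:.4f}")
```

Note that correlation measures whether the two series move together, not whether they are equal in level: card revenue can be a fraction of reported revenue and still correlate strongly.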

What These Accuracy Results Mean for You

The revenues for these sample companies may be far larger than those of your small business customers. But the data science process we use to aggregate their revenues is the same, so you can expect high accuracy in our small business revenue data.

The results mean that, like the risk and underwriting teams who rely on our data for high-stakes credit decisions, you’ll have the timely signals you need to make decisions about your small business prospects and customers with confidence. Which of your customers are growing fastest? Which are in distress? On which accounts should your Sales reps invest their time?

The Value of Actual SMB Revenue Data

Revenue data in the market tends to fall into two buckets:

Modeled revenue is an estimated figure or range based on a set of assumptions, oftentimes taking into account business revenues, headcount, and web traffic. At many companies, these models are not updated very frequently.
Actual revenue is derived from real activity — like the transactions captured in card revenue data.

Enigma’s card revenue data comes from actual debit and credit card transactions at a business. To build a full picture of a business’s card revenues, we aggregate the revenues from all of that business’s physical locations, plus any online transactions. Revenue from third-party resellers may not always be captured.

Card revenue data is especially accurate for industries where card payments are common, like restaurants and retail. For industries that rely heavily on other payment methods, like ACH transactions, card revenues will be a lower percentage of total revenue, but can still be a helpful indicator of growth trends.

What could you do with more accurate financial data about your small business customers and prospects?

Learn more about how our customers are using Enigma data for customer management, improving underwriting models, sales prioritization, and building better prospect lists for go-to-market teams.

Product Vision: Answering Any Question About a Small Business

Scott Steinberg — Mon, 01 Aug 2022 00:00:00 GMT

Many of Enigma's customers are concerned that small business delinquencies will rise in the second half of the year, leading to a renewed focus on risk assessments. Last year many lenders were laser-focused on growth. Now, more customers want to discuss how they can step up their portfolio monitoring capabilities.

Whether the focus is on growing revenue or reducing risk, Enigma can help. We continue to develop the data platform that will deliver the most accurate and actionable intelligence about the health and identity of every small and medium business (SMB) in the US. Our goal is a one-stop platform that answers virtually any question you have about an SMB:

What are this business's revenues?
What are this business’s cash flows?
What products or services does this business provide?
What is the momentum of this business?
Does this business have any signs of distress?

We are several years into this journey and there's still an incredible amount of work ahead of us. Our vision continues to expand as we hear about more and more pain points from our customers that we think we can solve. As I often tell the product team, we’re running a marathon, not a sprint (though hopefully at a six-minute pace).

In the first half of the year, we tackled a lot of structural work, enabling us to frame SMB data at both the individual location and brand level. Our key goals were to increase business coverage and attribute granularity, providing a better indication of business health.

Imagine an emerging retail brand with three physical stores and an ecommerce shop. Our customers can now view the health of the total business (three stores + online) as well as each individual store. This is a key improvement to enable Enigma’s data to speak the same language as our customers.
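As a rough illustration of the location-versus-brand framing, the sketch below rolls per-location revenue up to a brand total while keeping each location visible. The brand name, field names, and revenue figures are hypothetical and do not reflect Enigma's actual data model.

```python
from collections import defaultdict

# Hypothetical monthly card revenue for one brand: three stores plus online.
locations = [
    {"brand": "Example Brand", "location": "Store 1", "revenue": 42_000},
    {"brand": "Example Brand", "location": "Store 2", "revenue": 35_500},
    {"brand": "Example Brand", "location": "Store 3", "revenue": 28_000},
    {"brand": "Example Brand", "location": "Online", "revenue": 61_250},
]


def revenue_by_brand(rows):
    """Sum location-level revenue into one total per brand."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["brand"]] += row["revenue"]
    return dict(totals)


brand_totals = revenue_by_brand(locations)
print(brand_totals)  # brand-level view
for row in locations:  # location-level view
    print(row["location"], row["revenue"])
```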

Beyond that, we made several other product improvements worth celebrating:

Giving customers access to up to 5 years of historical revenue information on businesses
Increasing both granularity and accuracy of our industry classification attribute
Giving customers the ability to search by business website to find matching Enigma profiles

Looking forward to the second half of the year, we will continue expanding coverage while making the data more accessible. You can expect a few themes of product improvements, including:

Several attributes that answer new questions about SMB revenues
Millions of new business profiles with revenue information
Increased accuracy of our Merchant Transaction Signals, made possible by the incorporation of additional payment data

We can’t wait to tell you more about Enigma’s evolution. As always, thanks for partnering with us on this journey.

–Scott

Chief Product Officer

Enigma Technologies Inc.

AUC is Worthless: Data Science in the Business World

Enigma — Tue, 26 Jul 2022 00:00:00 GMT

Demand for data scientists and machine learning engineers has exploded. There are plenty of programs that teach emerging technical skills, but Enigma senior data scientist Dillon Gardner points out that there’s a gap in learning how to apply those skills in real-world business settings.

In his recent conference session at PyData Global London 2022, “AUC is Worthless: Lessons in Transitioning from Academic to Business Data Science,” Dillon walks through an example of how a better understanding of business can lead to better data science outcomes.

Looking to learn from data science leaders like Dillon? Browse our open data science roles.

Industry + Revenue: Find Your Ideal Customers

Andrew Campbell — Thu, 21 Jul 2022 00:00:00 GMT

Industry and revenue are both key inputs for segmenting your small business prospects and customers. However, it’s hard for lenders to find a single data source that can provide both. Enigma has increased the accuracy and granularity of our Industries attribute, so that you can use revenue, growth, and detailed industry data to build prospect lists that exactly match your ideal customer profile (ICP).

To understand why it’s critical to have both industry and revenue data, let’s consider an example. A merchant financing company, Kappa Financing, knows their ICP is merchants who match the following criteria:

$1 million - $10 million in revenue
Average transaction size of $100 - $300
Sell home furniture or jewelry

Kappa currently uses a range of sales tools and data providers to get leads, but half of the leads are either in the wrong industry or don’t have the right transaction sizes.

Previously, Enigma could have helped Kappa precisely target based on revenue or transaction size, but not industry. Now we can.
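To make the Kappa example concrete, here is a minimal sketch of filtering a lead list on all three criteria at once. The `Merchant` record, its field names, the industry labels, and the sample leads are hypothetical. A real integration would use whatever attributes and identifiers the data source provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Merchant:
    """Hypothetical lead record."""
    name: str
    annual_revenue: float   # USD
    avg_transaction: float  # USD
    industry: str


# Kappa's ICP from the example above.
TARGET_INDUSTRIES = {"home furniture", "jewelry"}


def matches_icp(m):
    """True if the merchant meets the revenue, ticket size, and industry criteria."""
    return (
        1_000_000 <= m.annual_revenue <= 10_000_000
        and 100 <= m.avg_transaction <= 300
        and m.industry in TARGET_INDUSTRIES
    )


# Hypothetical leads.
leads = [
    Merchant("Lead 1", 2_500_000, 180, "home furniture"),
    Merchant("Lead 2", 4_000_000, 150, "restaurants"),    # wrong industry
    Merchant("Lead 3", 6_000_000, 45, "jewelry"),         # ticket size too small
    Merchant("Lead 4", 750_000, 220, "jewelry"),          # revenue too low
    Merchant("Lead 5", 9_000_000, 300, "jewelry"),
]
qualified = [m.name for m in leads if matches_icp(m)]
print(qualified)
```

Leads 2 and 3 are the kind that slip through when a tool can filter on revenue but not on industry or transaction size.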

With our latest improvements to industry accuracy and granularity, Enigma can help you build prospect lists or prioritize existing customers based on both detailed financial information and industry. Enigma has invested in building data models to accurately categorize industry for years, and we’re excited to share this latest improvement with our customers.

For a deeper understanding of the update, we spoke with Jinghong Cui, Product Manager for Industries.

So what is changing with industries?

It comes down to two things. First, we’ve simplified our industry classification to be based predominantly on the North American Industry Classification System (NAICS). Previously, we were joining multiple different classification systems.

Second, we can now tag more granular industries with better accuracy. We’ve added more than 260 new NAICS codes—triple the number we had before—many of which are at a granular four- or six-digit level.

Enigma has industry data for over 30 million business locations. And for the millions of businesses where we provide card transaction data, our research indicates over 95% industry accuracy.

"For the millions of businesses where we provide revenues and growth data, our research indicates over 95% industry accuracy."

Why should lenders care about this release?

Accurate and granular industry data, along with revenues and growth, are crucial for segmenting and prioritizing prospects. You want to find customers that are a good fit for your product. Most marketing and sales teams have a clear idea of what kinds of businesses are a good fit — their challenge is identifying and reaching those businesses. That’s where better data can help.

In some cases, our latest update doubles the detailed industry data available to customers. At the same time, we’ve simplified the attribute structure so it’s easier to understand and use.

What’s the impact of this improvement?

Since accuracy and granularity are both so important in how we label businesses by industry, a major goal of this update is to increase the number of businesses we can categorize into industry codes with four- or six-digit granularity.

The graph below shows the impact of this update on Enigma's businesses where we provide card transaction data. The increase in four-digit or more granular NAICS codes means that businesses matched by Enigma now feature much more specific detail.

What is something that surprised you when working on this feature?

There are two learnings that jumped out.

One is that sometimes the way the NAICS Association classifies a business is not the same as what lenders might expect. For example, some of our customers consider a retail bakery to be a retail store, which would fall under NAICS sector 44-45. However, “retail bakery” is actually classified under NAICS sector 31-33 (Manufacturing).

The other is that some customers want even more granular industry classification than the six-digit NAICS code. For example, “wedding planning services” is currently grouped into NAICS code 812990 - All Other Personal Services. A closely related industry, “bridal gown shops,” is grouped into a completely different category, NAICS code 448190 - Other Clothing Stores. We’re hoping to address this challenge in a future release.
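Because NAICS codes are hierarchical, both points can be checked with simple string handling: the first two digits give the sector, and the number of digits gives the granularity. The sketch below is a minimal illustration. The helper names are our own, the sector table covers only the two multi-code sectors mentioned above, and 311811 is the six-digit code for retail bakeries.

```python
# Two-digit prefixes for the multi-code sectors mentioned above.
SECTOR_RANGES = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                  # Retail Trade
}


def naics_granularity(code):
    """Number of digits in a NAICS code: 2 is a sector, 6 is the most detailed."""
    if not code.isdigit() or not 2 <= len(code) <= 6:
        raise ValueError(f"not a NAICS code: {code!r}")
    return len(code)


def naics_sector(code):
    """Sector for a code, using the published range where the sector spans several prefixes."""
    naics_granularity(code)  # validates the code
    prefix = code[:2]
    return SECTOR_RANGES.get(prefix, prefix)


def is_granular(code, min_digits=4):
    """True if the code is at least as detailed as `min_digits`."""
    return naics_granularity(code) >= min_digits


print(naics_sector("311811"))  # retail bakeries fall under Manufacturing
print(naics_sector("448190"))  # other clothing stores fall under Retail Trade
print(naics_sector("812990"))  # all other personal services
print(is_granular("44"), is_granular("4481"), is_granular("448190"))
```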

In addition to industry, what other filters can customers use to generate prospect lists?

Customers can build custom prospect lists with filters based on the following Enigma data:

Card revenues
Revenue growth rate
Average transaction size
Businesses that sell products online
Geographic location
Industry

You can also use these attributes to enrich existing customer or prospect databases and prioritize businesses for marketing or sales campaigns.

Want to learn more about how Enigma's data can help you find your ideal small business customers? Get in touch for a customized demo.

Finding the Gems in Your Customer Base

Enigma — Fri, 15 Jul 2022 00:00:00 GMT

The small and medium business (SMB) lending landscape is evolving rapidly. Existing institutional players like traditional banks and financial institutions (as well as the Small Business Administration) are increasing lending alongside new players entering the market. Amazon, which has been partnering with banks and fintechs since 2011, has lent close to a billion dollars to SMBs in the U.S.

This rapid acceleration and innovation mean SMB lending is growing more competitive for lenders, and those firms with established SMB portfolios have an advantage. Customers in the risk and underwriting spaces tell us it's critical to find growing companies early and build relationships for long-term account value.

Risk and underwriting leaders: you’re uniquely positioned here. While you’re monitoring customer portfolios to mitigate risk, you’re also primed to spot opportunities: the growing “gems” in your customer base.

And to spot opportunity targets early, you need timely signals of growth.

We explored this challenge on a recent webinar. Charles Zhu, Vice President of Product, and Alexander Lee, Product Manager, unpacked:

What revenue growth data is and how it can signal business growth
Examples of promising businesses and their early growth signals, and
How leaders can use this data to find customer gems.

Here’s an overview of their discussion.

Small business revenue data: See the trends, spot the needs

Chances are, you already have small business gems in your customer base with growth patterns that suggest they're ready for new loans, credit line increases, or new products.

How can you offer your growing SMB customers additional access to the right kinds of capital at the right time, before your competitors?

Identifying growth trends early is key. Annual revenue numbers aren’t timely enough to be helpful here. But if you can get a look at a company’s monthly revenue, you can get a better sense of its health. Better small business intelligence means you could see:

Seasonal peaks: When revenues repeatedly trend higher at designated points throughout the year, a working capital loan may help a business get through a busy season.
Seasonal dips: When revenues predictably drop and bank statements show little cash on hand, a bridge loan might propel a business through an expected lull.
Long-term growth: Businesses with revenues showing sustained growth promise to have varying capital needs throughout their customer lifecycle—gems to be prioritized.
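The patterns in this list can be pulled out of a monthly revenue series with very little code. The sketch below computes year-over-year growth and finds calendar months that peak every year. The revenue figures are made up, the series is assumed to start in January and cover whole years, and real data would need more careful handling of gaps and noise.

```python
def yoy_growth(monthly, month_index):
    """Growth of one month versus the same month a year earlier."""
    if month_index < 12:
        raise ValueError("need at least 12 months of history")
    prior = monthly[month_index - 12]
    return (monthly[month_index] - prior) / prior


def recurring_peak_months(monthly):
    """Calendar months (1 = January) with the highest revenue in every year.

    Assumes the series starts in January and covers whole years.
    """
    if len(monthly) < 12 or len(monthly) % 12 != 0:
        raise ValueError("expected one or more full years of monthly data")
    peaks = None
    for start in range(0, len(monthly), 12):
        year = monthly[start:start + 12]
        top = max(year)
        year_peaks = {i + 1 for i, value in enumerate(year) if value == top}
        peaks = year_peaks if peaks is None else peaks & year_peaks
    return peaks


# Made-up monthly revenue (in thousands) for two calendar years.
revenue = [
    100, 95, 105, 110, 120, 140, 150, 145, 120, 110, 105, 115,
    150, 143, 158, 165, 180, 210, 225, 218, 180, 165, 158, 173,
]
january_growth = yoy_growth(revenue, 12)
peaks = recurring_peak_months(revenue)
print(f"January YoY growth: {january_growth:.0%}, recurring peak months: {peaks}")
```

In this toy series, revenue peaks every July and is up 50% year over year, which is the kind of combined seasonal and long-term signal the list describes.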

Risk and underwriting leaders have a unique vantage point. While you're monitoring portfolios for risk, you can also use timely signals to watch for these early growth patterns within your customer base. But where do you get these signals of revenue growth?

Card transaction data is the key

One of the best ways to find revenue growth trends is by looking at card transaction data, which is generated whenever a payment card (whether a credit, debit, small business, corporate, or charge card) is used to purchase goods or services from a business, including in “card not present” transactions.

The more society moves away from cash, the more powerful card transaction data becomes as a signal. A McKinsey study found that by the end of 2020, U.S. consumers used cash for just 28% of transactions, compared to 51% a decade prior.

Card transaction data is notoriously messy. But when it’s collected, cleaned up, and matched to businesses at scale, it builds a remarkably accurate profile of a business. More on this small business revenue data and how we aggregate it in our earlier post, “A Guide to Card Transaction Data.”

Card revenue data can offer other financial health insights, like transaction volumes, average transaction size, and ticket size. You can also get a sense of a business’s customer base, through customer transactions: is one whale spending $10,000 a month, or are 1,000 customers each spending $10 a month?
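One simple way to quantify the whale-versus-crowd question is the share of revenue that comes from the single largest customer. The sketch below uses the two scenarios from the paragraph above. The customer IDs are hypothetical, and this is just one possible concentration measure.

```python
def top_customer_share(spend_by_customer):
    """Fraction of total spend that comes from the largest single customer."""
    total = sum(spend_by_customer.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    return max(spend_by_customer.values()) / total


# Two businesses with the same monthly card revenue of $10,000.
whale = {"customer-1": 10_000}
crowd = {f"customer-{i}": 10 for i in range(1000)}

print(sum(whale.values()), top_customer_share(whale))
print(sum(crowd.values()), top_customer_share(crowd))
```

Both businesses show identical revenue, but the first depends entirely on one customer while the second has no customer accounting for more than 0.1% of sales.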

In combination, these attributes can paint a rich, meaningful portrait of a business. They can be used for signals of growth or signals of decline and risk. When we aggregate these different card transactions by business, we call it “merchant transaction data.”

In certain industries, merchant transaction data is especially strong: for example, restaurants and retail are almost entirely card-based, so this data becomes an accurate indicator of total revenue as well as revenue trend lines.

Real-life gems and hindsight

To see how this might play out with real businesses, we analyzed our data on a few companies from the headlines that have been on a growth path. We’ve overlaid key announcements onto the charts, like fundraising rounds.

With hindsight, we wanted to explore: what growth signals were present before the company’s success was broadly known?

Company A is a fast casual salad chain established in 2007.

Long before its dramatic card revenue increases in 2019, we can see the company had about 50% year-over-year growth back in 2017 and 2018. Those growth signals would’ve shown up in the data before it was public knowledge that, say, the company was looking to open new store locations or secure the fundraising round it ultimately announced in November 2018.

We can also get a sense of performance through the pandemic. Revenues fell dramatically in early 2020 – then bounced back post-pandemic, leading up to the company’s IPO in November 2021.

We can see that card revenues for this salad chain tend to spike in the summer months compared to the winter months (prep for beach season?). Looking at card revenues at a granular, monthly level can help us understand seasonality trends for certain businesses and consider their growth needs through that lens.

Company B started out as an affordable online shop for glasses and now offers a full range of vision care products and services, in-store and online.

Aside from a pandemic dip from April to August 2020, we can see the company’s card revenues have built steadily through its IPO in September 2021. We see modest revenue spikes each December, perhaps reflecting customers spending their use-it-or-lose-it vision benefits. During this timeframe, media coverage tells us that the company was steadily growing its brick-and-mortar footprint as it expanded products and services.

These examples may be larger companies than the small businesses in your portfolio. But they’re meant to illustrate how granular monthly revenue data can help you catch your customers’ signals of growth earlier — and respond proactively to better serve their evolving needs.

Small business intelligence for a changing landscape

As the small business lending landscape continues to evolve, organizations with existing SMB customer portfolios will have an advantage. The key is tapping into businesses’ growth signals so you can determine who’s ready for additional products and services — or will be soon. That’s a tough read from an annual revenue figure alone.

Card revenue data is a powerful tool for helping you to better understand SMB financial health and uncover the most promising businesses already on your books.

This article is based on an Enigma webinar. Watch the replay now.

Learn more: how a top-10 SMB lender found 70,000 gems in its customer base with Enigma data.

Package Management: Exploring New Map Layers

Robert Grimm — Thu, 23 Jun 2022 00:00:00 GMT

This is the final post in a three-part series on the wondrous world of package management. Catch up on post #2 here if you missed it.

So far along our journey through the world of package management, we’ve stayed focused mostly on the major technical challenges. For most of us software engineers, that falls squarely within our comfort zone and we might be tempted to stay put. That would be a major mistake.

If you tried to identify what makes Go’s package ecosystem simpler and more usable, you’d probably start by listing some technical achievements. First, Go’s package manager has the benefit of hindsight: its designers could observe the consequences of building on a version resolution problem that is NP-complete. Second, Go’s alternative algorithm is a genuinely clever one that maximizes what’s possible in linear time.

But neither point comes close to explaining why Go’s approach works in practice and has been accepted by its community. To do that, you’d have to look at the social practices around maintaining and consuming software packages — which inform the design of Go’s package ecosystem throughout.

In this post, we’ll shift away from the mostly technical discussion and instead talk about two additional map layers in the world of package management: the social and economic considerations. Both layers cover the entire world of package management and hence are critical for understanding it. They differ in that social aspects serve as dragonbane, whereas economic aspects serve as dragonwort.

Dragonbane: Social Aspects of Package Management

Contemporary reconstruction of the world as mapped by Anaximander or Ἀναξίμανδρος (610–546 BCE). Source: Wikimedia, public domain.

In part two, you’ll recall that we built a dependency checker to identify version constraints in our repositories, and that we got pushback within the team against fixing every single affected repository. That contention deserves further exploration because, by focusing on what matters, we hopefully can draw a better map of package management. That way, we might just be able to follow in the footsteps of Anaximander, who over 1,110 years before Isidore of Seville drew a map of the world that is significantly more accurate and detailed, as you can see above. The center of that map happens to be Miletus, Anaximander’s home.

Package management became a thing during the early days of open source, between 1993 and 1994 CE, largely as a more humane alternative to building software yourself. It emerged for the FreeBSD, Linux, and OpenBSD operating systems as well as the Perl programming language, all of which are open source. Open source software eliminates some aspects of commercial software distribution, such as payment processing or copy protection, that significantly complicate the latter. But it also requires significant user expertise, involves many more stakeholders across many more organizations, and has attendant technical, operational, legal, and economic risks. With this understanding, package managers primarily help with coordination and risk management. It’s only by doing so that they enable the fine-grained composition of software components (“packages”).

In other words, package managers are tools for addressing social problems.

In his blog post on package management, Sam Boyer presents package managers as tools that seek to minimize harm in the presence of significant risks and uncertainties. He builds on a blog post by Julia Evans, which probably first applied the concept of harm reduction to software development practices. Harm reduction is the proven public health practice that seeks to make undesired outcomes from, say, recreational drug use less likely instead of rigidly rejecting anything but abstinence as acceptable behavior. Boyer argues that’s also the primary function of a package manager.

While I believe that Evans and Boyer make a convincing case, I would like to offer a second, complementary perspective: package management is an exercise in communicating and managing expectations. When compared to other engineering disciplines, software development stands out for its constant and far more rapid change. So a primary criterion for development tools is whether they help manage that change and isolate humans from negative consequences as much as possible.

Semantic versioning is a great example of a convention that does just that: a patch release implies a bug fix, a minor version release implies new features, and a major version release implies backwards-incompatible changes. But communicating the latter is not enough: backwards-incompatible changes are almost guaranteed to cause significant work and disruption. Go’s package manager builds on that insight and shifts some of the pain of major version releases back to where it belongs: to the package maintainers instead of the package consumers.

The contrast to Python’s package ecosystem is stark. Many of the standards for Python’s package ecosystem, including those for version numbers and constraints, really are piecemeal green-field exercises in specification writing that aren’t anchored in practical requirements but optimize for abstract notions such as extensibility and flexibility. That’s bad enough but, worse, the maintainers of the primary package manager, pip, do not even use their own tool.

The practice of using one’s own tools — so-called dog-fooding — is an important systems-building practice because it provides the developers with feedback early and often. I have certainly used it while building my parser generator featuring modular syntax. It is also used by cargo, the package manager for Rust; maven, the package manager for Java; as well as npm and yarn, the package managers for JavaScript. It is a tremendous missed opportunity for pip's developers and it probably explains many of the peculiarities and pathologies of the Python package ecosystem.

Dragonwort: Economic Considerations

Contemporary reconstruction of the world as mapped by Hecataeus of Miletus or Ἑκαταῖος ὁ Μιλήσιος (ca. 550–476 BCE). Source: Wikimedia, public domain.

We are on a roll mapping out the non-technical challenges of package management. So in the spirit of Hecataeus, who significantly improved on Anaximander’s work in mapping the world around Miletus, it’s time to explore economic aspects of package management.

Most package ecosystems don’t just trade in open source projects but also do so under the permissive MIT and Apache 2.0 licenses — reaching almost 63% of packages, according to one recent survey. That makes using these same open source packages an attractive proposition for corporations as well. As a direct result and not surprisingly, package managers have become a standard tool for software development in general. To leverage this critical infrastructure for commercial software development, enterprises such as Enigma interface with the commons through a private proxy that is backed by the public registry. Internal packages are published to the proxy only, which ensures that they remain invisible to the outside world.

For instance, our small business data processing pipeline integrates 220 external open source Python packages that way — 63 of which are direct dependencies of our internal tasks and libraries. The relatively small number of external dependencies probably is a result of our pipeline using a few huge packages, including NumPy, Pandas, and PySpark that cover most needs. Furthermore, Python developers thankfully don’t tend to follow the micro-package dogma of, say, Node.js’ package ecosystem. For comparison, I wrote the custom static website generator for my personal website — a much simpler piece of software — in Node.js and very much sought to minimize external dependencies, preferring to code as much as I could myself. Yet my static website generator still requires 375 external packages.

That a gift economy would become foundational to the hyper-capitalist technology industry is nothing short of astonishing. It certainly helps that, thanks to the unprecedented rate of change in computing technology, source code by itself, when not maintained by engineers, rapidly diminishes in economic value. At the same time, the contrast between the collaborative open source commons and the highly competitive industry it enables has also resulted in contentious and even exploitative practices. One outcome is that relatively simple or low-level libraries serving as convenient but nonessential building blocks are shared liberally — hence the increasing fraction of packages using permissive open source licenses. Yet complete and sophisticated services such as databases or search engines are shared only under newly restrictive licenses.

Some developers of open source software are increasingly chafing at these economic disparities and have taken to sabotaging their own packages, for example, by removing all source code from the repository or by replacing a package’s functionality with something less useful if not outright dangerous. That has resulted in considerable disruption, especially within the Node.js ecosystem, with much anger directed at the responsible developers. Personally, I can empathize with the frustration felt by developers protesting against such exploitation and also by developers cleaning up the resulting messes. But therein lies the rub: the pain of disruption caused by developers who protest is largely felt by other package developers, not by the company executives and venture capitalists who are empowered to correct the disparities.

Thankfully, a well-designed package manager can help you mostly avoid confrontation with these particular dragons. However, npm did not. Its centralized registry saw a tenfold increase in load from November 2012 to October 2013. It almost broke the ecosystem at that time. The unique solution taken by npm's primary developer and copyright holder of the source code was to create a company backed by venture capital. That worked for a while but caused significant strife several years later. npm Inc’s former CEO tells one story and npm Inc’s employee #2 tells quite another. Both are very interesting. In contrast, the design of Go’s package ecosystem dispensed with package distribution for its registry. That significantly reduces operating costs and eliminates one source of contention.

Alas, by now, 2022 CE, the npm registry and much of the code for the Go programming language are at least mirrored by Microsoft’s GitHub, the most popular open source commons. Microsoft is also responsible for the development of the most popular open source IDE, Visual Studio Code, as well as the most popular cross-platform application runtime, Electron. For good measure, it optionally ships the Linux open source operating system as a subsystem of Windows. If you knew the Microsoft of the late 1990s and early 2000s, that’s quite a turnaround. In any case, it means that all software development now critically depends on Microsoft.

Then again, when you consider that the Olympic mountains to the west of Microsoft’s headquarters in Redmond, WA and the Cascade mountains to the east make for ideal dragon breeding grounds, that development isn’t too surprising.

Conclusion

Across a series of three blog posts, I made four major points: first, it is possible to rein in dependency hell, even in Python, and without a monorepo. It took us only a little more than 2,600 lines of code (not counting tests).

Second, building a package ecosystem on an NP-complete version satisfiability problem is madness, especially now that we know a much better alternative. If you are involved in maintaining a package manager, it’s time to switch to Go’s much saner design.

Third, while we software engineers love to engineer ourselves out of every challenge, we’d be well-advised to be more cognizant of social and economic factors first. They make a huge difference.

Finally, we address interesting engineering challenges here at Enigma. You might want to consider joining us.

🐲 Disclaimer: No dragons were harmed during the development of our dependency checker tool or the writing of this series. Enigma practices a strict catch-and-release program under supervision of Rubeus Hagrid and the World Wildlife Fund.

Package Management: Make Your Own Kind of Map

Robert Grimm — Wed, 22 Jun 2022 00:00:00 GMT

This is post #2 in a series. If you missed the first, read it here.

Welcome back to the wondrous world of package management.

You’ll recall we started exploring this world after pip, Python’s package manager, switched to a new package resolver that strictly adheres to version constraints and package installation suddenly might take hours when before it took seconds.

Our exploration revealed a big world of harsh terrain that’s infested with dragons and all kinds of other ferociously dangerous critters. Yet there are few maps of this world and they all suffer from significant inaccuracies and omissions. So we channeled our inner Isidore of Seville and set out to more completely and accurately map the world of package management.

In this second installment, we’ll talk about why you should maintain your own custom maps and leverage them to prevent dragon infestations. And I’ll share how, to solve the dilemma that started this quest, we transformed a simple script for extracting package requirements from 34 version control repositories into a checking tool that is based on the dragon-repellent design of Go’s package manager and validates every merge request for those 34 repositories.

The Evolution of Enigma’s Package Dependency Checker Tool

Part of our pipeline’s dependency graph as visualized by Graphviz 2022 CE. Laying out the graph left-right instead of the default top-down and in topologically sorted order would improve the presentation somewhat. Or maybe a visual map of a package dependency graph just isn’t that helpful.

I hope that you agree by now that Python package management is like catnip for dragons or dragonwort and hence irresistible to those ferocious buggers. But that knowledge — by itself — doesn’t really help us.

As we covered in the first post, I set out to locate the specific dragons hiding amongst the Python package requirements of the 15 tasks and 19 internal libraries in our small business data processing pipeline and wrote a script to avoid the tedium of manually inspecting 34 repositories. The first version of the script already used the GitLab API to extract each project’s requirements files as well as Artifactory’s API to extract the list of internal distributions. It then parsed the requirements, combined them into a global dependency graph, and tried to generate a visual representation with the Graphviz tool.

Alas, that first version also had three minor shortcomings: the code was rough. There were no tests. It didn’t work.

Ahem—let’s be more precise: after some initial effort, Graphviz did produce a faithful visual representation of the dependency graph. But as you can see above, that representation had an uncanny resemblance to an angry toddler’s doodle. Reorienting the drawing order from Graphviz’s default top-down to left-right, sorting nodes in topological order (graphlib in Python’s standard library is truly awesome), and then declaring nodes before edges all improved visual clarity. But the resulting graph still wasn’t actually useful: it now resembled a high school student’s first technical drawing assignment — after that student got so frustrated two-thirds through, they went into angry-toddler-doodling mode. It appears that our package dependency graph simply is too complicated for visual representation.
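For readers who haven't used graphlib yet, here is a minimal sketch of the topological sorting step. The package names and the shape of the graph are made up for illustration; the only assumption about the real tool is that it maps each package to the packages it requires.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency graph: each key requires the packages in its set.
requires = {
    "task-ingest": {"lib-io", "pandas"},
    "lib-io": {"botocore"},
    "pandas": {"numpy"},
}


def sorted_packages(graph):
    """Return all packages so that each one comes after its requirements."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        # The second element of args holds the detected cycle.
        raise ValueError(f"dependency cycle: {error.args[1]}") from error


order = sorted_packages(requires)
print(order)
```

TopologicalSorter treats the values of the mapping as predecessors, so required packages are emitted before the packages that need them. The exact order among unrelated packages is not something to rely on.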

Luckily, my tech lead had run the script himself and noticed that it was also generating two text files. I had added generation logic for debugging purposes, with both files formatted in glorious Markdown. One file listed the dependency graph just as requirements.txt files do, that is, organized by requirer, with the packages each one requires listed underneath. The other file listed the inverted graph, that is, organized by required package, with all of its requirers listed underneath. That second file turned out to be the perfect dragonbane by making the identification of inconsistent requirements trivial: it listed all version constraints for the same package, one under the other.
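The inverted listing is easy to sketch. The snippet below is not the actual script; the repository names, the requirement strings, and the helper functions are hypothetical, and the parsing is far cruder than what real requirements files need.

```python
from collections import defaultdict

# Hypothetical requirements per repository, as they might appear in requirements.txt.
requirements = {
    "task-ingest": ["botocore>=1.19.21", "pandas>=1.3.0"],
    "lib-io": ["botocore>=1.15.0"],
    "lib-stats": ["pandas~=1.2.4"],
}


def split_requirement(requirement):
    """Split 'name<op>version' into (name, constraint) at the first operator character."""
    for index, char in enumerate(requirement):
        if char in "<>=~!":
            return requirement[:index].strip(), requirement[index:].strip()
    return requirement.strip(), ""


def constraints_by_package(requirements):
    """Invert the graph: map each required package to (requirer, constraint) pairs."""
    inverted = defaultdict(list)
    for requirer, entries in requirements.items():
        for entry in entries:
            name, constraint = split_requirement(entry)
            inverted[name].append((requirer, constraint))
    return dict(inverted)


def to_markdown(inverted):
    """List all constraints on the same package one under the other."""
    lines = []
    for name in sorted(inverted):
        lines.append(f"## {name}")
        for requirer, constraint in inverted[name]:
            lines.append(f"- `{constraint}` required by {requirer}")
    return "\n".join(lines)


inverted = constraints_by_package(requirements)
print(to_markdown(inverted))
```

With all constraints on pandas or botocore stacked under one heading, a mismatch such as `>=1.3.0` next to `~=1.2.4` is hard to miss.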

I created the necessary issues for manually cleaning up the requirements.txt files and, over the next few sprints, the entire team chipped in to get them done. We even beat our internal deadline, thanks to my script and because our dragon infestation was smaller than we feared — as it turns out, dragons produce a lot of smoke but not that much fire.

With that infestation handled, Enigma's CTO immediately asked how we planned to prevent recurrences. Thus the project scope grew to include a checking tool that would run as part of every merge request and keep requirements consistent. That implied three areas for improvement:

First, we needed internal consistency checks, since the dependency checker relied on a manually configured list of pipeline repositories and such secondary sources of truth have the annoying habit of becoming inconsistent with reality. That may still happen with the consistency checks in place, but the tool fails with an error message identifying the package missing from the internal package list.
Second, we needed a better graph representation with a uniform representation for edges to simplify traversal of the dependency graph in both directions, i.e., from dependent to dependee and from dependee to dependent.
Third, we needed unit tests, ideally lots of them.

While we worked on making those changes a reality, I spent a weekend exploring exactly which checks our dependency checker tool should enforce. Given that Python’s version satisfiability is NP-complete, I considered building a symbolic reasoning engine on top of the SymPy package but realized that option was far too ambitious. Then I recalled seeing Russ Cox’s series of blog posts on Go’s module system a few months prior and revisited them, poring over the gory details.

Go’s package manager makes a novel trade-off between the expressivity of the version constraints and computational complexity, while also producing predictable results. Its starting point is semantic versioning, which provides clear rules for version number increases: increment the third, patch number for bug fixes, the second, minor number for new features, and the first, major number for backwards-incompatible changes. It then imposes four more restrictions:

Version constraints are limited to minimum version constraints, i.e., the >= operator in Python.
Major version updates require that the name of the package be changed as well. This corresponds to an implied maximum version constraint on the next major version.
Applications may additionally pin or exclude specific versions; these extra constraints are ignored when the same package is used as a library.
Instead of picking the latest package version, which may change over time, Go’s package manager always picks the oldest version fulfilling the version constraints.

None of these restrictions are onerous. The rules of semantic versioning are simple enough. Minimum version constraints are sufficient for capturing the features and bug fixes required for a package. Major version upgrades are already difficult and always require care, only now package maintainers feel more of the pain of backwards-incompatible changes — just as it should be. And appending, say, “v2” to the name of your package isn’t too difficult either.

At the same time, the benefits are tremendous: semantic versioning clearly communicates the expected impact of a package update. Go’s package manager can compute suitable versions in linear time, whereas no efficient algorithm is known for NP-complete problems. It always arrives at the same solution, even if new package versions have been released in the meantime.
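To make the linear-time claim more tangible, here is a simplified sketch of version selection under minimum-only constraints. It is an illustration in Python, not Go's implementation: the release table is hypothetical, versions are plain tuples, and pins, exclusions, and the renaming of major versions are all ignored.

```python
from collections import deque

# Hypothetical universe: (package, version) -> minimum requirements of that release.
# Versions are (major, minor, patch) tuples so that they compare numerically.
RELEASES = {
    ("app", (1, 0, 0)): [("a", (1, 2, 0)), ("b", (1, 0, 0))],
    ("a", (1, 2, 0)): [("c", (1, 1, 0))],
    ("b", (1, 0, 0)): [("c", (1, 3, 0))],
    ("c", (1, 1, 0)): [],
    ("c", (1, 3, 0)): [],
    ("c", (1, 4, 0)): [],  # a newer release exists, but nobody asks for it
}


def select_versions(root, releases):
    """Pick, for every reachable package, the largest minimum version requested."""
    selected = {}
    seen = {root}
    queue = deque([root])
    while queue:
        name, version = queue.popleft()
        if version > selected.get(name, (-1,)):
            selected[name] = version
        # Each release is visited once and each requirement is followed once.
        for requirement in releases[(name, version)]:
            if requirement not in seen:
                seen.add(requirement)
                queue.append(requirement)
    return selected


build = select_versions(("app", (1, 0, 0)), RELEASES)
print(build)
```

Because the traversal only ever looks at versions that some manifest names explicitly, publishing c 1.4.0 doesn't change the outcome. That is the predictability property described above.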

Fantastic! I found a realistic blueprint for realistic package management. But given the realistic engineering constraints of a startup, I had to make some pragmatic choices in realizing this blueprint. So I decided to punt on minimum version selection for now. Writing a wrapper script for pip would have to wait for another day.

I also decided to limit Python’s environment markers to constraining Python versions only and to partitioning the version space into two. A quick glance at the output of the dependency checker reassured me that this restriction was realistic and wouldn’t break any of our existing uses of environment markers. The restriction implies that checking the consistency of version constraints on a given package may have to be performed twice, once for each partition.

Finally, I decided to stick to the Go blueprint when it comes to version constraints and to allow minimum version constraints on release versions only, with an optional maximum version constraint on the next major version and no dev, pre, and post versions.
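A check along those lines might look as follows. This is a sketch under stated assumptions rather than the code of our tool: the function name and the regular expression are invented for illustration, only the explicit `>=` and `<` spelling is handled, and the compatible release operator is deliberately left out.

```python
import re

# Release versions only: one to three dot-separated integers, no dev/pre/post suffix.
RELEASE = r"(\d+)(?:\.(\d+))?(?:\.(\d+))?"
CONSTRAINT = re.compile(rf"^>=\s*{RELEASE}\s*(?:,\s*<\s*{RELEASE}\s*)?$")


def is_compliant(constraint):
    """Accept '>=X.Y.Z', optionally followed by a '<' bound on the next major version."""
    match = CONSTRAINT.match(constraint.strip())
    if match is None:
        return False
    if match.group(4) is None:  # no maximum constraint given
        return True
    lower_major = int(match.group(1))
    upper = [int(part or 0) for part in match.groups()[3:6]]
    return upper == [lower_major + 1, 0, 0]


for candidate in (">=3.4.5", ">=3.4.5,<4", ">=3.4.5,<3.5.0", ">=1.0.0rc1", "==1.2.3"):
    print(candidate, is_compliant(candidate))
```

Anchoring the pattern at both ends is what rejects suffixes such as `rc1` or `.post1`: anything after the release number that is not a well-formed maximum constraint makes the match fail.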

The resulting dependency checker tool comprises a little more than 2,600 lines of well-documented Python code (not counting tests) and has been running in continuous integration for a couple of months now in advisory mode. During that time, we fixed at least one bug — our ontology contains data, not code, and thus does not require dependency checking — and some but not all of the reported errors.

At the time of writing this blog post, 21 out of 34 repositories still include some non-compliant version constraints. Because there are so many affected repositories and fixing them all would require a clear commitment, some engineers have argued to just cut our losses and loosen the restrictions enforced by the tool.

The primary cause of contention is Python’s compatible release operator ~=. It really is syntactic sugar for a pair of minimum and maximum version constraints. That works out well when the operator is applied to a version number consisting of major and minor version only, since, say, ~=3.4 translates to >=3.4, <4.0 or >=3.4.0, <4.0.0 and thus is perfectly consistent with Go’s restrictions. But when the compatible release operator is applied to a version number consisting of a patch version too, it is inconsistent with those same restrictions. For instance, ~=3.4.5 translates to >=3.4.5, <3.5.0, i.e., includes a maximum version constraint on the minor version. The latter case makes up the vast majority of remaining errors reported by our dependency checker tool.
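The translation is mechanical enough to sketch in a few lines. The helper names below are hypothetical and the sketch covers plain release versions only, so pre-releases and other refinements of Python's version specifiers are ignored.

```python
def expand_compatible_release(version):
    """Rewrite '~=V' as the equivalent (minimum, exclusive maximum) pair."""
    parts = [int(part) for part in version.split(".")]
    if len(parts) < 2:
        raise ValueError("~= needs at least two version components")
    # Drop the last component, increment the new last one, and pad with zeros.
    upper = parts[:-1]
    upper[-1] += 1
    upper += [0] * (len(parts) - len(upper))
    return ".".join(map(str, parts)), ".".join(map(str, upper))


def caps_next_major_only(version):
    """True if the implied maximum is the next major version."""
    _, upper = expand_compatible_release(version)
    major = int(version.split(".")[0])
    upper_parts = [int(part) for part in upper.split(".")]
    return upper_parts[0] == major + 1 and not any(upper_parts[1:])


for version in ("3.4", "3.4.5"):
    minimum, maximum = expand_compatible_release(version)
    print(f"~={version} means >={minimum}, <{maximum}", caps_next_major_only(version))
```

Only the two-component form caps at the next major version. Rewriting `~=3.4.5` as `>=3.4.5` (with an optional `<4`) loosens the constraint, which is exactly the change under debate.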

The pushback against fixing every affected repository serves as a useful reminder that enabling a new coding tool is only the beginning of the deployment process. It also requires generating buy-in from developers. In fact, we might just conclude that package management isn’t only a technical challenge but also a social one.

That’s just the topic of the third blog post in this series.

Mapping the World of Package Management

Robert Grimm — Tue, 21 Jun 2022 00:00:00 GMT

In late 2020, we started having difficulties with the installation of Python packages.

At times, the default package manager for Python, pip, would take hours for what used to take seconds. Worse, it might just refuse to install the packages altogether, with an error message indicating a version conflict.

With 15 applications and 19 internal libraries in our small business data processing pipeline, all of them written in Python and spread out over as many git repositories, humans and machines alike need to install packages with reasonable regularity — the humans to develop new versions of the code, and the machines to create container images for deployment to production, typically as part of continuous integration.

It wasn’t too difficult to identify the proximate cause for this highly unusual slowdown. Version 20.3 of pip had been released on November 30, 2020 and included a change of package resolution algorithm. The previous and fundamentally unsound “pick pretty much any version” algorithm was now disabled by default and marked for removal in a future version. Its replacement treated version constraints as gospel truth and refused to install dependencies when those version constraints were in conflict. That there were version conflicts in our codebase wasn’t surprising. After all, the version constraints appearing in manifests had never been validated for global consistency. Hence it was unavoidable that this transition would be somewhat rough.

It didn’t help that the pip team buried the news about switching to the new package resolver as default: it is the second-to-last item in the list of features updated for the preceding beta release 20.3b1. Furthermore, with a public preview period of only four months, the pip team made the switch too early. The resolver was still prone to exhaustively exploring versions that would never satisfy the constraints. It also contained at least 19 bugs and shortcomings that were then fixed over the following twelve months — compared to 14 that were fixed during the preview period.

A further source of confusion was that most engineers use the version of pip bundled with Python’s standard library. That makes the version of pip used to install packages dependent on the version of Python used to run code, while also introducing a significant delay until any given version of pip becomes widely used. Notably, if you are using Python 3.7 (as some of us still do) and ignore the pip warning exhorting you to upgrade, you are still using the old resolver to this day. If you are using Python 3.8 (as some of us also do), then the release of Python 3.8.10 on May 3, 2021 fundamentally changed how you install Python packages. The apparent disregard in all this for established versioning conventions is striking. Making backwards incompatible changes in a new minor version (pip 20.3) or in a new patch version (Python 3.8.10) is a big no-no under semantic versioning and not helpful.

As exasperated tales about pip’s misbehavior multiplied in our daily standup meetings, it became clear that we needed to start a concerted effort towards cleaning up our package requirements. At the same time, we had little insight into the scope of the clean up, i.e., the number of packages with conflicting versions as well as the number of conflicting constraints for each conflicted package. So I found myself tasked with producing just that survey in October 2021.

Alas, I couldn’t be bothered with the tedium of manually surveying 34 repositories for their requirements. So I decided to write a Python script for doing the survey and thereby set off on a journey into the wondrous world of package management.

The exploration revealed that the world of package management is large, covered in harsh terrain, and crawling with dragons.

At the same time, there are few maps of this world and they all suffer from significant inaccuracies and omissions. So we decided to follow in the footsteps of early map makers for Earth — Anaximander! Hecataeus of Miletus! Isidore of Seville! — and map the world of package management more completely and accurately, across a series of blog posts.

This first post will give you a coarse overview map of package management, describing common features found across most terrains and locating the largest and most persistent dragon infestation found in this world. It also identifies the other ferocious critters haunting Python’s package ecosystem, which goes some way towards explaining that ecosystem’s dismal reputation.
The second post will argue for maintaining your own custom maps and leveraging them to prevent dragon infestations. It will chronicle the transformation of a simple script for extracting package requirements from 34 version control repositories into a checking tool that is based on the dragon-repellent design of Go’s package manager and validates every merge request for those 34 repositories.
In the third post, we’ll discuss two additional map layers that are critical to understanding this world: the social and economic considerations related to package management, which serve as dragonbane and dragonwort, respectively.

I hope that our map-making efforts help you to better navigate the world of package management and avoid being attacked by dragons along the way. If our maps fall short of that lofty goal, then please remember that the likes of Anaximander (living in Miletus ca. 600 BCE) and Isidore (living in Seville 1,200 years later) got a lot wrong as well.

Such are the risks when mapping previously uncharted lands.

A Simple Survey

A world map popular during the Middle Ages is the T-O map by Isidore of Seville. It was originally drawn ca. 636 CE but is shown here in its first printed version from 1472 CE. Source: Wikimedia, public domain.

Whenever we find ourselves on a journey into new and uncharted lands, it helps to draw a map. That way, we won’t get lost later on. For that to be the case, the map must be sufficiently accurate and detailed. Otherwise, we end up with a map like the T-O map by Isidore of Seville from 636 CE shown above. It clearly lacks detail. In fact, there isn’t any. At the same time, it isn’t as outlandishly inaccurate as you might think. If you take an orthographic projection of Earth with Jerusalem at the center, draw in the western half of the equator and the full meridian, and rotate the map by 45° counter-clockwise, you end up with the map shown below. Et voilà, the T-O map is not that different from a modern map anymore.

An orthographic projection of Earth with Jerusalem as the center rotated by 45° counter-clockwise. Source: Wikimedia, public domain.

Let’s return to package management and start by defining our terminology. A package is an archive file that contains software that has been prepared for distribution. That software may be a complete application or command line tool, but quite often it only is a building block or library. Code may be in either source or binary form and targets either a particular programming language or operating system. Each package has a unique name and each release of the package has a hierarchical version number.

As you already saw above in the examples of pip 20.3 and Python 3.8.10, the version number typically uses dots to separate the major, minor, and patch numbers. In the case of Python 3.8.10 the trailing zero is meaningful: this version is the eleventh patch release of Python 3.8. A software artifact depends on a package when it may execute some of that package’s code. To do so, the artifact requires the package by listing the dependency in its manifest. Each entry in that manifest names the required package and optionally also includes constraints on suitable package versions as well as the runtime environment.

The package manager implements the mechanics of a package ecosystem. Its primary two functions are to publish packages to some package registry and to install the dependencies of a software artifact by consulting that registry. Publishing a package requires a deliberate decision by the package’s developers or maintainers that the current state of the package’s code is worthy of consumption by others. The mechanics of publication include creating the required manifest and archives as well as uploading them to the registry.

Installing a package also requires a deliberate decision by the package’s users to build on a package’s contents. However, most package managers support the automatic installation of (some) package updates. The mechanics of installation include picking satisfactory versions for all dependencies, locating and downloading the corresponding archive files, validating their cryptographic hashes to establish integrity, and putting their contents into the right file system directories.
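The integrity check in that list is the easiest step to illustrate. The sketch below assumes that the registry publishes a SHA-256 digest for each archive; the function name is hypothetical, and a real package manager supports more than one hash algorithm.

```python
import hashlib
import hmac


def verify_archive(data, expected_sha256):
    """Return True if the archive bytes hash to the published hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking, through timing, where the digests differ.
    return hmac.compare_digest(actual, expected_sha256.lower())


archive = b"pretend these are the bytes of a downloaded archive"
published = hashlib.sha256(archive).hexdigest()  # stand-in for the registry's value
print(verify_archive(archive, published))
print(verify_archive(archive + b"tampered", published))
```

A digest only establishes that the download matches what the registry advertised. Whether the advertised content deserves trust is a separate question.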

We can see why package managers have become so popular. They remove a lot of friction from the process of publishing and consuming software components, while also offering an embarrassment of riches in software packages for all use cases and purposes. Gone are the days when, at least on Unix-like operating systems, one had to search the internet for suitable code, locate a distribution mirror, download the source archive, expand the contents into a working directory, and execute the GNU Build System incantation:

./configure

make

make install

Hence it should be no surprise that package managers have become integral to contemporary software engineering. However, since most commercial corporations prefer not to share (all) their software with the commons, they typically prevent direct access to the package registry and instead use a local proxy that prevents external publication while also auditing the consumption of external packages.

For example, Enigma maintains its own internal Python package registry that is not accessible from outside the firm. That registry is used to publish internal packages. It also proxies access to the Python Package Index (PyPI) serving as public registry, which enables the 15 tasks and 19 internal libraries making up our small business data pipeline to depend on 220 external packages, comprising 60 direct dependencies and the transitive closure of their dependencies.

Hic Sunt Dracones: Here Be Dragons

Detail from the Hunt-Lenox Globe 1510 CE with the inscription at the center stating H(i)c Sunt Dracones: Here Be Dragons. Source: New York Public Library / University of Rochester, public domain.

All that removal of friction comes at a price. The Hunt-Lenox Globe shown above is kind enough to point out where the dragons are hiding. Sam Boyer, who implemented most of the (by now deprecated) dep dependency manager for Go, created the equivalent blog post for package management. He doesn’t just point to the dragons. He warns us to stay away right at the start, titling the first section “Package management is awful, you should quit right now.” — Ooh, a challenge? This will be interesting and, yes, also fun! But we should still map out where all the dragons and other nasty critters are hiding.

The Dragon Boss: Installing Packages

We’ll spend the rest of this post wrangling the biggest dragon related to package management: installing packages. Fundamentally, it requires determining the transitive closure of an application’s declared dependencies. That includes the packages required by the application, the packages required by the packages required by the application, and so on until a fixed point is reached. Doing so is necessary just for functional completeness, so that the application has all the code it might execute readily and locally available.
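Ignoring versions for a moment, the fixed-point computation can be sketched as follows. The function is hypothetical and the requirement table is illustrative only; it is not meant as an accurate record of what these packages depend on.

```python
def transitive_closure(declared, requires):
    """Expand the declared dependencies until no new package appears."""
    closure = set(declared)
    while True:
        expanded = closure | {
            dependency
            for name in closure
            for dependency in requires.get(name, ())
        }
        if expanded == closure:  # fixed point: another round adds nothing
            return closure
        closure = expanded


# Illustrative requirement table: package -> packages it requires.
requires = {
    "pandas": {"numpy", "python-dateutil"},
    "python-dateutil": {"six"},
    "boto3": {"botocore"},
    "botocore": {"urllib3"},
}

needed = transitive_closure({"pandas", "boto3"}, requires)
print(sorted(needed))
```

The loop terminates even when packages require each other in a cycle, because the set can only grow and there are finitely many packages.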

The package manager also needs to figure out the correct versions of all packages to install. That is because any practical package ecosystem doesn’t just distinguish between different packages, which usually have a descriptive name, but also distinguishes between different versions of a package. Supporting different versions is necessary to accommodate ever-present change. Versions are usually named by a three-part hierarchical number. For example, Amazon recently released version 1.23.50 of its botocore Python library for accessing AWS. Remember that the third, least significant component of that version number is 50, i.e., the zero is meaningful and the preceding version is 1.23.49.
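To make the point about the meaningful zero concrete, here is a minimal sketch that assumes plain three-part numeric versions (real Python versions allow more, as discussed later). The helper name parse_version is made up for this illustration.

```python
def parse_version(text):
    """Split a dotted version such as '1.23.50' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# Comparing the components numerically gives the intended ordering.
print(parse_version("1.23.50") > parse_version("1.23.49"))

# Dropping the zero changes the version: 1.23.5 is older than 1.23.49 ...
print(parse_version("1.23.5") < parse_version("1.23.49"))

# ... even though plain string comparison claims the opposite.
print("1.23.5" > "1.23.49")
```

Comparing versions component by component as integers, rather than as strings, is what keeps 1.23.50 right after 1.23.49.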

When packages declare a requirement in their manifest, they provide a package name, a comparison operator, and a version number. For instance, the requirement “botocore >= 1.19.21” indicates that the application or library requires version 1.19.21 or later of the botocore package. As I described in a previous blog post, that is the first version to include a fully working implementation of the adaptive retry logic, which greatly simplifies the development of robust client code.

Another library may also depend on botocore but specify “botocore >= 1.15.0” as a requirement. Since both requirements appear in the transitive closure of dependencies, our package manager must pick a version that meets both constraints, such as the larger of the two minima, which is version 1.19.21. Conversely, if the constraints cannot be satisfied by a single version, the package manager should report back with an error message that clearly states the unsatisfiable constraints, so developers can adjust the requirements.
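The botocore example can be sketched in a few lines. This is an illustration only: it assumes that all constraints are lower bounds (>=), the list of available versions is invented, and pick_version is a hypothetical helper rather than part of any real package manager.

```python
def parse_version(text):
    """Split a dotted version such as '1.19.21' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def pick_version(minimums, available):
    """Return the lowest available version that satisfies every '>=' constraint."""
    floor = max(parse_version(minimum) for minimum in minimums)
    candidates = [v for v in available if parse_version(v) >= floor]
    if not candidates:
        wanted = ", ".join(">= " + minimum for minimum in minimums)
        raise ValueError("no available version satisfies " + wanted)
    return min(candidates, key=parse_version)


# Hypothetical list of published versions.
available = ["1.15.0", "1.19.21", "1.23.50"]
print(pick_version(["1.19.21", "1.15.0"], available))
```

The sketch returns the larger of the two minima, matching the example above. Many real resolvers prefer the newest satisfying version instead; swapping min for max in the last line of pick_version would model that choice.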

Picking a suitable version is rarely as easy as in the above example. First, determining the transitive closure of packages to install and resolving version constraints interfere with each other: different versions of a package may require different packages. Second, the problem of resolving version constraints is NP-complete. In other words, version satisfiability is one of many problems that are equivalent to each other and have no known efficient solution.

At the same time, that class of problems has another, more helpful characteristic: when given a possible solution, we can efficiently verify whether the candidate is in fact a solution. In a pinch, then, we have a general but exponential time algorithm for computing a solution: enumerate all possible solutions (the expensive part) and verify each one of them (the efficient part). If a candidate holds up to scrutiny, we use that as the solution. If we run out of candidates, there is no solution. At least, that’s the theory.
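The enumerate-and-verify approach can be sketched as follows. The registry, including the package named client, is invented for illustration, and the sketch makes two simplifying assumptions: every package in the registry gets installed, and all constraints are lower bounds.

```python
from itertools import product


def parse_version(text):
    """Split a dotted version into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical registry: package -> {version -> {dependency: minimum version}}.
REGISTRY = {
    "app": {"1.0.0": {"client": "2.0.0", "botocore": "1.19.21"}},
    "client": {
        "1.0.0": {"botocore": "1.15.0"},
        "2.0.0": {"botocore": "1.15.0"},
    },
    "botocore": {"1.15.0": {}, "1.19.21": {}, "1.23.50": {}},
}


def is_solution(selection):
    """The efficient part: check every requirement of every selected version."""
    for package, version in selection.items():
        for dependency, minimum in REGISTRY[package][version].items():
            if parse_version(selection[dependency]) < parse_version(minimum):
                return False
    return True


def solve():
    """The expensive part: try every combination of one version per package."""
    names = sorted(REGISTRY)
    choices = [sorted(REGISTRY[name], key=parse_version) for name in names]
    for versions in product(*choices):
        selection = dict(zip(names, versions))
        if is_solution(selection):
            return selection
    return None  # ran out of candidates: no solution exists


print(solve())
```

The number of combinations is the product of the version counts per package, which is where the exponential blowup comes from as registries grow.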

In practice, the dependency graphs formed by package requirements and their version constraints often aren’t that big, and hence even that exhaustive search remains computationally manageable. Also, there is nothing preventing us from being clever and using heuristics that tend to produce a result sooner than mindless enumeration (but won’t always work). Options include SAT solvers, which solve the equivalent problem of boolean satisfiability and have gotten darn good at doing so, in part thanks to yearly competitions, as well as domain-specific algorithms.

Alas, practice cannot escape theory: the cliff of worst-case exponential performance remains very real — as our hours-long pip runs illustrate. Worse, the NP-completeness of the problem domain becomes a barrier to developing other tools, such as our dependency checker.

During development of some of the gnarlier analysis code in our dependency checker, I was seriously exploring the feasibility of extending a symbolic reasoning engine with a theory of version numbers. When that seemed a tad impractical for a weekend project, I started looking for package management algorithms that trade a reasonable loss in constraint expressivity for a gain in guaranteed performance. Lo and behold, I found Go’s module system, which restricts the expressivity of version constraints to no more than what’s required and, in return, gains predictable results computed in time linear in the problem size.
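Go calls its algorithm minimal version selection. The following is a rough sketch of the core idea only, assuming every requirement is a minimum version; the requirement graph and module names are invented, and the real implementation handles additional directives that are ignored here.

```python
def parse_version(text):
    """Split a dotted version into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical requirement graph:
# (module, version) -> list of (required module, minimum version).
REQUIREMENTS = {
    ("app", "1.0.0"): [("a", "1.2.0"), ("b", "1.1.0")],
    ("a", "1.2.0"): [("c", "1.3.0")],
    ("b", "1.1.0"): [("c", "1.4.0")],
    ("c", "1.3.0"): [],
    ("c", "1.4.0"): [],
}


def minimal_version_selection(root):
    """Walk the graph once and keep the highest minimum seen per module."""
    selected = {}
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        module, version = node
        current = selected.get(module)
        if current is None or parse_version(version) > parse_version(current):
            selected[module] = version
        stack.extend(REQUIREMENTS[node])
    return selected


print(minimal_version_selection(("app", "1.0.0")))
```

Because each (module, version) node is visited at most once and there is no backtracking, the work grows linearly with the size of the requirement graph.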

But before it gets better, it gets a whole lot worse.

Python Packaging: From Dragons to Superfund Site

An overview of Python packaging. Names and versions have changed a bit since April 30, 2018 CE when this xkcd comic first appeared. Nonetheless, it remains a mostly accurate description of the state of the art for Python package management. Source: xkcd.com, CC BY-NC 2.5. Our sincere thanks to Randall Munroe for allowing us to reproduce this comic.

The above xkcd comic provides an approximate map of possible Python installations for a computer. In doing so, it also provides a first indication that Python’s package ecosystem is quite a bit more complicated than others. In fact, if anything, the above map is too simplistic. Python has six officially supported installation schemes, each of which comprises eight different paths. So-called virtual environments modify those paths even further and cross-compiling between environments is possible only with officially unsupported hacks. The difficulties don’t stop there. Additional complicating factors, roughly ordered from less to more severe, include:

Version numbers have several uncommon attributes, notably epochs and post releases, that complicate reasoning about versions without providing any additional expressivity.
Version constraints support an uncommon operator, the arbitrary equality operator ===, that has Python semantics and thus is hard to model outside a Python interpreter.
Environment markers further constrain requirements to, say, specific Python versions, Python interpreters, or operating systems. The number of distinct such environments deepens the complexity of dependency analysis.
Version numbers have no semantics, only an ordering. While that fulfills the minimum requirements for version satisfiability being computable, it also does not help with gauging the impact of any version increase. That, in turn, results in highly confusing scenarios — such as the one described above, where a patch release of Python (3.8.10) contains a minor release of pip (20.3) that includes a groundbreaking, completely backwards-incompatible feature.
Packages traditionally do not include their dependencies with the metadata. One must first download and build the package. This considerably slows down version selection and wastes tremendous network and package registry resources.
Packages traditionally specify their metadata in code, not in data. That makes it nearly impossible to extract the metadata without running arbitrary code, including package-specific plugins.
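The installation schemes and paths mentioned before the list can be inspected with the standard library's sysconfig module. The sketch below only prints what the running interpreter reports; the exact counts depend on the Python version and platform, so they may differ from the numbers given above.

```python
import sysconfig

scheme_names = sorted(sysconfig.get_scheme_names())
path_names = sysconfig.get_path_names()

print(len(scheme_names), "installation schemes:", ", ".join(scheme_names))
print(len(path_names), "paths per scheme:", ", ".join(path_names))

# Concrete locations for the scheme this interpreter uses by default.
for name, location in sysconfig.get_paths().items():
    print(f"{name:12} {location}")
```

Running the same snippet inside and outside a virtual environment shows how the environment rewrites those paths.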

What you get is a package ecosystem with a poor developer experience that makes tool building much harder than necessary. Worse, the ecosystem saw few improvements during Python’s “lost decade,” when most packages had to support two incompatible versions of the language. That decade lasted, roughly, from the first production release of Python 3 in December 2008 to the last production release of Python 2 in April 2020.

Thankfully, things are improving as of late and the rate of improvement is seemingly accelerating. Notably, support for pluggable build backends and the definition of a more realistic common metadata standard are significant steps towards a simpler Python package ecosystem.

Still, Python could do so much better if it rid itself of much of this cruft more aggressively while also rethinking foundational aspects more radically.

The site-specific map-making effort described in part two of this blog series will serve as partial proof of concept that doing so is feasible and has significant upsides.

How Enigma KYB Cuts Compliance Costs by Up to 80% Without Slowing Down Onboarding

Enigma — Wed, 15 Jun 2022 00:00:00 GMT

Business onboarding is one of the most friction-laden processes in financial services. Too much caution and you slow down growth. Too little and you take on real compliance risk. Enigma KYB is built to solve that tension — automating the verification work that most institutions still do manually, so you can onboard more customers faster without cutting corners on compliance.

The problem with most KYB providers

Most KYB providers can only instantly approve 30–50% of the businesses that apply. Everyone else gets routed to a manual review queue — which means documentation requests, back-and-forth with applicants, and high operational overhead. Those delays cost real money, and some portion of those applicants simply don't wait around.

The industry has a naming problem too. Many providers call this "auto-approval" — but auto doesn't mean instant. Auto-approvals are simply approvals that are automated from the customer's perspective, regardless of how long the process actually takes. Many of those approvals have manual processes running behind the scenes, often taking more than 24 hours to complete.

Instant approvals are different. They are both fully automated and completed in under 3 seconds.

Enigma KYB instantly approves more than 70% of registered businesses in under 3 seconds — roughly 1.5x more businesses than any other data provider.

How Enigma achieves higher instant approval rates

The higher approval rate isn't magic — it comes from richer data. Enigma enriches Secretary of State registration filings with additional layers of business identity information:

Foreign filings — registrations across multiple states
Operating names — DBA names used in practice, not just legal names
Operating addresses — where a business actually operates, not just where it's registered
Websites — the business's web presence as a matching signal

This enrichment means Enigma can match a business input like a DBA name or website URL back to its underlying SoS filing — even when the application doesn't perfectly match the registration. That matching capability is what converts businesses that other providers can't approve into instant approvals for Enigma customers.

The cost impact is significant. Because manual verification is expensive to run, increasing the share of businesses that receive instant approvals generates as much as 80% savings in KYB costs.

What Enigma KYB covers

SoS filing verification

Enigma collects and instantly verifies business and owner identity from Secretary of State filings across the country. Registration statuses are updated bi-weekly, so you'll know if a business is no longer in good standing before you onboard them — not after. You don't have to build or maintain the infrastructure to collect this data yourself.

High-risk activity classification

Some businesses operate in categories that require additional scrutiny before approval. Enigma automatically classifies businesses that conduct high-risk activities, covering:

Cannabis
Tobacco and Vaping
Firearms, Weapons and Ammunition
Adult Entertainment and Dating
Gambling and Sports Betting
Payments and Money Transfer
Multi-level Marketing
Pawn Shops
Gift Cards

When a business falls into one of these categories, it's surfaced as a high-risk classification so your team can make an informed decision — rather than discovering the issue during an audit.

OFAC screening

Enigma screens businesses and their owners against OFAC watchlists on a weekly refresh cycle. The lists covered include:

Watchlist	Abbreviation
CAPTA List	CAP
Foreign Sanctions Evaders	FSE
Non-SDN Menu-Based Sanctions	NS-MBS
Non-SDN Iranian Sanctions	NS-ISA
Non-SDN Chinese Military-Industrial Complex Companies List	NS-CMIC
Non-SDN Palestinian Legislative Council List	PLC
Specially Designated Nationals	SDN
Sectoral Sanctions Identifications List	SSI

Enigma's screening engine is used by top 5 U.S. banks, has been audited by regulators, and is proven to have low false positives without an increase in false negatives. That last point matters: reducing false positives without pushing false negatives up is genuinely hard to do, and getting it wrong in either direction creates problems — either unnecessary friction for legitimate businesses or sanctions exposure for your institution.

Staying compliant as regulations change

Enigma uses trusted government data sources — primarily Secretary of State filings — so that the verification you're doing against Enigma's data holds up under regulatory scrutiny. That means:

Business and owner identities verified against official SoS filings
Easy stakeholder screening against global sanctions and watchlists
Lower false negative rates that reduce friction during audits

If you're building or optimizing a KYB program, it's worth understanding the full landscape of process options available. Our guide A Guide to Optimizing Your KYB Process walks through the trade-offs between building in-house, working with a single provider like Enigma, and waterfalling multiple data sources. And if you want to make sure your data collection covers everything regulators require, the KYB Requirements Checklist is a good starting point.

Ready to see what instant approval rates look like for your portfolio? Learn more about Enigma KYB or get in touch with the team.

A Conference in Quotes: Fintech Nexus 2022

April Runft — Fri, 10 Jun 2022 00:00:00 GMT

We came back energized from the 2022 Fintech Nexus USA conference, formerly known as LendIt Fintech. The show delivered on its promise to connect members of the fintech ecosystem and showcase how leaders are thinking about the latest trends.

The agenda was full of sessions on topics like small business lending, data, buy now pay later (BNPL), embedded finance, payments, and innovation.

Here’s a summary of select sessions as a rollup of memorable speaker quotes.

On the opportunity around the smallest of small businesses:

“Capital in isolation is not the solution. The key is illuminating the path to becoming a recognized business. The smallest businesses need support with processes, like getting an employer identification number, and operations guidance to help them evolve along the path from solopreneur to microbusiness to small business to enterprise.”

—Sean Salas, CEO and Co-Founder, Camino Financial in Lending to Micro-businesses: The Most Underserved SMB Market

On prioritizing customer experience:

“Whoever owns the experience wins.”

—Chris Scislowicz, North America Credit Lead, Accenture in SMB Lending: Adapt to Succeed

On banks and competition:

“Banks haven’t yet grasped that their competition isn’t other financial institutions. Stop looking across the street. Look at anyone with a big balance sheet as competition. The new dominant player will be the one with the next killer app, one of the tech leaders who focus on experience.”

—Robin Smith, Regional Vice President, North America, Mambu in SMB Lending: Adapt to Succeed

On projects and change management today:

“We used to talk about people, process, and technology as the three keys to successful projects. Today that has become politics, policies, and culture.”

—Chris Scislowicz, North America Credit Lead, Accenture in SMB Lending: Adapt to Succeed

On whether fintechs are a threat to banks:

“If you’re a fintech and you want to move money, you need to partner with a well-run bank. Fintechs are friend to banks, not foe.”

“On these fintechs that are buying banks — great! It’s a good strategy if you’re willing to run a bank. Get ready for audits and regulation. It’s expensive. It’s hard.”

—Chris Dean, Co-founder and CEO, Treasury Prime in Why the Real Threat for Banks is Complacency

Photo: Real-Time Payments and Cryptocurrency session. At left, Chris Smalley, Managing Director of Digital Banking, Customers Bank and Daniel Webber, co-founder and CEO, FXC Intelligence

On cryptocurrency and blockchain’s potential:

“Banks have been unacceptably slow to adopt digital asset trading. This isn’t the crypto of 10 years ago. Today there are sophisticated institutional investors in the space. The goal is capital efficiency. Clearing fiat payments is wildly inefficient.”

“Even more than crypto, I’m excited about blockchain’s potential for other assets. There’s no reason tech can’t replace the ability to show physical ownership records. Blockchain can put title companies out of business.”

—Chris Smalley, Managing Director of Digital Banking, Customers Bank in Real-Time Payments and Cryptocurrency

On buy now pay later B2B use cases and shifting relationships:

“Buy now pay later options are becoming available to enterprise finance teams via invoices. It’s no longer so crucial for a founder to call upon their venture capital firm and banker for capital.”

—Miguel Fernandez, Co-Founder and CEO, Capchase, in Embedded Finance: Ubiquity or Winner-Take-All?

On what’s ahead for embedded finance:

“I think we’re going to see expansion into B2E (business to employee). For example, Lyft partners with Stride Bank to pay drivers after each ride or delivery in an "earn as you go" model. It's fully integrated into the Lyft app, so it's a seamless experience for drivers.”

—Michael Haney, Head of Digital Core, Technisys in Embedded Finance: Ubiquity or Winner-Take-All?

Photo: Creating a Virtuous Fintech Ecosystem for Small Businesses session. Speakers pictured left to right: Rob Daniel, Director, Product Management, Intuit QuickBooks; Tui Allen, Director of Product, Shopify; Cetin Duransoy, COO, Fundbox; Luke Voiles, GM, Square Banking, Block; Brock Blake, CEO & Cofounder, Lendio

On challenges facing small businesses from pandemic to present:

“We’ve seen merchants go from fully online to fully in store. They’re preparing for inflation and the coming decrease in consumer spending. We see our role here as building the infrastructure for commerce of the future and trying to remove the burden of managing the business side of small business.”

—Tui Allen, Director of Product, Shopify in Creating a Virtuous Fintech Ecosystem for Small Businesses

On how data and partnership unlock more opportunity for small businesses (SMBs):

“For us it’s all about a unique stream of data. Today we serve 51% women-owned businesses. Getting transaction intel from merchants means we can serve businesses no one else can because we can make decisions based on real-time data. But data is also a challenge. We look to pull in other transaction-level information to build out that holistic picture of a business, which often means buying new data sources.”

—Luke Voiles, GM, Square Banking, Block in Creating a Virtuous Fintech Ecosystem for Small Businesses

“For us, the big question is ‘how do we best equip SMBs and get them better access to capital?’ So we’re happy to send SMBs to partners when that makes sense versus feeling protective. And we can be open about where QuickBooks won’t be able to prioritize certain capabilities, so partners can focus there and better serve SMBs.”

—Rob Daniel, Director, Product Management, Intuit QuickBooks in Creating a Virtuous Fintech Ecosystem for Small Businesses

Introducing Business Hierarchies

Alexander Lee — Wed, 25 May 2022 00:00:00 GMT

Understanding a company’s growth potential or risk requires taking a look at the full picture of the business – across all of its physical locations and online revenue streams.

To provide this comprehensive view of a small business’s financial health to our customers, we are excited to introduce two new product features:

Business hierarchies: Customers can now see the relationship between a business and all of its individual locations. Queries on an individual location or URL will display total revenue data for all related locations of that business and its online channel.
URL matching: Customers can now query Enigma’s data with a URL.

Getting the full picture: Why it’s important

These new product features were born directly out of our customers’ needs.

In the words of Pieter Van Ispelen, VP of decision science at Divvy: “For our decision-making, it’s critical that we get the full picture of a multi-location business, including online revenue, in one view. These new capabilities will be valuable for our team.”

I’ll illustrate why this is so critical with an example: Let’s say I opened up a coffee shop called Xander’s Coffee, and it experienced such explosive growth with its superior on-tap cold brew and matcha lattes that I was able to quickly open up a second location across town.

The second location was very successful as well, though it cannibalized some sales from the first location, and I decided to apply for a loan to open a third Xander’s Coffee. If the bank looked only at revenue information from the first location, they might be hesitant to lend to me since that location was no longer growing. But if they looked at data across both locations, they would see that the business was indeed quite healthy and a strong candidate for the loan.

Finally, let’s say I started selling tons of Xander’s Coffee branded coffee beans through the company’s website. Someone financing the business would want to know this information too.

In fact, another of our customers used this very example (a coffee shop with three locations) to highlight the value of these features across multiple use cases. They explained that they would want to understand its financial performance aggregated across the three locations to both market and underwrite relevant products to serve their customers – and, specifically, that they would want to offer the shop one business credit card, not three.


Our solution

With our new business hierarchies feature, customers can now see the relationship between a business and all of its individual locations – and an aggregated revenue figure.

The following diagram shows how the fictitious Xander’s Coffee would be represented in the Enigma dataset:

Note in the diagram that the parent business, Enigma ID B123, displays the sum of revenues from all locations and its online channel.

To receive information about a specific location (e.g. E456 in the diagram) or about the entire business (B123), you may continue to submit a business name and address, or person associated with a location of the business. What’s new: now you can query our data with a URL to retrieve comprehensive information about the associated business (B123). Since URLs are unique, we believe they make a very reliable search parameter.
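One way to picture the hierarchy is as a small data structure. The sketch below is purely illustrative and is not Enigma's API or data model: the class names, the URL, the second location ID, and all revenue figures are invented, while B123 and E456 are the IDs used in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    enigma_id: str
    revenue: int


@dataclass
class Business:
    enigma_id: str
    url: str
    online_revenue: int
    locations: list = field(default_factory=list)

    def total_revenue(self):
        """Sum of revenue across all locations plus the online channel."""
        return self.online_revenue + sum(loc.revenue for loc in self.locations)


def build_index(businesses):
    """Map every business ID, location ID, and URL to its parent business."""
    index = {}
    for business in businesses:
        index[business.enigma_id] = business
        index[business.url] = business
        for location in business.locations:
            index[location.enigma_id] = business
    return index


# Hypothetical figures for the fictitious Xander's Coffee.
xanders = Business(
    enigma_id="B123",
    url="xanderscoffee.example",
    online_revenue=20_000,
    locations=[Location("E456", 50_000), Location("E789", 40_000)],
)
index = build_index([xanders])

# A location ID, the business ID, and the URL all lead to the same total.
for key in ("E456", "B123", "xanderscoffee.example"):
    print(key, index[key].total_revenue())
```

Whichever key the query starts from, the lookup resolves to the parent business and therefore to the same aggregated revenue figure.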

With these features, we are proud to provide even more accessible and actionable insight into the card revenues of 80-90% of U.S. card-accepting businesses.

See business hierarchies in action

We are excited for you all to try this out for yourselves! Take a look at our API documentation for the technical details or get in touch to talk about how this data can be useful for your organization.

What is CRM Data Enrichment? A Complete Guide

Enigma — Mon, 23 May 2022 00:00:00 GMT

You rely on the customer data in your CRM – customer relationship management platform – every day when identifying and evaluating potential prospects. When this data is complete and current, your sales and marketing teams can do what they do best and conduct more targeted, successful prospecting to draw in new customers. But if it’s not? Your customer acquisition efforts are more likely to come up short.

With enriched data in your CRM, you can avoid this outcome and empower your teams with the tools they need. Let’s learn more about what CRM data enrichment is, how it can benefit your business, and best practices for success.

Your data and your CRM

Every day, you directly collect raw customer data from a variety of sources, including advertising, demand generation emails, website lead forms, social media, and more.

To help you make sense of this information, you likely enter it into a CRM, which stores, centralizes, and organizes it so that your sales and marketing teams can put it to use.

This collected data is usually grouped according to four different categories to help build comprehensive customer profiles:

Personal: Personal data may include the name or company name, location data, job title of the prospect, and other basic details.
Categorical: Categorical data may include not only demographic information like gender but also information such as job department, length of employment, and seniority.
Contact: Contact data may include addresses, phone numbers, email addresses, and social media accounts.
Firmographic: Firmographic data may include information on a company, office locations, industry affiliations, number of employees, and more.

You may also have additional fields to keep track of previous outreach or communication efforts, and any other relevant notes.

What is CRM data enrichment?

Despite your best efforts, not all of this data collection may be complete. You may be missing any number of important details on a potential lead, which could create challenges for your sales teams. For example, online contact forms tend to be short and simple to encourage more people to fill them out, leading to unavoidable data gaps. You may also be dealing with entries that are out of date and haven’t been touched in months, or even years, leading to poor data quality.

Organizations like yours can address these issues by enriching the data in your CRM and building out your small business database more effectively. Also known as data appending, data enrichment is a process that supplements your existing first-party CRM data with third-party external data to expand, fill in, and update your data sources.

Data enrichment is different from data cleansing, in which you remove broken, duplicate, or otherwise unnecessary entries from your CRM. Data enrichment instead adds information to your CRM, closing crucial gaps and ensuring that you’re working with the most up-to-date information to obtain valuable data points and insights.
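As a rough illustration of the gap-filling part of enrichment, here is a minimal sketch. The field names and both records are made up, and enrich is a hypothetical helper that only fills empty fields; refreshing stale values would additionally require knowing how old each value is.

```python
def enrich(record, third_party):
    """Return a copy of a CRM record with empty fields filled from external data."""
    enriched = dict(record)
    for field_name, value in third_party.items():
        if enriched.get(field_name) in (None, ""):
            enriched[field_name] = value
    return enriched


# Hypothetical first-party record from a short web form.
crm_record = {"company": "Xander's Coffee", "email": "hello@xanders.example", "industry": ""}

# Hypothetical third-party record for the same company.
external = {"company": "Xanders Coffee LLC", "industry": "Coffee shops", "employees": 12}

print(enrich(crm_record, external))
```

The sketch keeps existing first-party values, such as the company name, and only adds what was missing.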

What CRM data enrichment does for you

Successful customer prospecting starts on a foundation of good data and can lead to many good business opportunities. If you make CRM data enrichment a priority, you can be sure that you’re consistently providing current, reliable, and detailed data for sales and marketing.

Greater personalization

The more you know about a potential customer, their needs, and their pain points, the more you can personalize your outreach to create a sales pitch that resonates. These kinds of targeted campaigns increase your opportunities to connect with prospects and turn them into meaningful customer relationships.

Increased productivity

With better data comes faster, easier, and more accurate decisions and workflows. Your team can score leads more effectively, and you won’t have to waste time pursuing incomplete or dead leads. You can focus your money and your resources only on the prospects that fit your ideal customer profile.

A better bottom line

When you work with high-quality data, you can conduct smarter, more efficient outreach that saves time and money, and lets you prioritize the most qualified leads—helping you to increase your conversion rates and your ROI. It also helps you make better business decisions across your entire organization.

An edge in the marketplace

If you and your competition are going after the same prospects, you have the upper hand if you have more accurate data and detailed insights needed to create a more compelling customer experience. If you don’t, that could put you at a disadvantage when trying to make a personal connection and convert promising leads into new business.

Build a lasting CRM data enrichment strategy

When done right, CRM data enrichment can offer tremendous benefits—but implementing it can be tricky. To avoid common pitfalls, you should:

Set clearly defined and achievable goals.
Audit your existing data so you know exactly what you have and what you need.
Plan to enrich your data at regular, ongoing intervals so you always stay up to date.
Implement a data enrichment tool and/or software to help support the process.

Selecting B2B data enrichment tools is a critical step; while you can enrich your data manually, it is more efficient and accurate if you leverage an enrichment solution that can help you collect the best data possible. Depending on your needs and the scope of your data, you can opt for something as simple as browser plug-ins that integrate with platforms like LinkedIn and other social apps.

For more in-depth data enrichment needs, companies can opt for a data enrichment service provider, who can often carry out these processes for you automatically without any additional action or oversight needed on your part.

There is no one-size-fits-all solution for CRM data enrichment, but for companies that are considering a data provider to assist with their enrichment initiatives and help improve the accuracy of the data itself, a provider like Enigma can offer the highest-quality data to support both your current and future needs.

Enigma combines public, private, internet, and hard-copy records into a custom mix of trustworthy data sources that compiles a full view of any small or medium business.

With actionable data on everything from transactions and revenues to financial risk factors, Enigma’s in-depth data reports can change the way you identify, evaluate, target, and convert prospects. Our APIs can even integrate into your existing CRM, so you don’t have to change your current data platform and processes.

Ready to learn more about Enigma and how our data can support and transform your B2B sales cycle? Sign up for your free demo today.

B2B Prospecting: Definition, Guide and Tools

Enigma — Fri, 13 May 2022 00:00:00 GMT

Your business depends on high-quality B2B prospecting and best practices.

Whether you’re trying to find new prospects or connect with the ones already in your CRM, your ability to grow your enterprise depends on identifying high-quality business prospects and converting them into high-quality clients.

To drive effective and efficient B2B prospecting, you need the right guidance and prospecting tools—and that means the most current, accurate data for sales and marketing teams on the small- and medium-sized companies being targeted.

What is B2B Prospecting?

B2B prospecting is the process of identifying, contacting, and converting potential prospects from your sales pipeline into new clients. To prospect effectively and match your ideal customer profile (ICP), you don’t just want to amass as many leads as possible. You want to focus on finding qualified leads that fit your customer profile, so you can better target your outreach, make strong connections, and build lasting relationships.

B2B prospecting is the earliest stage of the sales funnel, and one of the most important steps to get started — the more you can keep your sales and marketing pipeline filled with quality leads, the more you set yourself up for present and future success.

Different Methods for B2B Prospecting

There are many different sales tools and activities involved in effective prospecting, all of which are built around establishing a relationship with prospective clients and target accounts. Some of the sales prospecting methods available to sales reps include:

Cold calling: Reaching out to prospects via cold call to explain your products and services, what sets you apart, and how you could help them address their specific business concerns.
Email marketing: Sending targeted emails (and not just standard email templates) to prospects to supplement cold calling efforts, often with links to special offers or demos. These may or may not be cold emails, based on what stage of the prospecting cycle you’re in.
Content marketing: Creating, publishing, and distributing blogs, case studies, or other informational content that help establish you as an expert in your vertical to help target accounts.
Social selling: Maintaining a strong, consistent presence on social media sites like LinkedIn, as well as industry-specific forums.
Strong search engine optimization: Fully optimizing your pages for the SEO keywords you rate the highest, combined with paid search ads to provide extra targeting for specific words and phrases.
Live demos: Demonstrating your product and service offerings by walking prospective clients through them in real time, addressing their pain points, and giving them more information on the potential benefits for their business.

None of these prospecting and lead generation strategies should be performed independently of one another. They can be arranged and combined in packages called cadences, which give you more chances to make a sales pitch that connects with your prospect and aligns with your overall prospecting strategy.

Successful B2B Prospecting Strategies

Of course, no matter how skilled or enthusiastic your sales and marketing team is, you can’t just jump right in and start cold calling random companies. You may have only one opportunity to connect with a prospective business—and one chance to sell them on your offerings—so you have to make sure you’re prepared with what works best.

Before you pick up a phone or draft an email, you need to:

Create your customer profile

Ask yourself one question: Who are you targeting?

You should be able to answer that before you go any further. Not necessarily with specific company names, but general guidelines for your target audience, including the kind of company, industry, size, revenue, and more. Compile all of the criteria that are important to you, and use that information to create a basic profile of the business prospect you want to target.

This will become the basis on which you build all of your B2B prospecting efforts.

Build your CRM

Next, you need to identify your potential buyers.

Maybe these are companies you already have in your CRM, which you can sort and filter to find the ones that meet your ideal customer profile. But you should also be prepared to search for new small and medium-sized business prospects to add to your small business database.

There are several different ways to expand your CRM:

Ask current customers for referrals.
Develop a consistent presence in relevant industry forums and discussion sites.
Reference Google, LinkedIn, and other social media sites.
Partner with outside business data vendors.
Leverage B2B data enrichment to improve existing customer information.

Take these companies and evaluate them according to your customer profile. Depending on how closely they match your desired criteria, you can identify which ones are the top prospects you should focus on.
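As a rough illustration of evaluating companies against a customer profile, here is a minimal Python sketch. The profile criteria, field names, and the simple "count the matching criteria" scoring rule are all assumptions made for the example; a real profile would use whatever criteria matter to your business.

```python
from dataclasses import dataclass


@dataclass
class Prospect:
    name: str
    industry: str
    employees: int
    annual_revenue: float


# Hypothetical ideal customer profile: target industries plus size and revenue ranges.
PROFILE = {
    "industries": {"restaurant", "retail"},
    "employees": (5, 100),
    "annual_revenue": (250_000, 10_000_000),
}


def match_score(prospect: Prospect, profile: dict = PROFILE) -> int:
    """Count how many of the three profile criteria a prospect satisfies (0 to 3)."""
    low_emp, high_emp = profile["employees"]
    low_rev, high_rev = profile["annual_revenue"]
    return sum([
        prospect.industry in profile["industries"],
        low_emp <= prospect.employees <= high_emp,
        low_rev <= prospect.annual_revenue <= high_rev,
    ])


def rank_prospects(prospects, profile: dict = PROFILE):
    """Return prospects sorted from best to worst match; ties keep their input order."""
    return sorted(prospects, key=lambda p: match_score(p, profile), reverse=True)
```

Prospects at the top of the ranked list are the ones to research first. Weighting some criteria more heavily than others would be a natural refinement.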

Do your research

You have your profile and you have a list of prospects. Now, it’s time to do your homework.

Research each company extensively to understand who they are, what they do, where they work, and more. The more you understand them, the more likely you are to make a personal connection and create a pitch that resonates. This not only includes the areas they excel at, but also what they lack. These pain points are critical, because they can be the opening you need to make a compelling sales case for your products and services.

As a part of this work, you should try to identify the right person at the company to contact. You may find specific names, especially through LinkedIn. But at the very least, you should define a general buyer persona—the types of roles that would be responsible for making a sales decision.

In addition, you should also consider why they might object to you—whether it’s cost, product features, support, or something else. That means stepping back and taking an honest look at your company and your potential shortcomings (versus other competitors), and creating a message that turns them into an advantage.

As you do your in-depth research, you may discover that some companies you thought were top prospects really aren’t the best fit for you, or they aren’t ready to buy just yet. That’s ok! B2B sales prospecting can take a little trial and error. The more you can qualify your leads ahead of time, the more targeted (and successful) your prospecting methods will be.

B2B Data for Your Prospecting Today—And Tomorrow

Many aspects of B2B prospecting can be aided or automated by software tools. But no matter how good your tools or your sales and marketing teams are, your success at B2B prospecting and closing the deal starts with having the right data. If you’re not working with information that gives you the highest quality, high-priority leads, then you won’t be successful at attracting new companies.

That’s where Enigma comes in.

When you give your sales and marketing teams the same powerful data and insights that the rest of your organization has, you can approach your B2B prospecting with more clarity, focus, and speed. Enigma’s data sources transform the way you generate new business, from the moment you start prospecting to the moment you onboard a new business client.

Set buyer personas, create more targeted prospecting, increase your ROI—and close the deal faster and more efficiently. To make it even easier, our APIs let you integrate it with your existing systems and processes, so you don’t need to completely change the way you work or the software you currently use.

Interested in learning more about Enigma’s data sources and how they can help your B2B sales prospecting techniques?

Schedule a demo today.

A Pandemic Debut: Chicago’s Door 24 Wine Shop

April Runft — Fri, 29 Apr 2022 00:00:00 GMT

Alcohol sales have been on the rise over the past few years, a trend accelerated by the pandemic. As bars and restaurants closed for lockdowns, sales of alcohol for consumption “off-premises” saw steady increases.

Despite the increases in off-premises buying, it wasn’t a big year for wine shop grand openings.

Our data show new beer, wine shop and liquor store openings declined over the three-year period of 2019 to 2021, with a steep drop off between 2020 and 2021, as you’d expect. And yet.

Kevin Kenny and Michele Fitzpatrick became “pandemic entrepreneurs” when they opened Chicago’s Door 24 wine shop in May 2021. The shop specializes in small-production wines and showcases wineries led by women and people of color.

Search for data on companies like Door 24 – or your neighborhood shop – with Enigma's Console.

Kevin believes the challenging pandemic environment forced him and Michele to think differently from the start and stay agile.

“We offered online ordering with curbside pickup right from the beginning because we had to,” he says. “Otherwise, I don’t know that I would have even thought about it. It’s been a great option. When Covid gets worse, the online and curbside orders start picking back up.”

Illinois liquor laws mean the shop can only ship within state limits, and Kevin says in-state shipments get a small amount of traction. “I do ship a little bit to places like Springfield [Illinois’ capital], because they don’t get this kind of wine down there. But it gets expensive to ship, because of the packaging and needing to verify age at the time of delivery.”

Confidence and predictable revenues

Annual card revenues for these types of stores nationally increased nearly 18% from 2019 to 2020, with a nearly flat change between 2020 and 2021, according to our data.

The average total for a customer’s shopping trip also increased. From 2019 to 2021, transaction amounts at shops nationally grew 16%.

Stability with store revenues has been important for planning Door 24’s inventory. When shelves are fully stocked, inventory adds up to about $45,000 in wine.

The balance of revenues and expenses at the start of the new year can be especially tough, with net-30 terms from the holiday season coming due and revenues dipping with Chicago winter weather. “Working capital would be really helpful in January to offset the giant bill that comes due from stockpiling inventory for the holiday season,” Kevin says.

One way Kevin and Michele lock in predictable monthly revenue, and pilot new wines, is through their wine club.

For a monthly fee, members get two curated bottles of wine, picked up in store, complete with a detailed description of the varietal and winery, plus recommendations on how to pair each with food.

“That’s the box they put me into for loans”

When it comes to accessing capital and completing government processes, Kevin says the shop gets lumped in with liquor stores and bars, despite their differences, with real consequences.

It echoes an “economic friction” that Karen Mills, author of Fintech, Small Business & the American Dream, calls “heterogeneity”: small business needs and operations vary widely, yet our financial and government systems tend to treat them the same.

“There's a rigidity in the loan system. They look at me as a liquor store, like a corner bodega, which isn't really what I do. But that's the box they put me into for loans,” he says. “When we first started out, we used our own money. But we’re pursuing our on-premise license now so people will be able to sit down and enjoy their wine here. To meet requirements, I have to put in another bathroom, and that will take capital. We’re just starting that process.”

In order to build up inventory and get operations running, Kevin and Michele didn't pay themselves for the first five months in business. Lenders found it hard to see beyond that. “They’d ask, ‘do you have income for the first five months?’” Kevin says. “And I do other things, so I had a little bit, but not much. And I had to be like, well, I’m growing the business. We just put everything back into the business at the time.”

Door 24 has had similar challenges with seeking city permits. “The city is very inflexible. To the city I am absolutely a liquor store, no different than the corner store selling pints of malt liquor. I remember when I went to the first hearing to get our zoning approved. The only person that complained said that they were worried about people drinking in the alley behind the shop. I was like, ‘well, we're selling higher-end wine and I don't really think that's going to be a problem.’ But the city doesn't look at it that way.”

After eight months of waiting on their permit – paying rent on the empty shop – and a mere three minutes in front of the zoning board, ultimately Door 24’s permit was approved and Kevin and Michele could open their doors for business. It didn’t take long for Door 24 to gain recognition as a neighborhood gem.

Local wine shops “live and die by the neighborhood”

In 2021, over a one-month period, beer, wine and liquor stores nationally saw a median of 46 card-using customers per day, according to our data.

By Kevin’s count, 60% of Door 24’s revenues come from repeat customers. Of those, 75% are from a radius of about 10 blocks from the store.

“A shop like ours lives and dies by the neighborhood,” Kevin says. “The neighborhood is always going to be the people that regularly come in. People come in from across town, but they're not going to keep you in business.”

So what’s next to bring in the neighborhood clientele and beyond?

“With the on-premise license, now we’ll be able to put tables out front so people can come have a bottle of wine in the summertime,” Kevin says. “And with Covid receding, we’re going to start doing more wine education. I’d like to host classes once a month.”

And if Kevin could leave us with one recommended wine to try?

“Timorasso,” he says. “It's a full-bodied white from the Piedmont in northwest Italy with great complexity, depth, and a rare aging potential. Timorasso is a buzzy grape in Italian wine circles but isn't widely known. It can be hard to track down a bottle but it's worth it.”

How are small businesses in other industries performing? Explore our small business data to find out.

Introducing Enigma for Marketing & Sales

Abhinav Rai — Thu, 14 Apr 2022 00:00:00 GMT

Timely and accurate data about small business health has been hard to come by. It’s a challenge we’ve been helping risk and underwriting leaders solve when it comes to determining creditworthiness.

From conversations with marketing and sales leaders, it’s clear that getting timely small and medium business (SMB) health data is also a major challenge for organizations that find and serve small business customers.

To address that need, we’re rolling out two new ways for marketing and sales leaders to access and use our SMB data.

Discover: Build customized lists of your ideal customers.
Enrich: Identify and prioritize the top prospects in your database.

The problem: challenges for marketing and sales teams today

Often, internal data about your customers isn’t broad or complete enough to give you actionable insights. That’s what we heard from one digital marketing leader at a super-community bank, who was looking to find growth opportunities among a flood of new customers from the Paycheck Protection Program.

We noticed other marketing and sales teams were facing this and similar challenges:

Data sources abound, but it’s hard to find data that can accurately predict top prospects
There hasn’t been a great way to filter businesses by accurate revenue and industry, especially for the smallest businesses
Many marketing leaders want to identify growing SMBs of a given size, especially those with less than $10M in revenue
Existing options for industry attributes weren’t granular enough to allow leaders to target the exact segments of greatest interest
It’s difficult to confirm real, operating SMBs on lead lists
Leaders say they’re often discovering many SMBs on lead lists are out of business
Point-in-time lists don’t reflect major events in a business's lifecycle
Getting in touch with prospects close to a major business event can generate much higher return on investment for marketing campaigns—but that’s hard to monitor manually

In other words: like risk and underwriting teams, marketing and sales teams can’t make effective decisions on stale data, either.

The need was clear. We set to work on two new products – Discover and Enrich – to help marketing and sales teams better understand their customer bases and prioritize growth opportunities.

Data with high (accuracy) standards

Our dataset of SMBs debuted for risk and underwriting use cases. Risk decisions are expensive, so our bar for accuracy has to be high.

Our dataset reflects 40-50% of all cards in the United States, covering virtually any business that accepts a credit card — more than 18 million business locations. The data contains signals like card revenue and growth based on real card transactions, and whether a business has an ecommerce presence. We layer in granular industry classification drawn from online sources.

Discover and Enrich draw upon this dataset, offsetting the marketing and sales pain points we heard, so you can now:

Discover the small businesses that matter most to your business
Generate custom prospect lists of real, operating business locations based on the attributes you care about most, including revenue, growth, industry, and key events in a business’s lifecycle
Identify new and fast-growing businesses before the competition
If they accept cards, even the smallest businesses are captured in our data
Confidently predict which businesses will be top prospects
Enrich your prospect or customer lists with near real-time data about industry, revenues, growth, or average transaction size to better prioritize opportunities
Don’t waste money, time and effort on marketing to inactive businesses
Identify promising opportunities for cross-sell and up-sell with current customers

Here are a few examples of ways you might put the data to work:

Show me a list of all growing retail locations in the Northeast with >$100,000 in annual card revenue.
Identify all of my current customers that are in the restaurant industry and have suddenly started to grow in the last 3 months.
Pull a list of the fastest growing e-commerce businesses with >$1 million in annual card revenue.
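To make the first of these requests concrete, here is a minimal Python sketch of the filter it describes. The record layout and field names are placeholders invented for the example, not Enigma's schema, and "Northeast" is assumed to mean the nine states of the U.S. Census Bureau's Northeast region.

```python
# States in the U.S. Census Bureau's Northeast region.
NORTHEAST = {"CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT"}


def growing_northeast_retail(locations, min_revenue=100_000):
    """Keep retail locations in the Northeast with positive growth and
    annual card revenue above min_revenue.

    Each location is a dict with hypothetical keys: industry, state,
    annual_card_revenue, and revenue_growth_12m (a fraction, e.g. 0.08).
    """
    return [
        loc for loc in locations
        if loc["industry"] == "retail"
        and loc["state"] in NORTHEAST
        and loc["annual_card_revenue"] > min_revenue
        and loc["revenue_growth_12m"] > 0
    ]
```

The other two requests follow the same pattern, with a different industry, growth window, or revenue threshold.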

Getting continually refreshed data with highly accurate industries and revenues, plus compelling events in a business’s lifecycle, can boost ROI on your marketing and sales campaigns, unlock growth opportunities, and save you from wasting marketing budget.

What’s next?

We’ll continue to build out our tools for marketing and sales teams. Here’s a look at a few items on the roadmap and what they’ll unlock for you:

Build lists of umbrella businesses, in addition to business locations
Filter on additional attributes, like headcount and headcount growth, and
Access a self-serve user interface, where you can generate your own lead lists based on custom specs.

Learn more about Discover and Enrich.

Related download: SMB Data to Unlock Revenue Growth (PDF)

Promoting Our Data Testing Paradigm with Internal Serverless Websites

Ayyoub Boutaieb — Wed, 30 Mar 2022 00:00:00 GMT

For companies wanting to share repositories or internal information in a secure manner, hosting websites internally is a must. So when our team faced the need to host internal static websites and did not have a clear company standard as to how to do so, we set out to establish our own.

With great data comes Great Expectations (Framework)

Providing a data-centric product under demanding service-level agreements requires repeatable data testing. We recently converged on the Great Expectations Framework to run dataset-specific assertions on every data set we ingest.

The Great Expectations Framework includes features to make your data testing journey as seamless as possible. One capability is a visual documentation tool, aptly named Data Docs, that renders the results of past data tests as HTML.

We use Great Expectations to provide comprehensive data tests to proactively identify issues in our data pipeline. This ensures that discovering and fixing data issues is fast and reliable, allowing us to consistently provide data quality at the level our customers expect.
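The idea of dataset-specific assertions can be sketched in plain Python. This is a simplified stand-in written for illustration, not the Great Expectations API and not our production suite; the function names, column names, and result layout are all assumptions.

```python
def check_not_null(rows, column):
    """Fail if any row is missing a value for `column`."""
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not failing, "failing_rows": failing}


def check_between(rows, column, low, high):
    """Fail if any non-null value of `column` falls outside [low, high]."""
    failing = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"{column} in [{low}, {high}]", "success": not failing, "failing_rows": failing}


def run_suite(rows, checks):
    """Run every (function, kwargs) pair against the rows and summarize the outcome."""
    results = [check(rows, **kwargs) for check, kwargs in checks]
    return {"success": all(r["success"] for r in results), "results": results}


# Hypothetical suite for one dataset.
REVENUE_SUITE = [
    (check_not_null, {"column": "revenue"}),
    (check_between, {"column": "revenue", "low": 0, "high": 1_000_000_000}),
]
```

Each result records which check ran and which rows failed. That is the kind of information a rendered report needs so a reader can see why a test failed without reading the pipeline code.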

Of course, fixes sometimes require a cross-team collaboration that involves both technical and non-technical users. To allow our non-technical users to independently remedy data issues, we need to provide them with enough information to understand why the data test failed. This requires us to provide visual feedback for every dataset’s test run.

Data Docs are therefore a perfect fit. Given the sensitive nature of the data and tests at hand, the Data Docs need to be hosted internally.

Generalizing the use case

While researching options to host Data Docs internally, we learned of other teams needing to internally host static data. An important feature for this use case is the ability to list the website’s files in an indexed format through a URL.

We settled on finding a common way to address both use cases. We’d look for a custom storage solution that: a) we could point to using a DNS and parse without compromising security, and b) would allow us to keep the websites internal. Both teams wanted to keep the hosting simple in order to avoid tremendous development and maintenance overhead while alleviating cognitive overhead on whoever needs access to said information.

Final requirements

Our final common solution should then:

Serve web pages residing on S3 (HTML, CSS, and JS files)
Be discoverable via DNS but only through an internal VPN, and
Support HTTPS or the ability to add an SSL Certificate for secure access without each use case needing its own certificate definition.

To make the solution extensible to future use cases, we also needed to create a framework that allows us to add and maintain internal sites with minimal effort – ideally through a configuration-driven environment.

Establishing the company standard

Our Ultimate Hosting Solution

We ended up centering our solution on an Amazon Virtual Private Cloud where we hosted an ECS instance with a Fargate Task. The Fargate Task contains an NGINX instance that loads and renders our websites from their designated folders or buckets. Inside the Virtual Private Cloud, we added a single instance of Amazon ALB for its SSL and firewall support. On the routing side, we kept Amazon Route 53 from a previous research iteration as our DNS, due to its scalability in serving multiple nested private subdomains (or zones).

Not only did adding the Fargate Task allow us to meet our staticity requirement, but it also made the setup serverless and, therefore, easier to maintain.

On the storage end, we had to ensure that our S3 storage setup and separation of IAM roles did not interfere with the loading or access of the website. So we configured IAM-based VPC access on specific buckets for internal static web hosting to allow the Fargate Task to browse said buckets.

The solution solved our internal static hosting need, but having to configure S3 and Route 53 for each of our websites would still take some development overhead. Luckily, we could minimize the effort through a configuration-driven approach for our framework.

The Hosting Framework

To reduce maintenance and development overhead for current and future use cases, it was important to lessen the work needed to take a use case from ideation to proof of concept. Leaving the CI/CD logic and mechanism up to each website’s internal owners, we focused on ensuring the following:

Our VPC environment was pluggable on any bucket it was tasked to map
A website storage entity (i.e. a new and standalone S3 bucket) would be easy to create and map to
A new DNS route on our Amazon Route 53 private zone would be easy to create and map

With the VPC being a one-time internal configuration, we decided to concentrate our efforts on creating:

A Terraform configuration template so that our S3 buckets were private, had “Static Website Hosting” enabled, and granted read access to our VPC
A Terraform configuration template for a DNS record to point to our internal Load Balancer
An NGINX configuration template to proxy pass our S3 bucket’s URL on our new DNS name
A CI process that automatically validates and applies new configurations
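To give a feel for the configuration-driven approach, here is a minimal Python sketch that fills an NGINX server block from a small site configuration. It is an illustration under stated assumptions, not our actual template. The configuration keys, the example names, and the choice to have NGINX listen on plain HTTP behind the load balancer are hypothetical, and the S3 website endpoint format shown is the dash-style one, which varies by region.

```python
from string import Template

# Hypothetical NGINX server block. It avoids NGINX "$" variables on purpose,
# because string.Template also uses "$" for its placeholders.
NGINX_SERVER_TEMPLATE = Template("""\
server {
    listen 80;
    server_name ${site_name}.${private_zone};

    location / {
        proxy_pass http://${bucket_endpoint}/;
        proxy_set_header Host ${bucket_endpoint};
    }
}
""")


def s3_website_endpoint(bucket: str, region: str) -> str:
    """Build a dash-style S3 static website endpoint.

    Some regions use "s3-website.<region>" with a dot instead; check the
    endpoint format for your region.
    """
    return f"{bucket}.s3-website-{region}.amazonaws.com"


def render_site(config: dict) -> str:
    """Render the server block for one site; raises KeyError if a key is missing."""
    endpoint = s3_website_endpoint(config["bucket"], config["region"])
    return NGINX_SERVER_TEMPLATE.substitute(
        site_name=config["site_name"],
        private_zone=config["private_zone"],
        bucket_endpoint=endpoint,
    )
```

With a template like this, adding a site means supplying a handful of values, such as a site name and a bucket name, rather than writing a new server block by hand.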

Finally, to make the method of hosting an internal website more accessible to anyone within the company, we wrote an internal how-to guide, including all the required code templates for a quick copy-paste and plug-and-play approach, finally closing the requirement loop.

Other options we researched

The GitLab Pages Route

With our self-hosted instance of GitLab, it initially made sense for us to look at the use case of GitLab CI to build and deploy Data Docs to GitLab Pages internally. We already had experience with this particular paradigm as we previously developed and maintained a custom dataset-centric metrics dashboard on GitLab Pages.

During our feasibility research for Data Docs, we encountered a complication where GitLab Pages’ deployment had to be triggered through CI. In contrast, our implementation of Great Expectations runs in a Python context under Airflow, and Data Docs has to be refreshed at each run of Great Expectations.

A solution we could have gone with in this paradigm is to have each run of Great Expectations trigger GitLab CI to recalculate and redeploy Data Docs. However, assuming the scenario where two or more data ingestion and validation pipelines run simultaneously, both pipelines would be attempting to trigger their respective runs of Data Docs calculation and deployment. That concurrency could have unforeseen volatility around Data Docs’ deployment output: it could technically invalidate the assertion process and its logs and create delays due to mandatory, manual verification of the Data Docs render. We therefore decided to research more hands-on paradigms that could accommodate our needs.

Using a DNS to Point Directly to Amazon S3

We use AWS extensively, so we looked at the tooling we already use on a daily basis. Given that Amazon S3 has the option of hosting a static website, our initial idea was to use Amazon Route 53 as a DNS to point to an Amazon S3 bucket.

We steered away from this option because S3 static website hosting neither supports HTTPS nor accepts requests made from a different domain.

Using Amazon CloudFront

Fortunately, we found a way to solve the issue using Amazon's CDN, CloudFront. This allowed us to add HTTPS support to our S3 elements.

After further investigation, we had to forgo this path: CloudFront is only able to serve S3 content through the public domain, potentially exposing the files and information we would put in our S3 buckets.

Conclusion and next steps

From initial research to proof of concept and then adoption, the whole process was a true multi-team effort involving various Enigma stakeholders from Tech Ops, Web Developers and Data Engineers. This new infrastructure is already bearing fruit, both in terms of impact and adoption: we have already built the two previously mentioned use cases and migrated a pre-existing project from GitLab Pages.

A “nice to have” for convenience would be automating the template generation and mapping through CI. Making the process of hosting a full-on internal static website as simple as populating variables such as a bucket name, website name and homepage file name would truly make it easier and faster for our less-technical users.

5 Calibrations for Hybrid-Remote Work

Ryan Green — Tue, 15 Mar 2022 00:00:00 GMT

I’ve recently been reflecting back on our transition to remote work—particularly on how we got it wrong. Then, how we got it right.

We recognized early on in the pandemic that we weren’t going back to our offices, and by August 2020 had rolled out a permanent hybrid-remote policy. So, fortunately, our employees didn’t experience the whiplash of multiple canceled return-to-office dates with each new Covid variant.

As a leader with twenty years of experience managing technical teams, I needed to figure out the fundamental difference between in-person and remote work.

The communication threshold

The communication threshold is the conscious decision whether to reach out to someone to ask a question or share an update.

I’ve observed that it requires a greater degree of comfort and motivation to contact a colleague when you’re physically distanced. And it progressively increases the less frequently you work with someone. This seemingly small factor has an outsized impact on work outcomes across technical teams.

For example: let’s say I’m a developer who needs Ben to finish upgrading a shared library so I can submit my merge request. Here’s how that scenario may play out in different environments:

In person: I look over at Ben’s desk and see he’s chatting with his neighbor. I decide it’s a good time to drop by and ask him how it’s going and when he thinks his change will get reviewed. I get a status update and ten minutes later, I’m back at my desk proceeding with other work.

Remote: I glance at our Slack channel on and off all morning, watching for an update. I think about Slacking him at noon but hold off, thinking, “it’s not urgent and, for all I know, Ben is working on a production issue now.” There’s no status update by the end of my workday.

While the impact here is minor, at scale these interactions accumulate and, even in a medium-sized company like Enigma, lead to a significant loss of team effectiveness.

Once we understood the subtle impact remote work has on our communication norms, we were able to develop techniques for more effectively working together.

It requires a greater degree of comfort and motivation to contact a colleague when you’re physically distanced. And it progressively increases the less frequently you work with someone. —Ryan Green, Chief Technology Officer, Enigma

Five calibrations for hybrid-remote work

In late 2020, we undertook our first large, cross-functional project as a remote company. While we delivered a major product innovation on time, it wasn’t a pleasant experience for anyone. Our higher communication thresholds meant we weren’t communicating or managing dependencies effectively. This led to multiple cases of last-minute notice, rework, late nights and unnecessary stress. It became clear we needed better tools and methods for delivering complex projects as a remote team.

After a retro on the project in early 2021, we made five calibrations to improve our ability to deliver high-quality products as a remote team:

1. Build structure to keep the information flowing

In a remote environment, the cognitive cost of staying in sync is higher. You can’t just turn around to ask a question, so more questions may go unanswered. We’ve become very disciplined about sharing information.

We organize all product requirements and engineering specs in a project workspace. We want everyone in the company who’s interested to understand every project (we use Notion).
We implemented a shared calendar for review sessions of every product requirements doc and engineering design. This ensures that anyone from the company who’s interested can join. These open peer reviews have prevented weeks of heartache by building awareness of changes early on.

2. Get specific in planning

In a remote environment, coordination between different teams is more challenging. While individual teams may have daily stand-ups to coordinate efforts, there’s no virtual equivalent of walking over to chat with someone on a different team (polling someone regularly on Slack to ask for updates generally isn’t appreciated).

The solution we’ve adopted is to require a detailed project plan and a short weekly project sync for every piece of work that will require more than one engineer sprint. This simple technique has allowed us to complete all 15 projects this quarter within two weeks of the projected completion time—without teams putting in extra hours.

Planning sometimes gets a bad rap. Much of this comes from organizations that use plans to coerce teams into signing up for unrealistic goals. We use plans as a tool to improve the quality of our engineering and data science work. Plans allow us to see where we need to cut scope to provide breathing space to build maintainable systems (simple designs, high-levels of automation and testability, effective documentation, etc.).

We also know that plans cannot account for surprises that inevitably require additional time to work through. Planning is a cognitive tool for thinking realistically about the future so we can provide a high degree of predictability and transparency to everyone involved.

3. Invest in documentation

The higher cost of communication in a remote environment revealed that we had underinvested in documenting our systems and processes. We repeatedly observed our most experienced people resorting to Slack to ask simple questions about our systems that they couldn’t answer themselves. This clearly illustrated the cost of relying on an informal system of “tribal knowledge.”

Now, we have a policy of adding every question we answer to our “SMB Knowledge Base,” which we think of as a product and technical spec for our entire platform. We’ve also built out a robust “how to” section with actionable instructions for common tasks.

This investment in documentation has allowed people to be more self-sufficient in understanding the what and why of our systems. Plus, the act of writing out long sets of instructions has highlighted parts of our system that are overly complicated and candidates for automation and streamlining.

4. Create opportunities for in-person bonding

Enigma employees frequently cite that one of their favorite things about working here is “the people.” Through the pandemic we lost many of the in-office rituals we had developed, like Friday team beers, weekly board game night, and team lunches with new joiners.

Despite our best efforts, reenacting these activities virtually resulted in an unsatisfying facsimile. There’s just no substitute for spending time together.

We make it a priority to bring the team together in person at least twice a year (for folks who are comfortable). These are work-lite, social-heavy affairs that allow old friends to reunite and new connections to be formed. In 2021, the engineering and data science team gathered for retreats in upstate New York and Savannah, Georgia.

5. Actively monitor work-life balance

In a remote setting, it’s harder to get a read on how engineers and teams are doing professionally and personally. In the office, you can look around and see who’s been heads-down the whole day and who’s there every evening when you leave. There are dozens of micro-interactions each week, like short conversations in the kitchen, where tone and body language offer clues about a person’s wellbeing and mental state.

These cues are less obvious in the remote world. As a result, we’ve had to become more intentional about establishing norms to promote a healthy work-life balance. Small changes, such as asking team members to queue up after-hours emails and Slack messages to send the next morning, and conducting regular PTO audits to encourage people to take at least one week of PTO per quarter, have made a difference.

Most importantly, the investments we make in planning help to set expectations and make our work more predictable so people can more fully enjoy their lives outside of work.

In 2021, we delivered projects of a higher quality at a faster, more sustainable pace than at any point in Enigma’s history—including when we were co-located in an office.

In retrospect, the involuntary transition to remote work – and the effects on team communication that come along with it – has been a blessing, forcing us to think carefully about how we work and find new ways to collaborate. There’s certainly more to learn and we’re excited about getting even better in 2022.

The investments we make in planning help to set expectations and make our work more predictable, so people can more fully enjoy their lives outside of work. —Ryan Green, Chief Technology Officer, Enigma

How Data Is Transforming the SMB Lending Landscape

Enigma — Wed, 02 Mar 2022 00:00:00 GMT

There are 32 million small businesses in the United States. Half of the U.S. workforce owns or is employed by a small business. And small business owners create two-thirds of net new jobs annually.

Yet despite their importance to the economy, small business owners are chronically underserved regarding access to capital.

But now more than ever, financial institutions are using new sources of data to better identify and serve more small and medium businesses (SMBs) without increasing risk.

For our panel discussion at Fearless in Fintech 2021, experts from across the financial ecosystem joined Madeline Ross, Enigma’s VP of Marketing, to share their perspective on the state of the SMB economy, how data is used in lending decisions, and why collaboration between banks and fintechs is an advantage to both lenders and borrowers:

Karen G. Mills, a Senior Fellow at Harvard Business School and President of MMP Group, served in President Obama's Cabinet as Administrator of the Small Business Administration from 2009-13. She is the author of Fintech, Small Business & the American Dream.
Laura Kornhauser, CEO and Co-Founder of Stratyfy, which uses machine learning solutions to help banks and credit unions optimize credit risk assessment and fraud detection.
Adam Bell, Senior Vice President of Digital Marketing at Customers Bank, which has approximately $20 billion in assets, locations in nine states and operates nationwide.

Here’s an overview of their discussion.

The SMB economy, by the numbers

For our State of SMB Economy Report, Enigma analyzed card revenues from a sample of more than 16 million U.S. businesses to measure the impact of the first 12 months of the pandemic on the SMB economy (starting in March/Q2 2020). We found that many small businesses saw a steep decline in sales over that period, with the lowest point in card revenues occurring in April 2020. Recovery was slow, especially in the restaurant and travel industries.

But there were signs of hope too: a boom in “pandemic entrepreneurs” with thriving new businesses. According to the Peterson Institute for International Economics, the number of businesses launched between March 2020 and March 2021 was 23% higher than in the same period between 2019 and 2020. In our sample, the survival rate for businesses launched in 2020 was approximately 46.5% higher than that of businesses launched in 2019.

Meanwhile, in small business lending:

Banks are Still Big Players

The lending landscape is evolving, with new entrants in the space and traditional banks using technology in new ways. According to the Federal Reserve, 42% of businesses that applied for financing in 2020 sought funding from a large bank, up just slightly from 40% in 2019. But 43% of businesses applied to a small bank, up from 36% in 2019.

SMBs Are a Smaller Part of Overall Bank Lending Than Ever

According to Mills’s book Fintech, Small Business & the American Dream, small business loans amounted to about 20% of banks’ lending in 2017, compared to 30% prior to the 2008 financial crisis.

The Needs of SMBs Remain Largely Unmet

Small business owners and entrepreneurs need access to a variety of credit sources. Short-term credit matters for the day-to-day management of cash flow, while longer-term credit is essential for capital investments. Yet less than half of small businesses report that their credit needs are met, according to a Federal Reserve survey. The Fed also reports that the share of applicant SMB firms that received all the financing they sought declined from 51% in 2019 to 37% in 2020.

The challenge of lending to SMBs and the rise of PPP

So why aren’t small businesses getting the funding they need?

Mills says there are a number of “frictions and barriers” that hinder small business lending. The most prominent of these are information opacity and heterogeneity.

“It's hard to see inside a small business and know if they're creditworthy, which is what we mean by 'information opacity,'” she explains. Making that determination requires more data points than banks can typically access or have the expertise to evaluate.

The second “friction,” heterogeneity, means that all small businesses are different.

“One day you're lending to a dry cleaner, the next to a funeral home, the next day to a cafe, the next day to a parts supplier,” says Mills. So it’s difficult for a lender to know what the credit profile for a creditworthy dry cleaner would look like.

But if that lender had credit data for 1,000 dry cleaners, they’d have an idea of what makes one risky or promising. It would be fairly simple to tell whether the 1,001st dry cleaner is worthy of lending to. That’s what big data can do — the sheer volume of information, which machines can parse much more quickly than people, can produce insights that reduce these historical frictions.
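The dry cleaner reasoning can be made concrete with a small sketch. The Python below is illustrative only: the choice of metric (monthly card revenue), the peer values, and the function name `peer_percentile` are assumptions, not anything described by the panel. It simply places a new applicant within the distribution of peers from the same industry.

```python
from bisect import bisect_right


def peer_percentile(peer_values: list[float], applicant_value: float) -> float:
    """Return the share of peers (0-100) whose value is at or below the applicant's."""
    if not peer_values:
        raise ValueError("need at least one peer to compare against")
    ranked = sorted(peer_values)
    # bisect_right counts how many peers are <= the applicant's value.
    return 100.0 * bisect_right(ranked, applicant_value) / len(ranked)


# Hypothetical monthly revenues for peer dry cleaners (made-up numbers).
peers = [8_000, 12_000, 15_000, 21_000, 30_000]
applicant_rank = peer_percentile(peers, 18_000)
print(f"Applicant sits at the {applicant_rank:.0f}th percentile of peers")
```

A real underwriting model would weigh many signals at once. The point is only that a large enough peer group turns "is this dry cleaner typical?" into a question that can be answered with a lookup.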

Data can thus “drive more access and opportunity, particularly for underserved small businesses [like] women-owned businesses and minority-owned businesses, which are most subject to these frictions and barriers,” Mills says, noting potential benefits to the economy including more diverse businesses and more jobs.

There are a number of 'frictions and barriers' that hinder small business lending. The most prominent are information opacity and heterogeneity. —Karen G. Mills, Senior Fellow at Harvard Business School and President, MMP Group

The Paycheck Protection Program (PPP) created a burst of small business lending at a frenzied pace, and even banks experienced in serving small businesses rushed to keep up.

Customers Bank participated in nearly 350,000 PPP loans with a value of $9.6 billion. But the bank’s approach to PPP was very different by the time the program ended in May 2021.

Bell said the Customers team wanted to enable access to PPP loans for as many clients as possible. To do that, the bank needed to compete on a bigger scale — and be nimble enough to do things differently.

That nimbleness requires “pipes of data,” Bell notes. So the bank partnered with fintech companies, including Enigma, to rapidly parse that data and efficiently process applications — enabling even the smallest businesses, like sole proprietors, to get better access to PPP funding.

Data as a solution: More ‘yes’ — even to the smallest borrowers

Data is only useful to financial institutions if the people who work for those institutions can use it.

“It's universally agreed that data has a ton of value, but extracting that value can be quite challenging,” says Kornhauser. “Ensuring that lenders, whether fintech or bank lenders, have the right technology in place to be able to extract the information they need from the data is hugely important.”

That’s why collaboration across a number of “core competencies and expertise areas” — like bank lenders and fintechs — is what’s necessary to solve this problem. In other words, it’s not enough to gather the data, categorize it and put it in context; lenders need the tools to analyze it.

“It's not just figuring out who's creditworthy, it's doing it for the smallest borrowers,” says Mills. “The people who have gotten left out are not just the underserved, but the small-dollar borrowers.”

She notes that about 75% of small business borrowers want a loan of less than $150,000. But loans of less than $100,000 don’t represent significant revenue for most banks. It isn’t economical to assign a banker to a small business owner who seeks a $7,000 loan — the potential profit just isn’t that significant.

But there's a big demand for small-dollar loans, says Mills. And companies like commerce solution Square have found success meeting that demand. Square’s average loan is $6,000, a size it can serve because it uses an automated system to approve or decline applicants quickly.

The average PPP loan disbursed by Customers Bank was under $30,000 — which “was an eye-opener for me,” says Bell. “To take on a loan size of $30,000 or less to a mass audience ... and to do it with speed, is quite a task.”

He agrees that the sub-$100,000 level is “a loan size that is often forgotten; sometimes small business owners just need a little bit to get to that next level ... [or] keep the lights on.”

It's not just figuring out who's creditworthy, it's doing it for the smallest borrowers. The people who have gotten left out are not just the underserved, but the small-dollar borrowers. —Karen G. Mills, Senior Fellow at Harvard Business School and President, MMP Group

Transparency, trust and truth

As technology enables more borrowers to access capital, we hear a lot about “transparency” in lending terms, risk modeling, and data itself.

Mills says “transparency in all forms of lending is crucial, not just for the lenders and borrowers, but also for external parties like regulators that are involved in these markets.”

Kornhauser cautions the industry to remember that defining what we mean by transparency in lending is important. We need to ensure that transparency “goes beyond just folks that are really comfortable with and familiar with data” and aim instead toward democratizing access to and knowledge of how loan approval decisions are made — especially if that loan decision is made by an algorithm instead of a banker.

Mills says it’s also critical for the industry to ensure financial data has integrity.

“This is a place where I've really been delighted to get engaged with Enigma and see the quality of the data they’re bringing to the data aggregation world,” she adds. “Because that data can't just be anything. Data is tricky stuff, and you have to be sure that whatever you think it's representing, it is truthfully representing.”

When PPP rolled out, stories about who deserved those loans, who actually got them, and who committed fraud were all over the media. Does technology help or hinder financial institutions in identifying malicious actors?

Ultimately, “trustworthy, accurate, truthful data, like the data Enigma provides, is a huge piece of addressing this problem and doing so proactively,” says Kornhauser, who also thinks we need “human-plus-data ... Especially when it comes to issues of fraud, there are certain things people would be able to sniff out before data alone could tell us.”

We must ensure transparency goes beyond folks that are familiar with data and aim instead toward democratizing access to and knowledge of the loan approval process — especially if it's done by an algorithm instead of a banker. —Laura Kornhauser, CEO & Co-Founder, Stratyfy

Advice for SMB lenders: ‘Move faster’

What would the experts suggest for lenders looking to become more data-centric?

“Just get started,” says Kornhauser, who thinks many lenders can get overwhelmed by the process of adopting new technology. But even “small, incremental changes can be really meaningful and impactful, and can snowball in all the best ways.”

Bell says that he and his colleagues at Customers Bank know they need to keep iterating and innovating.

“One check is not enough for a lot of these businesses,” he explains. “So how can we start to offer multiple products or services that accommodate those that need it the most? … Thinking into the future has to happen, while we're still standing up some of these environments and data assets.”

Mills thinks lenders should “move faster — because this is a difficult problem, but this is not rocket science.”

According to Mills, technology will enable lenders to “look inside a small business and see whether they're creditworthy,” while the businesses themselves will benefit from “an accurate forecast of their cash flow.”

While data promises to continue to be “transformative” for businesses’ access to credit, the bigger picture is even more revolutionary, Mills adds.

“We'll have more opportunity in our economy for more people to have the American dream.”

-------------------------------------------------------

This article is based on a panel discussion from the December 2021 Fearless in Fintech conference. Watch the full panel discussion.

KYB Requirements Checklist: What Data You Need and How to Collect It

Enigma — Tue, 01 Mar 2022 00:00:00 GMT

Know Your Business (KYB) is the legal requirement for financial institutions to verify the identity of every business customer before doing business with them — and to keep verifying over time. The goal is to prevent working with entities involved in money laundering, fraud, or other financial crimes.

If you're a bank, fintech, online marketplace, or any other financial institution that works with business customers, the CDD Rule requires you to:

Verify the business's identity
Verify the identity of that business's managers and owners
Monitor and track risk of that business over time

This checklist covers the specific data you need to collect to meet those requirements, and the three paths available for collecting it.

The data you need to collect

Know Your Business

To verify the identity of a business entity, you need to collect:

[ ] The business's legal name and any aliases or DBA names
[ ] The business's registered and operating addresses
[ ] Proof of active registration (typically via Secretary of State filing)
[ ] Whether the business conducts activity in a high-risk category
[ ] Whether the business appears on any sanctions or watchlists

Each of these items has compliance implications. Registration status needs to be current — a business that was in good standing at onboarding may fall out of good standing later. High-risk activity classification covers categories like cannabis, adult entertainment, gambling, firearms, money transfer services, and others that require additional scrutiny or may be outside your acceptable risk appetite. Watchlist screening typically means running against OFAC lists, which are updated regularly.

Know Your Owners

In addition to the business entity itself, you must collect data on each Ultimate Beneficial Owner (UBO) — defined as any individual with more than a 25% ownership or voting stake in the company — and at least one person who holds significant managerial control. For each of those individuals, you need:

[ ] Full name, date of birth, address, and SSN or TIN
[ ] Whether the owner appears on any crime or sanctions watchlists

KYB regulations allow institutions to trust self-reported UBO information from businesses unless they have specific reason to doubt it. One trigger for doubt: when an owner's name appears in KYB data but doesn't match the owner listed on the business application. When discrepancies like this appear, they need to be resolved before onboarding.
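The two rules in this section, the ownership threshold and the owner-name discrepancy trigger, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Owner` class, the field names, and the simple case-insensitive name comparison are hypothetical, and the sketch models only ownership share (a voting stake or the managerial-control person would be handled separately).

```python
from dataclasses import dataclass

UBO_THRESHOLD = 0.25  # "more than a 25% ownership or voting stake"


@dataclass
class Owner:
    name: str
    ownership_share: float  # fraction between 0 and 1 (assumed representation)


def beneficial_owners(owners: list[Owner]) -> list[Owner]:
    """Owners whose stake is strictly greater than the 25% threshold."""
    return [o for o in owners if o.ownership_share > UBO_THRESHOLD]


def normalize(name: str) -> str:
    """Case- and whitespace-insensitive form used for a simple comparison."""
    return " ".join(name.lower().split())


def owner_discrepancies(application_names: list[str], kyb_names: list[str]) -> set[str]:
    """Names that appear in KYB data but are missing from the application."""
    applied = {normalize(n) for n in application_names}
    return {n for n in kyb_names if normalize(n) not in applied}


# Hypothetical example data.
owners = [Owner("Jane Doe", 0.40), Owner("John Roe", 0.25), Owner("Ann Poe", 0.35)]
ubos = beneficial_owners(owners)  # exactly 25% does not qualify
to_resolve = owner_discrepancies(["Jane  Doe"], ["jane doe", "John Roe"])
```

Any name returned by `owner_discrepancies` would need to be resolved before onboarding. Production matching would likely need fuzzier comparison than this exact-match sketch.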

Three ways to collect and use this data

Once you know what data you need, the next question is how to find it, aggregate it, and use it to make onboarding decisions. There are three approaches.

Option 1: In-house KYB

Many smaller financial institutions build their KYB processes entirely in-house, rather than relying on a specialized external provider.

[ ] Invest — Build an auto-approval infrastructure in-house, build a manual review team, or both.
[ ] Verify businesses — Pull name, address, registration status, and SoS filing details through internal data infrastructure. Screen against the OFAC list.
[ ] Verify UBOs — Pull UBO data from SoS filings where available. (Note: UBO information is sometimes present in SoS filings and sometimes not. FinCEN has been working toward a centralized UBO database, but availability varies.) Screen UBOs against the OFAC list.
[ ] Monitor over time — The CDD Rule mandates ongoing monitoring. In-house programs need bespoke methods to update customer records and re-verify businesses based on their risk profile.

In-house KYB offers full control and may make sense for institutions with simple, low-volume programs. For anything more complex, the operational overhead tends to be high and auto-approval rates tend to lag what specialized providers can achieve.

Option 2: Single outsourced service and data provider

Some institutions work with one external service and data partner — either supplementing an existing in-house program or replacing it entirely.

[ ] Invest — Pay a setup fee and annual licensing fee to access the provider's data on an ongoing basis.
[ ] Verify businesses — The provider auto-approves businesses that meet verification criteria and flags others: businesses without an SoS filing, those with mismatched names or addresses, businesses in high-risk industries, and potential OFAC matches. Flagged businesses go to manual review — handled in-house or through an additional manual review service the provider offers.
[ ] Verify UBOs — The provider pulls UBO data from SoS filings and screens against the OFAC list.
[ ] Monitor over time — The provider periodically re-checks SoS registration statuses, re-screens for risky activities, and re-screens against the OFAC list.
[ ] Establish trust — Validate provider accuracy by periodically sampling auto-approved businesses to confirm approvals are correct.

A single provider handles most of the heavy lifting while still allowing meaningful customization. Companies using Enigma as their sole KYB provider are estimated to reduce KYB costs by up to 80%.

Option 3: Waterfall multiple data providers

Some institutions work with multiple data providers through a third-party data aggregation platform — "waterfalling" business applications through a sequence of providers until a match is found. Platforms like Alloy or Oscilar connect multiple providers into a single KYB decisioning endpoint.

[ ] Invest — Pay for a data aggregation platform that uses multiple data sources for auto-approvals. Manual review still needs a separate solution.
[ ] Verify businesses — The platform tries to verify the business using the first provider in the sequence. If that provider can't match it, the application moves to the next, and so on. This typically produces higher match rates and broader coverage of risky activity data than any single provider alone. Unmatched businesses go to manual review.
[ ] Verify UBOs — The platform uses data from multiple providers to verify UBOs. It also screens UBOs against the OFAC list. When UBO names differ between the SoS filing and the application, the discrepancy must be resolved before approval.
[ ] Monitor over time — The platform periodically checks SoS statuses, re-screens for risky activities, and re-screens against the OFAC list.
[ ] Establish trust — Run monthly checks on individual data sources to confirm auto-approval accuracy.

The waterfall approach maximizes coverage and is the most adaptable to changing legislation — new data sources can be added as requirements evolve without rearchitecting the whole program.
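The waterfall logic described above can be sketched as a short loop. This is a hedged illustration, not how Alloy, Oscilar, or any other platform actually implements it: the provider functions, the application fields, and the result shape are all hypothetical stand-ins.

```python
from typing import Callable, Optional

# A provider takes an application and returns a match record, or None when it
# cannot match the business. Real provider APIs will differ.
Provider = Callable[[dict], Optional[dict]]


def waterfall_verify(application: dict, providers: list[tuple[str, Provider]]) -> dict:
    """Try each provider in sequence and stop at the first match."""
    for provider_name, lookup in providers:
        match = lookup(application)
        if match is not None:
            return {"status": "matched", "provider": provider_name, "match": match}
    # No provider matched, so the application goes to manual review.
    return {"status": "manual_review", "provider": None, "match": None}


# Hypothetical stand-in providers for illustration.
def provider_a(app: dict) -> Optional[dict]:
    return None  # pretends it has no record of any business


def provider_b(app: dict) -> Optional[dict]:
    if app["legal_name"] == "Acme Dry Cleaning LLC":
        return {"registration_status": "active"}
    return None


sequence = [("provider_a", provider_a), ("provider_b", provider_b)]
result = waterfall_verify({"legal_name": "Acme Dry Cleaning LLC"}, sequence)
unmatched = waterfall_verify({"legal_name": "Unknown Co"}, sequence)
```

Because the sequence is just an ordered list, adding a new data source is a one-line change, which is the adaptability the waterfall approach is credited with.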

Choosing the right approach

The right data collection approach depends on your volume, risk profile, and how much you want to own versus outsource.

	In-House	Single Provider	Waterfalled
Best for	Low-volume, simple programs	Most institutions	High-volume or complex programs
Auto-approval coverage	Limited	High	Highest
Cost	High overhead	Up to 80% savings	Additional 50% on top of single-provider savings
Regulatory flexibility	Limited	Medium	High

For a deeper look at the trade-offs involved in each approach — including input from industry CPOs at Alloy and IDology — read A Guide to Optimizing Your KYB Process. And if you're evaluating what a single KYB data provider can do for your approval rates, see what Enigma KYB delivers.

Ready to put this checklist to work? Learn more about Enigma KYB or reach out to the team to talk through your specific requirements.

A Guide to Card Transaction Data

Enigma — Wed, 16 Feb 2022 00:00:00 GMT

Data about debit and credit card transactions can be a powerful tool for understanding consumer spending trends.

Eight in 10 Americans report they have at least one credit card, and there were more than 511 million active consumer credit cards in the United States in Q1 2020.

The pandemic accelerated movement away from cash: according to McKinsey, by the end of 2020, U.S. consumers used cash for just 28% of transactions, compared to 51% a decade prior.

What is card transaction data?

“Card transaction data” typically refers to data generated when a credit card is used to purchase goods and services from a business. To protect privacy, individual card holders are anonymized and transactions are aggregated.

But card transaction data can include more than just consumer credit cards. The card data can be derived from all kinds of cards, including debit cards, small business cards, corporate cards, and charge cards. The data can also include digital transactions, also known as “card not present” transactions.

Where does card transaction data come from?

Transaction data provided by data companies can come from a variety of sources. Data may come from a bank integration. Data can also be aggregated by a card issuer, a credit card network, or a payment processor at the point of sale.

When working with transaction data, it’s crucial to understand what kind of source it comes from. Many sources may skew towards certain groups of consumers, geographic areas, or types of transactions. Knowing the size of the sample and any biases in the data source enables you to better understand how to derive trustworthy insights from the data.

Raw transaction data is notoriously difficult to analyze. The challenge: in its raw form, the data is messy, inconsistent, and sometimes duplicative, requiring organization and cleanup at scale before it’s ready to tap for insights.

Here’s an example. In raw transaction data for the Bodhi Leaf coffee shop in Orange, California, different payment processors refer to the same business as “Bodhi L,” “Bodhi Leaf Coffee,” “Bodhi Leaf Coffee Traders,” “Bodhi Leaf Trading Company,” and “Bodhi Leaf Tradi.”

Uniting this data into a holistic view of transactions at a business level requires sophisticated algorithms and entity resolution techniques to clean and match the data.
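To give a feel for the problem, here is a deliberately naive Python sketch that groups the name variants from the example. It is not Enigma's method. The matching rule (one name is a truncation of the other, or both share their first two words) and all function names are assumptions chosen for illustration, and the rule would wrongly merge unrelated businesses with similar names.

```python
import re


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def likely_same_merchant(a: str, b: str) -> bool:
    """Naive rule: one name truncates the other, or both share their first two words."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return False
    if na.startswith(nb) or nb.startswith(na):
        return True
    ta, tb = na.split(), nb.split()
    return len(ta) >= 2 and len(tb) >= 2 and ta[:2] == tb[:2]


def group_names(names: list[str]) -> list[list[str]]:
    """Greedy grouping: add each name to the first group with a matching member."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if any(likely_same_merchant(name, member) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


raw_names = [
    "Bodhi L",
    "Bodhi Leaf Coffee",
    "Bodhi Leaf Coffee Traders",
    "Bodhi Leaf Trading Company",
    "Bodhi Leaf Tradi",
    "Main Street Deli",  # hypothetical unrelated merchant
]
groups = group_names(raw_names)
```

The five Bodhi Leaf variants land in one group and the unrelated name in another. Matching reliably across millions of businesses would require further signals beyond the name string, which is where the more sophisticated techniques come in.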

How you can use card transaction data as an indicator of business performance

Historically, card transaction data analytics has been used as a bellwether for consumer trends. When aggregated at the cardholder level, this data helps both marketers and government agencies understand buyer preferences and macroeconomic trends.

Recently, however, it’s been recognized that card spending data can also provide valuable insights about the health of a business. Looking at trends in card revenues, transaction volumes, and customer concentrations can reveal whether a business is growing or declining. When aggregated by business, this data is often referred to as “merchant transaction data.”

Card revenue does not reflect all of a business’s revenue, but COVID-19 has accelerated the trend of consumers using cards over cash. Merchant transaction data is especially helpful for businesses operating in industries where a high proportion of transactions are made by card, for example retail shops, restaurants, and service providers.

Merchant transaction data can help multiple teams at an organization:

Underwriting teams incorporate card revenue and transaction trends into their models for more accurate setting of initial credit limits.
Risk teams use fresh monthly revenue data to monitor the health of their customer portfolios and mitigate potential damage before it occurs.
Marketing teams use revenue and transaction data to improve lead segmentation and scoring, as well as purge their lead databases of closed businesses to improve campaign ROI.
Sales teams use revenue trends to identify fast-growing businesses and prioritize their prospecting targets.

What to consider when selecting a transaction data source

When evaluating a card transaction dataset, asking the right questions can help you compare the options and understand which dataset best suits your needs.

What is the latency?

How fresh is the data? How frequently is it updated?

What is the coverage?

How many cards are included in the panel? Is it just credit cards or debit cards as well? How many businesses are covered in the dataset?

What is the bias of the credit card panel?

What is the scope of the panel? Is it just Visa or just Mastercard? Is it skewed to certain geographies or income classes?

How can I use it?

Some data providers may require you to get permission from a business before accessing its transaction trends. Others, like Enigma, have already integrated privacy protection into their system so that you can immediately access data about any business.

Interested in learning more about Enigma’s Merchant Transaction Signals? Get in touch for a demo.

Did NYC Bounce Back in 2021? Manhattan vs. Brooklyn

Enigma — Thu, 03 Feb 2022 00:00:00 GMT

While the latest COVID-19 variant and cascading supply chain challenges continue to throttle the U.S. economy, a number of sectors have managed to navigate the pandemic remarkably well. In New York, the first American hub to have been deeply hit by the pandemic, recovery is well on its way; across some industries, the comeback dwarfs the first positive signs of resurgence in the summer of 2020.

Using our business entity-level data on credit card transaction amounts, we explored the difference between industries in Manhattan and Brooklyn, two of the city's best-known boroughs. Here’s what we found.

Growth Industries By Zip Code: Brooklyn and Manhattan

The map below depicts the industries with the largest year-on-year growth (June, July, and August 2020 vs. those same summer months in 2021) across Manhattan and Brooklyn zip codes.

Hovering over the map displays the zip code and the highest-growth industry in that area, and also selects the other zip codes where that industry came out on top.

Note both the differences between the two boroughs — personal care services are best-performers in Manhattan alone, for example, while leisure facilities are noteworthy in Brooklyn — and within each of the boroughs themselves.

Growth Trends by Borough

We also assessed the highest growth industries in each borough in the charts below.

Indeed, leisure facilities saw more than double the transaction amounts in summer 2021 relative to the year prior.

In Manhattan, like in Brooklyn, the top spot was claimed by tourism-related businesses, with hobby, toy, and game stores taking second place.

An interesting trend to note: we saw parallel growth focused on the home front, with the rise of pandemic-era home improvement, and also out on the town: a return to enjoying the food and entertainment that the city has to offer (though the latter seems to be winning).

For more insights on pandemic effects and small business recovery, explore the State of the SMB Economy Report.

Methodology

We combined NAICS code classification data for over 75,000 business locations, covering all businesses in Enigma’s database located within New York State's Kings and New York counties, to create a list of businesses located in Manhattan and Brooklyn. Using our Merchant Transaction Signals, we then calculated their year-on-year growth in the summer of 2021 (average monthly transactions across June, July, & August 2021 vs. average monthly transactions for June, July, & August 2020).

For our map, an industry was required to have at least 5 businesses at the zip code level in order to include it as a high-growth category. For the borough-level comparisons, we set a floor of 50 businesses per industry. The total sample included over 75,000 enterprises.
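The calculation described above can be sketched in Python. This is a minimal illustration rather than the actual analysis code: the record layout, the function names, and the choice to aggregate an industry by summing its businesses' summer averages are assumptions, since the methodology does not spell out how business-level figures were rolled up.

```python
from collections import defaultdict
from statistics import mean

SUMMER_MONTHS = ("06", "07", "08")


def summer_average(monthly: dict[str, float], year: int) -> float:
    """Average monthly card transaction amount for June through August."""
    return mean(monthly[f"{year}-{m}"] for m in SUMMER_MONTHS)


def industry_growth(businesses: list[dict], min_businesses: int) -> dict[str, float]:
    """Year-on-year summer growth per industry, skipping thinly covered industries."""
    by_industry: dict[str, list[dict]] = defaultdict(list)
    for b in businesses:
        by_industry[b["naics"]].append(b)

    growth = {}
    for naics, members in by_industry.items():
        if len(members) < min_businesses:
            continue  # below the floor (5 per zip code, 50 per borough in the post)
        base = sum(summer_average(b["monthly"], 2020) for b in members)
        current = sum(summer_average(b["monthly"], 2021) for b in members)
        if base > 0:
            growth[naics] = current / base - 1.0
    return growth


def flat(amount_2020: float, amount_2021: float) -> dict[str, float]:
    """Helper for toy data: the same amount in every summer month of each year."""
    return {
        f"{year}-{m}": amount
        for year, amount in ((2020, amount_2020), (2021, amount_2021))
        for m in SUMMER_MONTHS
    }


# Made-up records with placeholder industry codes.
businesses = [
    {"naics": "713", "monthly": flat(100.0, 250.0)},
    {"naics": "713", "monthly": flat(100.0, 150.0)},
    {"naics": "722", "monthly": flat(200.0, 300.0)},
]
growth = industry_growth(businesses, min_businesses=2)
```

With a floor of 2 in this toy data, the single-business industry is dropped and the remaining one shows growth of 1.0, meaning transaction amounts doubled.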

Would granular industry data be helpful for you? Give ours a try with a free sign-up for Enigma Console.

Retreat Recap: Technical Teams Convene in Savannah

Enigma — Wed, 26 Jan 2022 00:00:00 GMT

It was an early Wednesday morning, November 2021. My heart bursting with excitement, I got out of bed to check my phone. Thirty minutes until the Uber arrived to take me to Newark Airport. It was Day 1 of Enigma’s Engineering/Data Science team retreat in Savannah, GA.

I started my new job at Enigma Technologies as a Technical Recruiter mid-October, focused on scouting engineering and data science talent for the company. My own remote onboarding process meant I didn’t get the chance to meet my co-workers face to face, so I was thrilled to be participating in the retreat, one way Enigma was bringing employees together safely in a newly hybrid/remote environment.

“We are a remote-first company and we work extremely effectively as a remote team,” said Ryan Green, chief technology officer at Enigma. “However, I think it’s important for people’s wellbeing that we come together and have that face-to-face bonding time, working on problems together and having unstructured time for fun activities and getting to know each other.”

By the time the plane boarded, I had already met three of my coworkers: my supervisor, Brian; Moon, an engineer; and Jon, a data scientist. By the time we landed in Savannah, two more had joined, and our group of five headed to the hotel, our retreat headquarters.

IRL: Breaking the Ice

By afternoon everyone was checked into their rooms and we began meeting up in the hotel lobby lounge. In our group of 18, there were a lot of new faces, including mine, so Ryan suggested we do an ice breaker: guessing a person based on three interesting facts about them. This worked. Even the slightest bit of awkwardness was completely gone by the time dinner rolled around. Everyone was talking and laughing, sparking up connections that just couldn’t happen talking remotely, screen to screen.

“As my first time in the US, coming from Morocco, I was expecting to have fun and good conversations with people. It was far better than I expected,” said engineer Said Mancouri. “Meeting my co-workers was one of the best things that happened for me this year.”

After a night of fun, good food and drinks, we all gathered the next morning for a strategy workshop. We brainstormed and pitched ideas on how to build awareness about Enigma and our engineering and data science teams, like hosting technical meet-ups or webinars with leaders of different data companies.

In the afternoon, we ventured out on a food tour of Savannah, visiting restaurants, speakeasies and local shops that told the history of the city through food, drinks and memorabilia. This was my first time visiting the South, and the vibrant culture and stunning architecture blew me away. Fun fact from the tour: the moss that hangs from trees in Savannah was named “Spanish moss” by French explorers because it reminded them of Spanish conquistadors’ long beards.

After the tour, everyone had a few hours to rest. Then the real fun—in my opinion—began.

Competing in the “SMB Challenge”

Ryan created a game called the “SMB Challenge” (SMB stands for small and medium-sized businesses), a scavenger hunt with a twist. We were divided into groups of four and given one hour. The goal was to visit local shops, make a purchase at each to get a receipt as proof, then look up each shop’s revenues using Enigma’s API console. The team whose list of shops had the highest total revenues would win.

Once the clock began everyone was focused and ready to win. Some teams started with research first, opening Google Maps and Enigma’s Console to make a targeted route of high-revenue businesses. Other teams were quick to their feet, heading out to visit as many shops as they could.

This sparked a friendly competition between all of us, and in the end, the winning team’s list totaled more than $70 million in collected revenues, about three times the average total of the other teams.

Engineer Mikhail Pechagin was a member of the winning team. “I work on the Applied Tech team and focus on compliance solutions, so I do not get much time to interact with Enigma’s core product. It was nice to learn about it in action,” he said. “To hit the most revenue for the challenge, we checked for the most expensive looking places in the area, while spending as little as we could in each place. We assumed that an expensive looking place will require a lot of revenue to stay afloat.”

“The SMB Challenge opened up the perspective of how Enigma’s data could be used for consumer-focused use cases,” said engineer Osasu Eboh. “Like the way Google Maps gives you a complete overview of a business—Enigma’s data could be used to enrich that kind of summary.”

Competition was fierce but stayed friendly. What we gathered from this challenge was not only a deeper understanding of our own product, but a chance to practice teamwork and understand one another better.

“I believe collaboration, teamwork and support are our super powers here at Enigma, and we can operate so much more effectively because of this level of trust that we have,” Ryan said. “The vision for the Engineering and Data Science team is to build up the organization while keeping these pieces that make us extremely special.”

Carrying the Energy Home

I think retreat attendees would agree: the goal that Ryan set for this trip—to become closer as coworkers—was achieved. And that culture of collaboration and shared mission carries on once the team returns to working remotely.

“Engineering at Enigma focuses on long term impacts that our initiatives can make,” said engineer Moon Kang. “Through collaborations with stakeholders we constantly identify what customers look for and deliver with the highest quality of code.”

“Engineers here want to see each other succeed,” said Osasu Eboh. “There have been multiple occasions where engineers are providing feedback into other’s proposals and everyone is eager to help whenever someone is having issues.”

“The amount of autonomy provided to us at Enigma lets us make our own decisions and implement them, which is more than we get at other companies,” Mikhail Pechagin added. “Also, there is a great culture for work/life/health balance. Managers and leads look out for engineers by checking on how they are doing outside of meeting work goals.”

I boarded my flight early Friday morning feeling a little sad to be leaving, but excited knowing that there will be other fun company events in the near future. More importantly, I felt I had built closer relationships with my supervisor Brian, coworkers and Ryan.

And as a recruiter for Enigma, I can’t wait to help future Enigmites have a similar experience.

If this sounds like your kind of culture, check out our job openings. We’re always looking for talented engineers, data scientists, and more.

Meet Prime

Enigma — Wed, 12 Jan 2022 00:00:00 GMT

Prime: Lending as a Service for Community Banks

In the past ten years, a new generation of companies has ushered in an unprecedented era of innovation for small businesses.

Companies like Stripe and Square abstracted a decades-old payments infrastructure and turned it into composable software for both online and in-person commerce. Shopify has small businesses finally competing with the behemoth of Amazon (in the second quarter of 2021, traffic to Shopify sites surpassed Amazon's). Flexport is transforming logistics and global trade, even leading policy conversations in emergency supply chain failure response management.

Small business owners are now better equipped than ever with the tools to scale – except when it comes to capital.

The 30 million enterprises that make up the small business economy are fundamentally diverse. Their capital needs differ from one another and from those of big businesses, and very few lenders know how to serve them at scale. According to the Fed, 56% of small businesses do not have their financing needs met. Karen Mills, President Obama’s Small Business Administration chief during the 2008 financial crisis, cautiously noted in her book that small business lending by banks was 20% in 2017 compared to 30% pre-crisis.

Covid-19 was the five-alarm fire for the small business economy. It was more than a massive credit crunch: large segments of the economy were forcibly shut down, with only uncertainty looming on the horizon. The government spent close to a trillion dollars on the Paycheck Protection Program (PPP) bailout, which, though marred by a faulty start and allegations of fraud, was ultimately extremely impactful to the survival of millions of businesses.

PPP was also, from a data perspective, a unique opportunity to analyze the landscape of small business and the financial services ecosystem that serves it. An immediate observation was the sheer efficiency of credit unions and community banks in processing PPP loans. We learned that small and nimble community banks are scaling their operations rapidly and keen on adopting fintech. An ecosystem of SaaS offerings that go as deep as core banking rails is helping community banks integrate into the modern financial ecosystem.

At Enigma, we focus on providing transparent data on the identity and health of millions of US businesses. We are modeling the SMB economy merchant by merchant. And we know that many of them are thriving, yet unable to access the capital they need to invest in and grow their business. Over the past year, we started working with community banks and became fascinated by how our intelligence on small and medium businesses could better serve them. It was clear from almost every conversation that the biggest gaps were in lending.

We decided that we could start solving this problem by building lending-as-a-service infrastructure across the lifecycle of the loan for a community bank. From origination, underwriting, and risk, to servicing and securitization — there exist along this chain many opportunities to democratize best-in-class loan infrastructure.

Very early on we realized that we would build this as a spin-off. At Enigma, we are, at heart and in our culture, always a data company. Our strategy is to be maniacally focused on data quality, attribute by attribute, entity by entity. We believe it’s very important that data companies not build overly verticalized products, or else we would be competing with an extremely rich and rapidly growing SaaS ecosystem of purpose-built, domain-specific applications.

Enigma provides intelligence and context through our dataset, and we stay highly attuned to where and how our data is being used. In the community banking branch of the ecosystem, the gap to deliver automated intelligence was too big. The opportunity was clear, and Prime—a company to be founded by Enigma—was born.

Along our journey to spin off Prime, we were lucky to already be working with leaders in finance and technology. Each brings deep expertise and passion, and an understanding that there is much more work to do to unlock growth in small and medium businesses:

Capital One, a leader in small business credit cards. A current customer of Enigma’s data, Capital One has seen firsthand how better data about small businesses leads to growth.
Third Point, an investor in Enigma and one of the early investors in the revolutionary consumer lending platforms Upstart and SoFi. Third Point was pivotal in building the core business and developing the securitization programs, and will bring that expertise to Prime.
Customers Bank, a super-community bank that emerged as one of the largest lenders of the Paycheck Protection Program. True to its “high touch, high tech” value, Customers Bank is a pioneer shaping the future of small business banking, evidenced by how they’ve embraced the possibilities of fintech and cryptocurrency.
NEA, a global venture capital firm and early investor in Enigma. NEA has partnered with founders on initial go-to-market strategy for many successful platforms like Jet.com and Coursera.

Today we are announcing that Prime has received initial funding of $49M to build out its core team and product offerings. I will act as interim CEO of Prime until we transition Prime over to a new leadership team over the next two quarters.

We are excited about what the months ahead will look like, knowing Prime’s mission is to unlock growth and lending for all kinds of communities. We have been quite busy at Enigma on all fronts, and it’s been exciting to see how focusing on data has allowed us to be more creative in how we approach partnerships in our ecosystem writ large.

-Hicham Oudghiri

Chief Executive Officer

Enigma Technologies Inc

Q&A: Paige Graham of Paige’s Candle Co.

Enigma — Tue, 30 Nov 2021 00:00:00 GMT

What does access to capital mean for today’s small businesses, and what’s it been like to operate through the pandemic?

The Enigma Blog sat down with Paige Graham, founder and creative director of Brooklyn, New York-based Paige’s Candle Company, to get her perspective.

Enigma Blog (EB): Tell us about your small business.

Paige Graham (PG): Paige’s Candle Company is a homegoods business with a specialty in vegan soy wax candles. We sell our candles in retail stores and online, and we also do custom candle creation for brands.

We approach candle-making a little differently. Rather than using readymade fragrances, we create our own by blending essential and natural oils in top, middle, and base layers to create custom scents.

We also offer community workshops to provide an outlet in the arts for New Yorkers, especially low-income young people.

EB: How do you decide what becomes a new candle scent?

PG: We want our candles to create an experience, and I get inspiration from friends, family, and nature.

For example, with our wild grass candle, I wanted to make a scent that allowed a friend of mine who’s severely allergic to grass to experience that part of nature.

Another example: fireplaces create such a cozy atmosphere, but they’re rare in New York City. So we created a firewood candle to recreate that fireplace experience.

EB: How has the pandemic affected Paige’s Candle Company?

PG: It has been both a blessing and incredibly stressful.

On one hand, we saw an increase in sales, with more people spending time at home and looking to create a relaxing environment.

On the other hand, many markets around the city shut down, so we lost a lot of our usual revenue. The pandemic also created supply chain challenges. We’re still seeing a shortage of jars in the U.S., with delays of 7-10 business days on supply orders. Prices went up 25% on soy wax, one of our primary raw materials. There has also been more competition from other companies for natural, raw ingredients, as consumers have become more aware of what they’re putting in and around their bodies.

EB: What’s been your experience accessing capital as a small business?

PG: During the pandemic we applied for and received a PPP loan. That was fantastic – of course forgivable loans are wonderful. I worked through an alternative lender and it went very smoothly. I think in part that was because the company’s books were in order.

I’ve tried to avoid other types of loans because of high interest rates. But it can be really hard to do without, especially with issues like the supply chain backups. I’ve found that platforms like Shopify have been good funding options: easy to use with long payback timelines.

EB: What does capital mean for your business?

PG: It lets me take a big sigh of relief! That was especially the case during the pandemic.

Revenues took a hit without all the city markets. For myself, I knew I could cut back to eating toast with mayo if I had to. But getting the PPP loan meant I could keep my staff, that they’d be OK. And Big Bertha too [Big Bertha is the company’s lovingly named industrial wax melter].

When I foresaw a shortage of jars on the horizon, capital also meant that I was able to stockpile extra pallets to keep as inventory. But even with that precaution, we’re still running low on jars. Our workaround has been to diversify our products and the vessels we offer for candles. And that requires capital, too.

And recently capital allowed us to buy a second wax melter (already nicknamed “Big Betty”), which will increase our production capacity as we head into a busy season.

EB: How would you grow the business if money were no object?

PG: I’d invest in more equipment and a larger production space. I’d work through distributors, which can be expensive. And I’d also find the perfect permanent event space to host our community candle workshops.

EB: If you could snap your fingers and fix one of the challenges of running a small business, what would it be?

PG: Definitely to fix the current supply chain issues. But I’d say a few others, too.

I’d also perfect my time management, to give myself more time to oversee all the different aspects of my business, like retail, online, events, and specialty orders. Finally, the Small Business Administration has been a great resource for launching a business, but the information tapers off once you’re established. I’d like instant access to more information for semi-mature small businesses: how do we grow and expand?

For more insights on the small business economy, explore our State of the SMB Economy Report.

Rise of the “Pandemic Entrepreneur”

Enigma — Wed, 17 Nov 2021 00:00:00 GMT

Throughout the pandemic, we’ve witnessed its outsized impact on small businesses, conditions that spurred government responses like the Paycheck Protection Program to help the segment survive.

But in all the sobering findings, there’s a surprising bright spot: the rise of “pandemic entrepreneurs.”

A pandemic-business boom

Researchers have estimated that there was a 23% uptick in U.S. business start-ups in 2020, compared to 2019. And these enterprises are doing better, too. Our recent State of the SMB Economy Report finds that businesses launched during the pandemic have a survival rate about 46.5% higher than businesses started in 2019.

To better understand business performance, we compared average card revenues and operating status for businesses started in 2020 with businesses started in 2019. We found that businesses in our sample started in 2020 (the 2020 cohort) were more likely to still be operating the following year, compared to those started in 2019 (the 2019 cohort).

Of the 2019 cohort, 41.67% were still operating in Q1 2020, compared to 61.18% of the 2020 cohort that were still operating in Q1 2021.

Comparing monthly card revenues, the 2020 cohort had average monthly revenues 9% higher than the 2019 cohort.

What’s driving the spike?

Why this boom of pandemic startups that are outperforming fellow SMBs?

Some of them may serve needs that emerged from pandemic conditions. Others may offer products or services better suited to digital delivery. Still others may be long-standing “side hustles” that became a business owner’s full focus.

A side hustle story

Greg Fischer is one pandemic entrepreneur who turned his side hustle into a full-fledged business. In September 2020, Fischer and his business partner launched Burn Pit BBQ, a veteran-owned, Wisconsin-based business offering barbecue meat rubs, sauces, apparel, and content for meat grilling and smoking fans.

“We were already experimenting with homemade rubs and sauces as a hobby, and I had experience building side businesses, websites and products,” said Fischer. “The pandemic gave me a lot of time to reflect on my then-career. And at that point we noticed that more people were cooking from home. August 2020 felt like the right time for me to leave corporate America and focus on Burn Pit full time.”

The partners funded their launch with savings, going to market with a website and a few product offerings. Demand has steadily increased.

“We get feedback from consumers that they are looking to spend locally, shop small, and support veterans,” Fischer said. “On the B2B side, we’ve seen an uptick in orders for appreciation kits for remote employees and locally sourced gifts for other businesses.”

As a growing business with a need to stock more inventory, getting access to additional capital has been a challenge. “We had to start with a business credit card, which wasn’t ideal,” Fischer says. “Now we’re looking to open a line of credit through a national bank, and though our revenue growth is there, we can’t show the typical two years of history.”

If the national bank relationship doesn’t pan out, Fischer says he and his partner will explore other options like community banks and veterans organizations that offer grants and loans.

“If I had my choice to fix one challenge of running a small business, it’d be getting access to capital without bringing on more investors,” Fischer said.

The increase in pandemic startups is a positive sign that, despite challenges the pandemic forced upon the SMB segment, the entrepreneurial spirit is still alive and well. And those fledgling pandemic enterprises mean more opportunity for investment at each stage of their journeys.

For more insights on the current SMB economy, explore the full report.

Methodology

We looked at a sample of 16 million businesses across the United States in our Merchant Transaction Signals dataset. For each of these businesses, we analyzed monthly card revenues based on aggregated credit and debit card transactions from a panel of 700 million anonymized cards.

We defined the beginning of the pandemic in the United States as March 2020, when Covid-19 was declared a national emergency. We looked at monthly gross revenues across all businesses from January 2017 to March 2021. Because card transactions can show high seasonality, all growth rates cited are year-over-year growth rates. To calculate year-over-year growth rates we use the formula (Month 2021 - Month 2020) / Month 2020. If transactions disappeared entirely for more than 3 months, we defined the business as having ceased operations.
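
As a rough illustration of the two rules above, here is a minimal Python sketch. The function names and the sample figures are hypothetical, and the cessation check assumes that any run of more than three consecutive zero-revenue months counts, which is one reading of the rule rather than the exact production logic.

```python
from typing import Optional, Sequence


def yoy_growth(current: float, prior: float) -> Optional[float]:
    """Year-over-year growth: (Month 2021 - Month 2020) / Month 2020.

    Returns None when the prior-year month had no revenue, because the
    ratio is undefined in that case.
    """
    if prior == 0:
        return None
    return (current - prior) / prior


def has_ceased_operations(monthly_revenue: Sequence[float], max_gap: int = 3) -> bool:
    """True if revenue is zero for more than `max_gap` consecutive months."""
    run = 0
    for revenue in monthly_revenue:
        # Extend the current run of empty months, or reset it on any revenue.
        run = run + 1 if revenue == 0 else 0
        if run > max_gap:
            return True
    return False


# Hypothetical card revenues for the same calendar month, one year apart.
print(yoy_growth(current=120.0, prior=100.0))
# Three empty months is still within the threshold; four is not.
print(has_ceased_operations([100, 0, 0, 0, 90]))
print(has_ceased_operations([100, 0, 0, 0, 0]))
```

Comparing a month with the same month a year earlier is what removes the seasonality mentioned above, since both values sit at the same point in the yearly cycle.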

3 Tips for Adding Validation into Your Data Science Workflow

Enigma — Fri, 15 Oct 2021 00:00:00 GMT

Validation is a critical part of working with data. In data science, it’s how we check our work across huge datasets, confirming that our output is accurate and high quality.

There’s no set formula for how to validate a dataset. It will depend on your company’s business model and stage of growth — making it tricky to find the right approach.

Our work in data science at Enigma drives top-line revenue. Validation isn’t just a ‘nice to have’ process behind the scenes: it’s a core part of our data science workflow. It’s important that we build and maintain a scalable validation process that we’re confident in.

Getting here has been a journey, and we continue to refine our processes.

How Our Validation Has Evolved

As Enigma transitioned into the small business data company we are today, our validation process has changed and matured.

As the dataset has continued to grow, new customers have been interested in different subsets of the data. So rather than tailoring validation by customer, the team began to experiment with building a consistent testing sample.

Here are a few key lessons we’ve picked up along our validation journey.

Tips for Adding Validation to Your Workflow

1. Keep a customer focus

From the beginning, keeping the customer experience front and center has been key.

Early on, when we’d release an updated dataset, the team looked at many of the indicators that customers are supposed to see in the data and generated hypotheses:

How many of these companies are detectable real companies?
How many should we have found?
How many could we have in our data, theoretically, but we just don’t?

Today, we validate during research. We define a sample that we’re comfortable with, making sure we’re validating based on what our clients are seeing, not just baseline distribution.

Having a customer focus also means validating as far towards the end of the pipeline as possible. Before anything moves into production we check the end of the line to make sure we’re seeing it through the eyes of our customer.

Expect that each customer in a B2B setting will vary in terms of the samples they’re interested in. It can help to use a poll of customers.

2. Make your validation repeatable

Repeatability will depend on your organization’s growth stage. If you’re a startup still working to achieve product market fit, it can be hard to know what is repeatable.

Approaching validation from scratch each time can mean wasted effort and having to revalidate. The customer market can also change, affecting your samples.

A major challenge with validation is that we’re looking at hundreds of millions of pairs of data. Working across the entire dataset wouldn’t be feasible for each validation. Instead we had to figure out how to cluster.

We’ve also evolved to introduce more automation into our validation process, like outsourced and automated data labeling.
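
One generic way to keep a validation sample stable from run to run is to pick records by hashing their identifiers instead of drawing random numbers. The Python sketch below shows that idea only; it is not a description of our clustering approach, and the function name, salt, and sample rate are all hypothetical.

```python
import hashlib


def in_validation_sample(record_id: str, sample_rate: float = 0.01,
                         salt: str = "validation-v1") -> bool:
    """Decide deterministically whether a record belongs to the sample.

    The same record_id and salt always give the same answer, so the
    sample is identical across runs and machines.
    """
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer spread evenly over [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)


# Hypothetical identifiers; in practice these could be keys for record pairs.
record_ids = [f"pair-{i}" for i in range(10_000)]
sample = [rid for rid in record_ids if in_validation_sample(rid, sample_rate=0.1)]
print(len(sample))
```

Changing the salt produces a fresh sample, which can help when the customer market shifts and the old sample no longer reflects what clients see.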

3. Document, document, document

As you begin to crystallize a customer-focused, repeatable process, it’s important to invest in clear product documentation on key data science decisions.

Without documentation, you’ll get ad hoc decisions and internal contradictions. Efficiency begins with stringent guidelines that are documented and rooted in customer investigation.

And showing your work is also important for external stakeholders. We’re very open about our data models. Customers are using insights from our data to make decisions, so that transparency is important — it builds trust.

Let's Try Again: Making Retries Work With Cloud Services

Robert Grimm — Tue, 12 Oct 2021 00:00:00 GMT

Not surprisingly, Amazon's AWS enforces rate limits on its services. Its client libraries also incorporate automatic retries, which may allow an application to gracefully recover after exceeding those rate limits. But under heavy data volumes and with the AWS default retry strategy, a process can still trigger rate limits and fail despite retrying.

In this article, we will:

Review the basics of retries as a failure handling strategy.
Explain how retries interact with rate limits, and how to solve the resulting problem, illustrated by Python source code.
Explore several other technical properties of effective retries.

Our journey begins a couple of months ago, when one of our data processing pipelines failed because AWS returned a 503 Slow Down error. After inspecting our logs and consulting Amazon's documentation, my team determined that a bulk copy between two S3 buckets exceeded Amazon's rate limits.

At this point, I was already implementing retries in my head, a challenge familiar from previous employers and projects. After all, retries — executing the same operation with the same arguments again, possibly after a short delay — are simple and sufficient for overcoming many (but not all) failures.

In fact, we constantly use retries in our everyday lives: think about asking somebody to repeat what they just said because face masks make it harder to understand people, or swiping a credit or subway card again (and again) when the first swipe didn't register.

After digging a little deeper into Amazon's documentation, it turned out that all their client libraries already implement retries and automatically employ them when appropriate — so much for me implementing retries again. Under the default settings of boto3, the Python client for AWS, our pipeline stage didn't just fail once but failed five times on the same copy operation. Maybe retries aren't so simple after all.

Recent changes to AWS' implementation of retries suggest that AWS engineers might just share that sentiment. In February 2020, boto3 demoted the previous retry mode to “legacy” status (though it still is the default) and gained two new modes called “standard” and “adaptive”. As it turns out, the latter solves our problem. But the technical reasons aren't intuitively obvious.

This blog post starts out by explaining the basics of when to use retries. We then explore their interaction with rate limits and show the Python code for configuring boto3 to gracefully handle that case. Along the way, I’ll explain several other technical aspects of rate limits and retries. By the end of the article, you’ll have a solid understanding of both rate limits and retries that you can apply to your own work.

1. Only Try Retries When...

Retries differ from other techniques for providing fault tolerance such as data replication or distributed consensus in that they are strikingly simple: just do it — again! As a result, we don't need to go through complex algorithms for retries themselves (yay!). But we do need to review the three primary criteria for when we can use retries.

1.1 Operations Are Idempotent

Fundamentally, retrying is only safe if an operation is idempotent, i.e., repeated execution results in the exact same outcomes or state. If you were to call

<div class="code-wrap"><code>transfer_money_from_me_to(recipient, amount)</code></div>

and your bank's API timed out, retrying would be OK if your request was dropped by the load balancer in your bank's frontend but not so much if only the confirmation was lost in the tangle of your bank's microservices. The problem is that you can't tell a priori which of the two happened, and retrying this particular operation might just drain your account of all its funds. To put this differently, retrying is safe if and only if doing so eventually converges on the same global system state.

Conveniently, the protocol enabling the web, HTTP, was explicitly designed with retries in mind. They are part of the representational state transfer or REST model underlying the web. In particular, DELETE, GET, HEAD, and PUT are idempotent. GET and HEAD are also read-only. Of the methods most REST APIs rely on, only POST is not idempotent and thus not safe to retry (PATCH, where it is used, is not guaranteed to be idempotent either). So when designing a REST API, you probably want to minimize the use of POST. Sure enough, AWS does just that. They expose their cloud services through a REST API and, out of 96 endpoints in the service description for S3, only six use POST.

The central role of retries for recovering from failures on the web is seen in browsers and services alike. Notably, it explains why browsers have a reload button in the first place. It also explains why most websites warn us not to use the reload button during payment processing, which typically is implemented as a POST. Finally, it explains why eventual consistency is the dominant consistency model across the web. After all, that's just what we get when combining idempotent operations with retries and also the reason I already used similar language in the above definition of safe retries.

That's all well and good. But what should we do about operations like the above endpoint to transfer money? To find a solution, it helps to ask how we distinguish seemingly indistinguishable events, such as birthdays or rent payments, as well as fungible packaged goods, such as cereal boxes and milk cartons. The answer is straightforward: We add another attribute based on date/time or a monotonically increasing counter. That certainly works for the above endpoint as well:

<div class="code-wrap"><code>transfer_money_from_me_to(recipient, amount, transaction_id)</code></div>

The bank doesn't know how a client generates such a transaction ID. That is entirely up to each individual client. But the bank does commit to performing only one transfer, even if several requests with the same transaction ID are submitted to the endpoint. It may also want to reject any request that shares a transaction ID with a previous request but differs in recipient or amount. This pattern can be found in many REST APIs. The corresponding operations may be named create_or_get_something.
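
To make the transaction ID pattern concrete, here is a minimal in-memory sketch of the server side in Python. The `ToyBank` class, its return values, and the sample IDs are hypothetical; a real service would keep this record in durable storage and guard against concurrent requests.

```python
class ToyBank:
    """Hypothetical bank that performs at most one transfer per transaction ID."""

    def __init__(self):
        self.transfers = {}   # transaction_id -> (recipient, amount)
        self.total_sent = 0

    def transfer_money_from_me_to(self, recipient, amount, transaction_id):
        previous = self.transfers.get(transaction_id)
        if previous is not None:
            # Same ID with different details is a client bug, not a retry.
            if previous != (recipient, amount):
                raise ValueError("transaction ID reused with a different recipient or amount")
            # A genuine retry: report the earlier result without moving money again.
            return "already performed"
        self.transfers[transaction_id] = (recipient, amount)
        self.total_sent += amount
        return "performed"


bank = ToyBank()
print(bank.transfer_money_from_me_to("alice", 100, "txn-001"))
# The client never saw the confirmation, so it retries with the same ID.
print(bank.transfer_money_from_me_to("alice", 100, "txn-001"))
print(bank.total_sent)
```

With the ID in place, the retry converges on the same state as a single successful request, which is exactly the property that makes retrying safe.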

1.2 Failures Are Transient

Even if it is safe to retry an operation, that doesn't necessarily mean we should actually retry the operation. We also have to consider the nature of the failure. For an example, let's consider our favorite banking endpoint again. If the bank returns an error indicating insufficient funds, retrying will only have the same result. The bank won't transfer non-existent funds no matter how often (or nicely) we ask. We first need to transfer funds into the account or open a line of credit.

However, if the request times out, we don't know the outcome of the request and should definitely try again. In short, retrying only makes sense if the failure is non-deterministic or transient. While it is safe to retry an idempotent operation on any error, doing so for deterministic errors is pointless and only wastes resources.

Effective retries thus depend on the error reporting mechanism as well as the error classification policy. As far as mechanism is concerned, both server and client need to capture the concrete causes of failures and faithfully forward fine-grained error information. Whether an implementation uses error codes or exception types doesn't really matter, but it is critical that higher layers do not mask the error information captured by lower layers.

As far as policy is concerned, engineers need to inventory the specific errors occurring in a system and then classify them as retryable or not. A mechanized version of that classifier forms the core of the retry logic.
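
A mechanized classifier of this kind can be very small. The Python sketch below is illustrative only: the function name, the status codes, and the error codes are assumptions chosen for the example and do not reproduce boto3's actual lists.

```python
from typing import Optional

# Illustrative policy only; a real inventory comes from the errors a system actually sees.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})
NON_RETRYABLE_ERROR_CODES = frozenset({"InsufficientFunds", "AccessDenied"})


def is_retryable(status_code: Optional[int], error_code: Optional[str] = None) -> bool:
    """Classify one failed request as worth retrying or not."""
    if status_code is None:
        # No response arrived (for example, a timeout), so the outcome is unknown.
        return True
    if error_code in NON_RETRYABLE_ERROR_CODES:
        # Deterministic failure: asking again yields the same answer.
        return False
    return status_code in RETRYABLE_STATUS_CODES


print(is_retryable(None))                      # timeout
print(is_retryable(503))                       # overloaded server
print(is_retryable(400, "InsufficientFunds"))  # deterministic rejection
```

Note that the classifier only works if the fine-grained error information reaches it, which is why higher layers must not mask what lower layers captured.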

Judging by boto3's code repositories and commit histories on GitHub, AWS engineers significantly deepened their understanding of both retry policy and mechanism over the years. The original version in boto3's runtime botocore implements the so-called legacy mode. It dates back to April 2013 and comprises one module with 360 lines of well-documented Python code.

In contrast, the new version from February 2020 implements both the standard and adaptive modes. It comprises seven modules with 920 lines of well-documented Python code. The primary difference between legacy and standard modes is that the latter classifies many more errors as retryable and thus covers many more endpoints. At the same time, the standard mode also enforces a quota on active retries, thus protecting the client from being overwhelmed by retries in case of, say, a network or AWS outage (which are rare but do happen).

Systems Archeology I: I suspect that AWS engineers started out with retry logic even simpler than what is offered by the legacy mode. When a client exceeds rate limits, S3 returns a 503 HTTP status code with the non-standard reason 503 Slow Down. That's a curious choice of status code because the 5xx status codes indicate server errors, whereas exceeding a rate limit is a client error, which HTTP indicates with a 4xx status code. Furthermore, RFC 6585 defines a directly suitable alternative, 429 Too Many Requests. The standard definition of the 503 status code, 503 Service Unavailable, suggests the reason for this non-standard choice by Amazon engineers: That status code is commonly used when a server is overloaded, an obviously retryable error condition. Hence I wouldn't be surprised if earlier versions of AWS client libraries retried only on 503 and reusing that status code was an expedient way of extending the reach of the retry logic.

1.3 Success Is Likely

In the introduction to this section, I breezed through one last major concern when I wrote “just do it — again,” leaving the timing entirely open. While the previous two subsections didn't mention timing, we nonetheless made one important observation: retrying when the likely outcome is another failure wastes resources. If we extrapolate from that, retrying in short succession over and over again is equally pointless.

Real-world experience with children during long car journeys underlines that point. Pestering the parents with incessant “Are we there yet?” chants does not bring the destination any closer. It may even unnerve the parents to the point of a forced hour-long break, thus delaying arrival. At least, that's how my parents describe past car trips for summer vacation.

The uncomfortable truth is that, in distributed systems, sending too many retries is fundamentally indistinguishable from a denial of service attack. Experienced service providers on the internet do not take kindly to such attacks. Instead they implement drastic countermeasures, up to and including dropping all requests originating from a suspect IP address in the firewall, before the request even gets to any application server. To avoid that, we need to pace ourselves when retrying. Best practice is to delay each retry, use exponential backoff for repeated retries, and to add some degree of variability by introducing randomized jitter.

In my experience, jitter is typically computed as a smallish delta on top of the deterministic exponential delay. Reading the source code for botocore, I discovered that it instead randomly scales the deterministic exponential delay between zero and the full value. I was surprised by that choice. As it turns out, AWS engineers carefully considered the interaction between exponential backoff and jitter, simulating several approaches, and found that “full jitter” was the most effective choice. Amazon's Marc Brooker wrote a great blog post about just that.
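
The difference between the two jitter styles can be sketched as follows. This is an illustration only: the base delay, cap, and spread values are assumptions, and the function names are hypothetical rather than botocore's own.

```python
import random


def exponential_delay(attempt: int, base: float = 0.5, cap: float = 20.0) -> float:
    """Deterministic exponential backoff: base * 2**attempt, capped at `cap`."""
    return min(cap, base * 2 ** attempt)


def delta_jitter_delay(attempt: int, spread: float = 0.1) -> float:
    """Exponential delay plus a smallish random delta on top."""
    delay = exponential_delay(attempt)
    return delay + random.uniform(0, spread * delay)


def full_jitter_delay(attempt: int) -> float:
    """Full jitter: uniformly random between zero and the exponential delay."""
    return random.uniform(0, exponential_delay(attempt))
```

With delta jitter, retries from many clients still cluster around the same exponential schedule. Full jitter spreads them over the whole interval, at the price of occasionally producing a delay close to zero.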

Are we there yet? — No!

2. Limiting the Rate of Tries and Retries

So far, we only considered retrying individual operations in isolation. In such cases, the above three criteria are sufficient for determining when and how to use retries. But when operations occur repeatedly, at more or less regular intervals, and loosely depend on each other, the three criteria aren't sufficient anymore.

Probably the loosest such dependency is a rate limit, which does not impose any ordering constraints but does limit the overall number of operations per time period. Rate limits are critical for ensuring that a few overly aggressive or even malicious clients cannot overwhelm resources shared amongst many more clients. For that reason, they are pervasive across cloud services.

While it is possible to trigger rate limits reliably and predictably by sending a sufficiently high volume of requests per time period for long enough, triggering rate limits for a particular request is exceedingly hard if not impossible. That's good news: we can treat rate limit violations as non-deterministic errors that are subject to retries (of course, only if the operation also is idempotent). But retrying the failed operation does not suffice for making sustained progress. Instead, the rate limit violation serves as a signal that subsequent operations are not currently welcome either, at least for some time period.

Are we there yet? — No!

2.1 The Token Bucket Algorithm

That time period is typically determined by the algorithm used for enforcing rate limits, with the default choice being the token bucket algorithm. It is simple enough to describe in a paragraph:

The name stems from the main data structure, a bucket holding fungible tokens, i.e., a counter. The server maintains a bucket per client and periodically adds a fixed amount of tokens into each bucket, up to some upper limit corresponding to the rate limit. When the server receives a request, it attempts to remove one or more tokens from the client's bucket, with the amount corresponding to the effort necessary for servicing the request. If there are sufficient tokens in the bucket, the server removes them and completes the request. If there aren't, it rejects the request with a rate limit violation.
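
As a rough sketch of that description, here is a single bucket in Python. It makes two simplifying assumptions that are not in the text above: tokens are refilled lazily from the elapsed time rather than by a periodic timer, and the clock is injectable so the behavior can be tested. The class and method names are hypothetical.

```python
import time


class TokenBucket:
    """Server-side token bucket: requests are rejected when tokens run out."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # upper limit on stored tokens
        self.tokens = capacity    # start with a full bucket
        self.clock = clock
        self.last_refill = clock()

    def _refill(self) -> None:
        """Add the tokens accrued since the last refill, up to capacity."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Remove `cost` tokens and return True, or return False to reject."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller reports a rate limit violation
```

A server would keep one such bucket per client and answer a `False` result with its rate limit error.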

Are we there yet? — Now we are!

2.2 How Our Pipeline Failed

Now we have enough context to explain the exact circumstances leading to the failure of our data processing pipeline. As mentioned in the introduction, the pipeline failed during a bulk copy between two S3 volumes. The data being copied is a very large dataset in Parquet's columnar format, which puts (parts of) columns into their own files organized by directories to represent tables.

Many of these directories contain many thousands of files. The particular directory that triggered the rate limit failure contained 4,003 files. At the same time, AWS limits clients to 3,500 DELETE, POST, or PUT requests and 5,000 GET or HEAD requests per second per prefix in a bucket. (“Object,” “prefix,” and “bucket” are official S3 lingo. I'm using “file,” “directory,” and “volume” interchangeably.) There are no limits on the number of prefixes in a bucket.

Since the bulk copy is implemented by copying individual objects and initiates those operations in a tight loop, it can easily exceed the above rate limits. Furthermore, since boto3's retry logic uses full jitter, i.e., randomizes the delay between 0 and the exponentially increasing duration, the retry may occur pretty much immediately after the failure, i.e., with a delay much smaller than the period AWS uses to track S3 rate limits.

But that means that the retry will exceed the same already exceeded rate limit and fail again. While the probability of the retry delay being much smaller than the rate limit period is small and getting smaller as the number of retries increases, the number of copies is sufficiently large so that this event may happen five times in a row. At that point, boto3's legacy retry mode stops retrying and fails the copy operation. That in turn fails our pipeline.
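
To get a feel for that probability, consider a deliberately rough model: each full-jitter delay is uniform between zero and `base * 2**attempt`, and a retry fails whenever its delay is shorter than the rate limit period. The base delay and the period are assumptions here, not values taken from boto3 or S3, and the function names are hypothetical.

```python
def prob_retry_within_window(attempt: int, window: float, base: float = 1.0) -> float:
    """P(delay < window) for a delay uniform on [0, base * 2**attempt]."""
    return min(1.0, window / (base * 2 ** attempt))


def prob_all_retries_within_window(retries: int, window: float, base: float = 1.0) -> float:
    """Probability that every one of `retries` consecutive delays is too short."""
    p = 1.0
    for attempt in range(retries):
        p *= prob_retry_within_window(attempt, window, base)
    return p
```

The per-copy probability shrinks quickly as retries accumulate, but it is multiplied by the number of copies in flight, which is why a directory with thousands of files can still hit the unlucky streak.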

Systems Archeology II: In the first systems archeology above, I took note of an unusual choice of HTTP status code to speculate about earlier retry logic and an expedient engineering decision by AWS engineers. This time, I am basing my speculation on a note in the official documentation for S3. The section on “Optimizing Amazon S3 performance” states: “You no longer have to randomize prefix naming for performance.” If users had to randomize prefixes before, that implies that AWS was partitioning the data stored in buckets by some prefix of the prefix / directory name.

Nowadays, however, AWS can partition data at the granularity of a single prefix / directory name. Since it supports more flexible naming schemes, that certainly is a boon for users. At the same time, partitioning at such fine granularity must require S3's implementation to maintain a massive amount of state. I suspect it is just for that reason that the same documentation section later states: “Amazon S3 automatically scales in response to sustained new request rates, dynamically optimizing performance.” In other words, S3 will partition a bucket by individual prefix, but only if necessary for a given access pattern. That is an impressive engineering achievement!

2.3 Fixing Our Pipeline

Conveniently, the carefully staged explanation of how exactly our pipeline failed also describes everything we need for avoiding just that kind of failure. Even more conveniently, AWS engineers already implemented the solution and then some — through the adaptive retry mode. The solution is based on the realization that, for bulk operations that exceed rate limits, it isn't sufficient to retry the failing request. Instead we need to slow down the entire bulk operation and throttle any future requests, i.e., issue them at a slower rate.

We already know how to do that: use the token bucket algorithm! The only difference when using the algorithm on the client for throttling is that we don't fail requests when there aren't enough tokens in the bucket, but rather we wait until there are enough tokens. That's just what boto3's adaptive retry mode does. Well, with one significant additional feature: Instead of requiring that user code configures the maximum request rate, the adaptive retry mode automatically adjusts the maximum capacity of the bucket based on successful as well as unsuccessful outcomes.
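
The waiting variant can be sketched as follows. This is not botocore's implementation: the rate and capacity are fixed, whereas the adaptive mode adjusts them from observed outcomes, and the class name, the injectable clock, and the injectable sleep function are assumptions made so the sketch is testable.

```python
import time


class ClientThrottle:
    """Client-side token bucket: callers wait for tokens instead of failing."""

    def __init__(self, rate: float, capacity: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # upper limit on stored tokens
        self.tokens = capacity
        self.clock = clock
        self.sleep = sleep
        self.last_refill = clock()

    def acquire(self, cost: float = 1.0) -> float:
        """Block until `cost` tokens are available; return the time waited."""
        if cost > self.capacity:
            raise ValueError("cost exceeds bucket capacity")
        waited = 0.0
        while True:
            now = self.clock()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= cost:
                self.tokens -= cost
                return waited
            # Sleep just long enough for the missing tokens to accrue.
            pause = (cost - self.tokens) / self.rate
            self.sleep(pause)
            waited += pause
```

Calling `acquire()` before every request in a bulk operation paces the whole loop, not just the request that happened to fail.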

I'm sure you are as excited as I was when I discovered the retry mode and are itching to enable it for your own Python-based data processing pipelines. Thankfully, it doesn't take much: All you need to do is pass an appropriate configuration object to boto3's client() or resource() function via the config named argument. The only slightly tricky part is that you need to use botocore's configuration object:

<div class="code-wrap"><code>import io

import boto3
import botocore.config

# Access S3 with adaptive retries enabled:
ADAPTIVE_RETRIES = botocore.config.Config(retries={
    "total_max_attempts": 4,  # 1 try and up to 3 retries
    "mode": "adaptive",
})
s3 = boto3.resource("s3", config=ADAPTIVE_RETRIES)

# Test adaptive retries: Can we break camel's back?
# (Outside us-east-1, create_bucket also needs a CreateBucketConfiguration.)
bucket = s3.create_bucket(Bucket="camel")
for index in range(10000):
    straw = f"straw #{index}"
    # upload_fileobj expects a binary file-like object, not raw bytes:
    bucket.upload_fileobj(io.BytesIO(straw.encode("utf8")), straw)</code></div>

Code example for configuring boto3 to use adaptive retry mode

2.4 Beyond AWS

I apologize if the code example seems underwhelming. AWS engineers did all of the hard work here already; we just need to enable it. However, if you are looking for a coding challenge, consider what it would take to provide similar functionality for another cloud service. A likely source of frustration is the need for implementing similar algorithms for retries and rate limits for the server as well as the client libraries in every language.

In fact, that's already happening for AWS. Several of the client libraries for other programming languages as well as the Android and iOS SDKs support automatic retries for authentication errors caused by clock skew. However, boto3 does not. The corresponding issue has been open for four years now.

That raises the question of whether we can do better. My answer is an emphatic “possibly maybe.” I base that decisive assessment on the observation that exponential backoff with jitter and the token bucket algorithm are both techniques for best guessing an appropriate delay for retries based solely on whether requests succeeded or failed. Yet the server generating these responses knows quite a bit more. Since it enforces rate limits, it knows their basic configuration and it knows their client-specific state.

Hence the above question becomes: If we expose some of that server knowledge to clients, can we dispense with token buckets? My answer remains an emphatic “possibly maybe.” HTTP already includes the Retry-After header to provide a hint for the retry delay and an IETF draft standardizes several more headers that expose even more information.
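
For reference, a client can turn a Retry-After value into a delay along these lines. The header carries either a number of seconds or an HTTP-date; the helper name `retry_after_seconds` and the choice to return `None` for unparseable values are assumptions of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value into a delay in seconds.

    Returns None if the value is neither a number of seconds nor a date.
    """
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry right away.
    return max(0.0, (when - now).total_seconds())
```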

But the Retry-After header applies only to a single operation and thus becomes meaningless under sustained load, much like exponential backoff is insufficient under sustained load. Ben Ogle wrote a blog post exploring just that question and ended up showing that the token bucket algorithm works really well. But that outcome also was predictable because he applied the Retry-After header only to the next retry for that request. In short, the question of whether we can design a more general header remains open.

In Conclusion

We explored retries as a simple and effective fault tolerance mechanism in significant detail. While retries apply to many cases, they aren't applicable in general. First, retries are only safe for idempotent operations. Second, they can only change the outcome upon non-deterministic failures, which include transient, rate-limit, and time-skew errors. Third, retries must be delayed, typically through exponential backoff with randomized jitter, to avoid inadvertently mounting a denial-of-service attack on the server.

When operations occur at regular intervals, retrying individual operations isn't enough. Instead, we may have to throttle the request rate across all operations. This blog post used AWS and its client library for Python as running examples. Notably, it included source code for configuring throttled operation. But the techniques described in this blog post are general and supported by many other client libraries. Now that you understand all the necessary background, it is your turn to try retries — or to retry them again!

At the same time, the single most important take-away from this blog post is not the particulars of retries, rate limits, and AWS’ exemplary support for them but rather an understanding that handling failure in distributed systems is genuinely hard — you are about to finish a 3,900 word article about “doing it again” — and critically depends on the exact semantics of error conditions. To put that differently, getting the normal code path working is easy. That’s also when the real engineering work starts!

Enigma Haus - A New Look at the Future of Remote Working Collaboration

Blair Dawson — Tue, 07 Sep 2021 00:00:00 GMT

A positive workplace culture has always been one of my top priorities when job searching. But it’s hard to get a sense of company culture in interviews. Companies often tout free snacks, ping pong tables, and unlimited coffee. But I think the real culture of a company is something you feel once you’re in the day to day, interacting with colleagues.

I was hesitant to start the interview process for a marketing manager position at Enigma: the company was based in New York City and I would be working remotely from Arizona. Was it possible for Enigma to keep an energized and engaging culture in a fully remote world?

One month into my new job, I can tell you: I’m a believer.

Work from Anywhere

Like many companies, Enigma was forced to shut its office doors in March 2020 and took the pandemic as an opportunity to rethink what its future of work looks like.

Enigma released a new “work from anywhere” policy. The company would retain office space in New York for those who want to work there, but it was no longer required. “It was important to me to lead by example and make sure our team knows we mean what we say,” said Hicham Oudghiri, Enigma’s CEO. “So my family and I permanently moved out to Los Angeles.”

Enigma remains dedicated to maintaining culture with a remote workforce. They found one creative way to do that, once it was safe to resume travel: “Enigma Haus.”

The Enigma Haus

The concept for Enigma Haus started on a call, when Hicham and a few other Enigmites shared they wished they could be back together in person. The office building was still closed, so options were limited. Hicham decided to book an Airbnb in Los Angeles and a few vaccinated team members flew out to join him. “We had such a great time and accomplished a lot face to face, so we decided to roll out the concept more broadly,” Hicham told me.

With a growing team increasingly spread out across the country, Enigma leaders decided to further test the concept. They invited the full company to gather in a central location for a few days to brainstorm in person, build relationships, and have fun.

Inside Enigma Haus NYC

I arrived in the Big Apple on a cloudy, 70-degree Tuesday afternoon. I hadn’t been to New York since high school. It was exciting to see the tall buildings again and hear the hustle and bustle of big city life. The streets felt so alive, and you could tell people were excited to start returning to their everyday commutes to work, shopping, and hugging old friends.

When I arrived at the townhouse that would be the company’s home base, I was greeted by VP of People, Stephanie Spiegel, who coordinated all of the thoughtful details throughout the week. She described the Enigma Haus experience as “a homecoming of sorts, seeing colleagues I hadn't seen in person in over a year and meeting new ones for the first time. It was going from a 2D screen to a real person standing in front of me.”

Each day consisted of working hours where team members could drop in and work from the Haus. Every evening there was an event: a magician, a jazz trio, and, of course, dinners. My schedule was filled with in-person meetings with my team, great food, and a lot of fun.

I got to meet countless coworkers who I had only ever met, Brady Bunch style, through the small squares of a Zoom call. While video calls certainly help to get to know your teammates, spending time together in person just can’t be matched.

So often, video calls get right down to business and end with people having to cut things short to jump to their next call. You miss out on the small opportunities for connection that happen more naturally in person. Hearing what people did over the weekend, learning about people’s families, or just enjoying a lunch together goes a long way in creating meaningful connections. The Haus allowed people to come and go as they pleased and truly created an atmosphere of collaboration and energy. I headed back to Arizona feeling more connected to my teammates and Enigma's mission.

The recent rise of COVID-19 variants has meant postponing or changing locations for our upcoming gatherings. But when it’s safe again, I can’t wait to reconvene Enigma Haus in person.

The Future of Work at Enigma

Enigma is setting a precedent for the future of work by investing in our people and creating a culture that energizes and excites people. That means things like a monthly wellness stipend, small team and company retreats, and opportunities to come together at Enigma Haus locations across the country.

If this sounds like your kind of culture, check out our job openings. We’re always looking for talented engineers, data scientists, and more.

How to Identify a Future Unicorn (Before Your Competitors)

Enigma — Thu, 12 Aug 2021 00:00:00 GMT

Halfway through 2021, small business lenders and investors are all asking themselves the same question: “How can I identify opportunities for growth without increasing risk?”

The U.S. economy is rebounding rapidly with 6.5% growth in the second quarter of 2021. For lenders, the recovery represents an opportunity to expand their portfolio. Lenders who are the first to identify growing businesses will be poised to gain market share.

However, the dearth of reliable data about small businesses makes early identification of growing businesses a challenge. By the time there are publicly available markers of growth, like press releases or fundraising rounds, you’ve lost the edge and competition has increased.

Revenue growth can be a leading indicator of small business growth. To highlight how this data can be used by both small business lenders and institutional investors, we looked at two companies in different industries and growth stages.

Company 1: Direct-to-consumer apparel

Company 1 is a direct-to-consumer online apparel company that saw slow but steady growth from 2017 to 2019. In the second half of 2019, we see a sharp spike in revenues aligning with the winter holidays. In Q1 2020, monthly card revenues were almost 100% higher year-over-year. At the end of 2020, they again hit record revenue growth and raised $40 million in funding. Enigma’s card revenue data was able to show an acceleration in revenue growth more than a year before the fundraising round.

Company 2: Alcohol delivery start-up

Company 2 is an early-stage alcohol delivery start-up founded in 2019. The company didn’t see much growth in their first year as they built out their team and product. In the second half of 2020, however, revenues rose rapidly as the company was well-positioned to meet new demand triggered by the pandemic. In June of 2021, the company raised another round of funding. Enigma was able to flag a sharp uptick in revenue growth significantly earlier, enabling strategic relationship building and credit line increases to occur before competition arrived.

Incorporating card revenues and growth data into your process

The above two examples show how an acceleration in card revenues growth can be a leading indicator of business success and funding events. The best way to incorporate this data depends on your use case and current process.

Small business lenders often find this data is orthogonal to their existing sources and integrate it into their models. Investors may find it more helpful to use triggers and identify businesses whose growth rate crosses key thresholds.

Get in touch to discuss how Enigma’s revenues and growth data can help your organization.

4 Insights About Small Business Performance During the Pandemic

Enigma — Tue, 03 Aug 2021 00:00:00 GMT

Key findings from our sample of 2.7 million U.S. small businesses:

More than 50% of the largest businesses in our sample saw revenues grow during the pandemic
Very small businesses were 6x more likely to cease operations during the pandemic, compared to larger businesses
17% of very small businesses ceased operations during the pandemic
Less than 1% of the largest businesses in our sample ceased operations during the pandemic

Which small businesses suffered most?

As part of Enigma’s ongoing research into the health of the small business economy, we analyzed how small businesses of different sizes fared during the pandemic. What did we discover? The smallest businesses suffered the most.

The very smallest businesses were six times more likely to cease operations during the pandemic, compared to larger businesses. Conversely, more than 50% of the largest businesses in our sample actually saw revenues grow during the pandemic.

Which businesses survived and which businesses thrived?

Only about 30% of the smallest businesses in our sample were able to grow card revenues during the pandemic. We anticipated that the smallest businesses would be hit hardest, but were still surprised by the extent of the damage.

More than 17% of the smallest businesses in our segment ceased operations entirely during the pandemic. Conversely, less than 1% of larger small businesses (those with more than $300,000 annual card revenue before the pandemic) ceased operations entirely during the pandemic. Smaller businesses that stayed open still saw a much steeper decline in revenue, on average, than their larger peers.

We also examined how many businesses saw their card revenues grow during the pandemic. The larger a business was, the more likely it was to see revenues grow. Only 37% of the largest businesses saw revenue drop during the pandemic.

Looking ahead

Even among micro businesses, a substantial portion of businesses were able to increase revenues in 2020. Identifying businesses that have weathered the last twelve months and are now poised to grow will be a top priority for lenders in 2021.

Our methodology

We looked at a sample of 2.7 million private businesses across the United States in our Merchant Transaction Signals dataset. For each of these businesses, we analyzed monthly revenues based on aggregated, anonymized credit and debit card transactions.

We segmented businesses into six groups based on their average annual card revenue from March 2019 - February 2020, the twelve months before the Covid-19 pandemic took root in the United States. More than half of our sample businesses had less than $300,000 in annual card revenues in this 12-month period.

We then compared the monthly growth or decline of card revenue at each business over the 24 months from March 2019 to March 2021. If transactions disappeared entirely for more than 3 months, we assumed the business had ceased operations.
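
The cessation rule can be sketched in a few lines. This is an illustration under stated assumptions, not Enigma's actual pipeline: it takes one revenue figure per month, treats zero as "no transactions," and flags any run of more than three such months, even if revenue later returns. The function name is hypothetical.

```python
from typing import Sequence


def has_ceased_operations(monthly_revenue: Sequence[float], max_gap: int = 3) -> bool:
    """Return True if revenue is zero for more than `max_gap` consecutive months."""
    gap = 0
    for revenue in monthly_revenue:
        gap = gap + 1 if revenue == 0 else 0  # reset the run on any revenue
        if gap > max_gap:
            return True
    return False
```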

The Future of Work

Stephanie Spiegel — Tue, 23 Mar 2021 00:00:00 GMT

Over the past year the ways in which we work changed dramatically. In an instant we all became remote. We had to learn new ways to structure our time, new ways to be productive, and new ways to feel connected to our colleagues. We still struggle in this new way of working and miss the togetherness of being in an office, but we recognize the seismic shift in the way work is done.

At Enigma, we’ve committed to support our employees to the fullest. We spent time listening to the new needs of everyone across the company.

We launched surveys to gather more data about which tools and processes served us well and which needed to go. We consolidated this into a complete picture of how we can support everyone’s needs around remote work.

How We Got Here

Our approach to how to organize ourselves in this new work environment arose from the trust we have built together. Trust that we have each other's back, that we remain focused, collaborative, and can perform the data science that is our mission, all while being miles and miles apart. This trust has positioned us in a place of strength.

In a sense, the possibility of working remotely from wherever you may choose is liberating. But on the other hand, almost every interaction we have with our colleagues underscores the feeling of how much we miss each other, how there is no comparison to working side by side, in real life.

Going Hybrid-Remote

So how do we balance these things? In our opinion there is an approach that is simply less prescriptive. An approach that embraces autonomy, flexibility, and fundamentally puts our people first by giving them freedom of choice alongside the infrastructure they need.

What does the future of work look like at Enigma? It starts with a few concrete things:

We will always have an office in NYC, large enough to accommodate everyone and open as soon as it is safe to return to work.
Anyone can choose to come to the office or be remote; that choice is up to you and we will all strive to support each other equally.
Neither choice is binary or set in stone. Go remote for a month if you'd like, get that Airbnb in Barcelona or New Mexico. Go to the office three days a week, even if you're in NYC.
If clusters form in new geographies, we will open small offices there to accommodate in-person work.

That's Not All

A lot more goes into getting this right and supporting the evolution of how we work. Some considerations (there will certainly be more over time as we learn):

We will have in-person events for everyone to attend. We are committed to getting together in person multiple times a year and will pay for the travel for those who are remote.
We are explicit about our responsibilities towards each other, like the time zones we expect to band around, and expectations that each team recognizes when it is important to get together in-person and travel for certain kinds of work.
We recognize that life during COVID and working remotely takes a mental toll. We’ve invested more into employee wellness, both physical and mental, and have made sure that everyone has somewhere where they are "set up" appropriately to work (whether at home or a co-working space near them):*
All employees receive a $100 monthly wellness stipend that can be used for continued wellness.
All employees enrolled in one of our health care plans receive complimentary membership to One Medical.
All employees receive complimentary access to Headspace, Spring Health, Aaptiv, Carrot, and Quit Genius apps.

To sum it up: we will always have an office, but also want to support our employees to plan their lives without strict geographic limitations. We are 100% committed to our team having this freedom at Enigma and invested in making it work for us all.

It's going to be fun. We will have new opportunities and a broader reach. Challenges will arise, but we will work together to redesign the future of work by balancing togetherness and freedom.

We’d love for you to join us. If our vision of the future of work aligns with yours - check out our open roles.

*All of our benefits and perks won’t fit into this blog post! Learn more about Enigma’s robust benefits and perks.

Comparing SMB Health Across 50 U.S. Cities

Enigma — Thu, 18 Mar 2021 00:00:00 GMT

Across the country, 2020 was a difficult year for small businesses. But as the pandemic tore through the United States, it didn’t impact every city equally. Some areas saw higher rates of infection. Some states imposed stricter regulations on small businesses as they tried to limit the spread of COVID-19.

At Enigma, we were interested in understanding how small businesses fared across the country. We analyzed key health metrics for millions of businesses in 50 U.S. cities for the full year of 2020. Our findings? While businesses everywhere were hard-hit, some cities were hit almost twice as hard as others.

Many common economic indicators, from GDP to the performance of the S&P 500, hide what’s happening with small businesses. We leveraged the power of Enigma’s Merchant Transaction Signals to look at card revenue trends on a business-by-business level. Card revenue trends are especially powerful for consumer-facing businesses like restaurants and retail stores - industries highly impacted by the pandemic.

We examined the 50 cities with the highest number of small businesses, and looked at three key indicators:

How many businesses stopped operating?
How many businesses had growing revenues in 2020 compared to 2019?
How many businesses had declining revenues in 2020 compared to 2019?

Full list of cities at bottom of article.

How many businesses stopped operating?

We started by looking at how many businesses stopped operating in 2020. To do this, we looked at businesses that had some revenue during 2020 but did not have any revenue during the final 3 months of the year.

In every city we analyzed, at least 20% of businesses stopped operating. In Chicago, more than one-third of all businesses ceased operations in Q4, either temporarily or permanently.

Around 40% of businesses saw revenues grow

Next, we looked at how many businesses experienced revenue growth in 2020. We found that no area escaped unscathed - across all 50 cities, a minority of businesses saw card revenues grow in 2020 compared to 2019. In the city with the fewest growing businesses, San Jose, only 31% of businesses saw revenues grow. Across New York City, around 45% of businesses saw card revenues grow.

A Q4 recovery, especially in the sun belt

The silver lining? Almost all cities saw an uptick in revenues in the final quarter of 2020. Sunny cities in areas with low levels of pandemic restrictions led the pack. In Orlando, Miami, Tucson, and Phoenix more than 60% of businesses saw card revenues grow in the final three months of the year compared to the previous 3 months. Houston, Tampa, and San Antonio were just behind with around 58% of businesses showing card revenue growth.
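As a rough sketch of the comparison above, the share of businesses whose card revenue grew in the final quarter can be computed as follows. The function names, the 12-month list layout, and the sample city are assumptions for illustration only.

```python
def grew_quarter_over_quarter(monthly_revenue):
    """Compare October-December revenue with July-September revenue."""
    q3 = sum(monthly_revenue[6:9])   # July, August, September
    q4 = sum(monthly_revenue[9:12])  # October, November, December
    return q4 > q3


def share_growing(city_businesses):
    """Fraction of businesses in a city whose Q4 revenue exceeded Q3 revenue."""
    flags = [grew_quarter_over_quarter(series) for series in city_businesses]
    return sum(flags) / len(flags)


# Five hypothetical businesses in one city (made-up monthly figures).
example_city = [
    [100] * 9 + [120] * 3,   # grew
    [100] * 9 + [80] * 3,    # declined
    [50] * 6 + [60] * 3 + [90] * 3,  # grew
    [200] * 12,              # flat, not counted as growing
    [10] * 9 + [30] * 3,     # grew
]

city_share = share_growing(example_city)
```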

2021: a dynamic time for small businesses

Our analysis shows how diverse the effects of 2020 were on small businesses. Around one-fifth of businesses ceased operations entirely, while 40% actually saw revenues grow.

With economic recovery on the horizon, in 2021 it will be crucial for lenders to understand which businesses are poised to grow and in need of capital. To learn more about how Enigma can help, get in touch.

Finding Early Warning Signs of Business Bankruptcy

Enigma — Fri, 05 Mar 2021 00:00:00 GMT

What if you could predict when a company was going to go bankrupt, months in advance? Monthly revenue data can help you do exactly that - identify businesses in distress long before they default on payments. We used Enigma’s bankruptcy data and merchant transaction data to highlight how revenue trends can be a leading indicator of business distress. Keep reading to learn what we discovered.

A case study of three bankruptcies

Bankruptcies are a clear sign that a business is distressed. However, by the time a company files for bankruptcy, it’s often unable to pay back many of its debts. This is why most risk professionals supplement bankruptcy data with other data that can provide early alerts of distress.

Below, we highlight three businesses that filed Chapter 7 bankruptcies in December 2020. The businesses were in different cities and different industries, but all were hit hard by the pandemic. For each of these businesses, we found that card revenues were a leading indicator of their distress.

Business 1: A fitness center in Dallas

According to corporate registration data, Hot Bodies Gym* was founded in 2003. Long before the pandemic hit, card revenues were steadily declining. In 2017, their average monthly card revenue was around $380,000. By the end of 2019, it had fallen to around $200,000. Once the lockdowns began in 2020, it was down to a little over $100,000. Although the gym didn’t file for bankruptcy until December 2020, the revenue data show a business clearly in decline for several years prior.

Business 2: A restaurant in Austin

Darlene’s Diner* opened in 2018 in Austin, TX, and saw steady card revenue growth in its first year. However, starting in November 2019, card revenues began to steeply decline. Just before Austin introduced its first stay-at-home order in March 2020, average monthly card revenue was only 40% of what it had been a year earlier.

Though Darlene’s made a small recovery during the warm summer months, it was never able to reach its early revenue numbers. In October the restaurant closed its doors and in December it filed for Chapter 7 bankruptcy. Looking at the card revenues data, we can see that the business was distressed a full year earlier.

Business 3: A dry cleaner in Los Angeles

So Fresh Dry Cleaners* in the Los Angeles metro area first registered as a sole proprietorship in 2015. Throughout 2018 and 2019, monthly card revenues declined slightly but were mostly stable, between $45,000 and $55,000. When the pandemic hit, monthly revenues cratered and never rose above $22,000. In December 2020, the company filed for bankruptcy.

Lagging indicators are not enough

As the above examples show, lagging indicators can leave lenders exposed. Often by the time a business defaults on payments, it has already ceased operations.
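One simple way to turn the pattern in these case studies into an early alert is a year-over-year revenue ratio. The sketch below is a hypothetical illustration: the 0.5 threshold, the function names, and the sample figures are assumptions, not Enigma's scoring method.

```python
def year_over_year_ratio(revenue_by_month, year, month):
    """Ratio of a month's card revenue to the same month one year earlier.

    revenue_by_month maps (year, month) tuples to revenue (assumed layout).
    Returns None when either month is missing or the prior-year value is zero.
    """
    current = revenue_by_month.get((year, month))
    prior = revenue_by_month.get((year - 1, month))
    if current is None or not prior:
        return None
    return current / prior


def is_distressed(revenue_by_month, year, month, threshold=0.5):
    """Flag a business whose revenue fell below `threshold` of last year's level."""
    ratio = year_over_year_ratio(revenue_by_month, year, month)
    return ratio is not None and ratio < threshold


# Made-up figures echoing the restaurant example: revenue at 40% of the prior year.
diner = {(2019, 2): 50_000, (2020, 2): 20_000}
ratio = year_over_year_ratio(diner, 2020, 2)
flagged = is_distressed(diner, 2020, 2)
```

A lender would likely tune the threshold by industry and combine it with other signals, since seasonal businesses can show large swings without being in distress.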

Get in touch to discuss how Enigma’s revenues data can help your team mitigate risk.

*All business names have been changed to protect privacy.

Which Lenders Have the Highest Performing “PPP Portfolios”?

Enigma — Fri, 05 Feb 2021 00:00:00 GMT

Every lender wants to know how their portfolio compares to their peers, but typically it’s a struggle to get this kind of competitive intelligence. Using Paycheck Protection Program (PPP) loans as a proxy, we analyzed how lenders’ small business portfolios fared in 2020 - read on to discover the top 50 rankings.

The analysis

In December 2020, the SBA released a trove of data about the PPP loans, including loan recipients, loan amounts, and lenders. We combined that data with our Merchant Transaction Signals to examine the performance of 706,000 businesses in our database that received PPP loans from the 50 lenders with the highest volumes of loans. We looked at card revenue growth at each of the businesses and used this analysis to create a comparison of the performance of each bank’s “PPP portfolio.”

Everybody suffered, but some more than others

2020 was a tough year for small businesses. Unsurprisingly, our data showed that in every lender’s portfolio a majority of PPP loan recipients saw card revenues (the amount of revenue a business receives from credit and debit card transactions) decline. Across all 50 lenders, only 35% of these PPP recipients saw card revenues grow in 2020. By comparison, 60% of the same businesses had growing card revenues in 2019.

Digging into the data, performance of PPP recipients varied substantially across lenders. The best performing lender was BancorpSouth Bank, with 45% of PPP recipients showing card revenue growth. The worst performer, Citibank, had only 29% of PPP recipients that grew card revenues in 2020.
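The per-lender comparison can be sketched as a simple grouped share. The tuple layout, lender names, and figures below are invented for illustration and do not reflect the actual dataset.

```python
from collections import defaultdict


def share_growing_by_lender(recipients):
    """Share of each lender's PPP recipients whose card revenue grew year over year.

    recipients: iterable of (lender, revenue_2019, revenue_2020) tuples (assumed layout).
    """
    totals = defaultdict(int)
    growing = defaultdict(int)
    for lender, revenue_2019, revenue_2020 in recipients:
        totals[lender] += 1
        if revenue_2020 > revenue_2019:
            growing[lender] += 1
    return {lender: growing[lender] / totals[lender] for lender in totals}


# Hypothetical recipients for two made-up lenders.
recipients = [
    ("Lender A", 100, 120),
    ("Lender A", 200, 150),
    ("Lender B", 80, 70),
    ("Lender B", 90, 60),
    ("Lender B", 50, 55),
    ("Lender B", 40, 30),
]

shares = share_growing_by_lender(recipients)
# Rank lenders from best to worst performing portfolio.
ranking = sorted(shares, key=shares.get, reverse=True)
```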

Bigger isn’t always better

What kind of lenders had the best performing PPP portfolios? The top 10 was dominated by regional banks with less than $30B AUM: institutions like Glacier Bank, BancorpSouth, First Interstate Bank, and United Community Bank. Conversely, four of the five largest U.S. banks - Citibank, Bank of America, Wells Fargo, and U.S. Bank - were in the bottom 10.

Part of this may be due to a sacrifice of quality for quantity. We found a negative correlation between how many PPP loans a lender processed and the performance of their portfolio. While the top 10 performing lenders granted a median of 4,376 loans to businesses in our database, the worst 10 performing lenders granted a median of 16,219 loans.

Want to get a deeper look into the financial health of your portfolio and loan applicants? Get in touch to learn more about how Enigma’s data can help.

Introducing Merchant Transaction Signals

Enigma — Mon, 01 Feb 2021 00:00:00 GMT

In a year of record volatility, lenders have struggled to access timely and reliable data about small businesses. Enigma’s Merchant Transaction Signals address this gap, providing key signals about the financial health and growth of small and medium businesses across the U.S.

The data

Enigma has partnered with leading banks and payment processors to aggregate and analyze a panel of hundreds of millions of anonymized credit and debit cards. Using proprietary entity resolution techniques, we’ve transformed these raw transactions into leading indicators of growth and risk at more than 10 million U.S. small and medium businesses.

Available a la carte or as a bundle, these attributes provide a clear picture of business health:

Card Revenues - Reports on the average monthly revenue a business receives from credit and debit card transactions.
Card Transactions - Reports on the average monthly card transactions at a business over the previous one-month, three-month, and twelve-month periods.
Customer Counts - Reports on the average daily number of customers at a business over the previous one-month, three-month, and twelve-month periods.
Revenue Growth - Reports on the average growth rates of a business’s card revenue, both over the previous three-month period and adjusted for seasonality.
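To make the windowed attributes above concrete, here is a minimal sketch of how trailing averages and a three-month growth rate could be derived from a monthly series. The formulas are assumptions for illustration; the source does not specify how Enigma computes these attributes, and the seasonally adjusted variant is not shown.

```python
def trailing_average(monthly_values, months):
    """Average of the most recent `months` values in a monthly series."""
    window = monthly_values[-months:]
    return sum(window) / len(window)


def three_month_growth(monthly_revenue):
    """Growth of the latest 3 months relative to the 3 months before them."""
    recent = sum(monthly_revenue[-3:])
    previous = sum(monthly_revenue[-6:-3])
    return (recent - previous) / previous


# Twelve months of made-up card revenue, oldest first.
monthly_revenue = [100] * 9 + [110, 120, 130]

avg_1m = trailing_average(monthly_revenue, 1)    # latest month only
avg_3m = trailing_average(monthly_revenue, 3)    # latest quarter
avg_12m = trailing_average(monthly_revenue, 12)  # full year
growth_3m = three_month_growth(monthly_revenue)
```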

How can I use this data?

You can use any of these attributes on their own or combined with Enigma’s other data about business identity and operations.

Risk and underwriting teams can gain deep visibility into an applicant’s financial position before they submit paperwork. Set smarter initial credit limits and more accurately predict a business’s future spend. With data that’s refreshed each month, you can more closely monitor risk and take actions to mitigate damage before a delinquency event occurs.

Merchant Transaction Signals can also help marketing and sales teams accelerate their customer acquisition. From ridding your lead database of shuttered businesses to identifying your best prospects, up-to-date intelligence on a business’s growth helps you maximize the ROI on your direct marketing and prospecting.

What makes our data unique?

Fresh - Our data provides signal about business health months before traditional credit scores, and is refreshed every month.
Coverage - We cover more than 10 million businesses - and our coverage is constantly growing. Our coverage is especially strong for volatile industries like retail and restaurants.
Accuracy - Our attributes are derived directly from the purchases made at a business, rather than estimated from loose firmographics.
Seamless - Streamlines your customer experience and reduces the burden on applicants - no need to ask customers to submit additional paperwork or integrate their accounts.
Versatile - Empower your marketing team with the same data as your risk team. From list enrichment to API, we deliver the data in the way that makes the most sense for each team’s workflows.

Interested in learning more? Get in touch to request a sample.

Related download: Merchant Risk Overview for Payments Providers (PDF)

How to Enrich a List of Businesses With Enigma’s Batch Upload

Enigma — Tue, 08 Sep 2020 00:00:00 GMT

We’re excited to share a new way to access Enigma’s small business data. Batch Upload makes it easy for anyone to access Enigma’s vast data, from Business Closures to Corporate Registrations, without directly interacting with the Enigma Businesses API.

By uploading a CSV, you can rapidly enrich an entire list of businesses. Use Batch Upload for workflows that require information on a group of businesses, whether it’s a subset of your portfolio that you’re periodically monitoring for a specific signal, or a new leads list that you’re enriching to inform campaign segmentation.

Batch Upload is one of the simplest ways to get more data about your small businesses. Let’s see just how easy it is to use this tool:

Prep your CSV file

First, you need to populate your CSV with basic business information such as business name, address, and associated persons. This information will need to be formatted with specific column headers: business_name, street_address1, street_address2, city, state, postal_code, first_name, last_name. You can also use our sample CSV file, which is pre-formatted with the necessary headers.
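If you are generating the file programmatically, a short script along these lines can produce a CSV with the headers listed above. The sample row, the helper name, and the column order are assumptions; check the sample CSV file for the exact layout the tool expects.

```python
import csv
import io

# Column headers as listed in this post.
HEADERS = [
    "business_name", "street_address1", "street_address2", "city",
    "state", "postal_code", "first_name", "last_name",
]


def build_batch_csv(rows):
    """Render a list of dicts as CSV text; missing fields are left blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up business used purely as an example row.
rows = [
    {
        "business_name": "Example Bakery LLC",
        "street_address1": "123 Main St",
        "city": "Springfield",
        "state": "IL",
        "postal_code": "62701",
        "first_name": "Jane",
        "last_name": "Doe",
    }
]

csv_text = build_batch_csv(rows)
```

To save the result to disk, write `csv_text` to a file opened with `newline=""` so that line endings are not altered.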

Upload the file and configure

Once your file is ready, go to the Batch Upload tool in Enigma console and follow the steps to upload the file via drag and drop. This will kick off a few customization options for you:

Query Parameters allow you to determine match thresholds, the number of matches returned, and whether or not you want non-matches to be returned. Read more about the implications of query parameters here.
Premium Attribute Selection enables you to pick the premium attributes with which you want to enrich your list. Our premium attributes include Detailed Industry Classification, near real time Business Closures, and WARN Act Notifications (employee layoffs) — you can explore our entire catalog for more details. If you only want basic attributes, just select “Skip and Submit”.

After you’ve set your configurations, hit “Submit Batch”.

Retrieve and dig in

When your file is ready, you’ll receive an email that will link you to the enriched CSV. In the interim, you can always check on your file status by reviewing the Batch Activity tab within Batch Upload.

These simple steps are all that’s required to get fresh, trusted data appended to your list of small businesses. If you have any questions, feel free to reach out. To get started with Batch Upload, log into your Enigma account, or create your free account here.

50% of Small Business Lenders Don’t Think This Recession Will End By August 2021

Enigma — Wed, 26 Aug 2020 00:00:00 GMT

Small business financial services providers face a very different economic reality than they did at the beginning of this year. The COVID-fueled recession has resulted in a new climate for small business credit and lending, which continues to evolve rapidly depending on factors such as public health developments, government policies, stimulus activities, and more.

In times of uncertainty, it’s helpful to know what your peers are thinking. With this in mind, we connected with around 30 senior leaders in small business risk, credit, products, underwriting, and beyond (names and institutions have been withheld to allow for full candor). Nearly 50% of participants are from community banks — the local go-tos for many small businesses — who are bearing witness to what’s happening on the ground right now. We also heard from voices representing top 50 U.S. banks and regional banks, as well as fintechs and online lenders.

Discover what these senior leaders are saying about the recession’s impact on small business credit and lending today, and what they predict for the future.

Lending or credit extension activities have not slowed dramatically in this recession.

While some organizations experienced initial slowdowns in small business lending and credit earlier this spring, the majority of institutions have since rebounded. A few banks mentioned record lending activity, especially with SBA backed loans (both inclusive and exclusive of PPP).

A leader at one of the largest SBA lender banks in the US stated, “We stopped our ‘traditional’ SBA lending in April to focus solely on PPP loans. We stopped our PPP loan process in May and resumed our regular SBA program. We funded almost as many SBA loans in 2 months of Q2 as we did in 3 months of Q2 in 2019. Q3 looks very strong for SBA lending.”

Another community bank leader mentioned his $3B bank has extended over 7,000 PPP loans - quite a feat for a bank of this size.

Where has small business lending slowed down? Online lenders. All 3 leaders from online lending organizations described huge reductions in lending and credit activities. “We are not lending any more money, even to existing customers at the moment,” mentioned one respondent. Another stated, “It has forced an almost complete shutdown of credit to SMBs.”

Aside from online lenders, most institutions are maintaining or increasing their engagement with small businesses despite the recessionary climate. However, some respondents mentioned more cautionary practices, which brings us to our next learning:

Small business credit and lending standards are tightening.

More than half of the leaders predicted that lending practices will tighten in this new economic environment — and several people mentioned that their credit and loan standards have already increased.

Several leaders mentioned that they expected tighter policies to remain in place for the foreseeable future, as a direct effect of the recession. A fintech leader (not an online lender) stated:

“As the recessionary environment continues, we have limited our loan exposures and tightened our lending criteria. We currently plan to continue these policies until the small business sector clearly has recovered, whenever that may be.”

Industry plays a role in determining new credit standards as well. This was not surprising — in Enigma’s State of the Small Business Lending poll in early June, many participants openly stated they were looking for better industry data to inform credit risk assessments for different industry sectors.

Multiple leaders mentioned that policies were stricter for industries that have been hit harder by the pandemic. “Going forward, we have tightened lending standards to certain industries impacted by the pandemic, such as commercial real estate and hospitality industries such as hotels and restaurants. We expect those vertical lending areas to shrink materially in the next year,” explained one community bank leader.

While tightened credit standards are a predictable reaction to the current recessionary environment, they present a tough reality for small businesses in search of new capital or credit, especially as many financial institutions believe small business lending and credit activity will not recover to pre-COVID rates for quite some time.

More than 50% of institutions do not believe small business credit and lending activities will recover by August 2021.

Very few leaders expressed confidence in a full return to pre-COVID lending and credit activities within a year. One community bank leader explained, “We clearly believe small business lending and credit activity will not substantially recover within a year... our fundamental view is that banks will increase underwriting standards and avoid aggressive lending, especially to vulnerable industries such as hospitality, commercial real estate, and energy, until there is much more visibility on the performance of deferred loan books and their ultimate impact on banks income statements and balance sheets.”

Leaders mentioned that factors such as the arrival of a vaccine, the forthcoming U.S. presidential election, the depth of government bailout activities, and business reopening timelines will all impact whether small business credit and lending can substantially recover by August 2021.

Several leaders described a challenging cycle wherein the declining state of small businesses informs the extent and availability of small business financial services, which then influences the state of small businesses (and so on). Many small businesses may not survive this recession, and a significant number of surviving small businesses will still face financial challenges, thus the general small business lending and credit economy will likely shrink.

One online lender shared their view of this dynamic: “Many SMBs have shut down and will not reopen as the same legal entity, if they reopen at all. Many lenders require 12 months of legal existence before they will lend to an SMB. Also, those SMBs that do survive will most likely have reduced cash flows, thereby reducing the amount of borrowings for which they will be eligible.”

How small business financial services continue to evolve in this recession remains to be seen. Though most leaders predict a longer recovery, these beliefs were shared by the same people who report that the current state of small business credit and lending is far from apocalyptic.

The past has shown that challenging economic climates can beget innovation. As one leader from a Top 50 bank mentioned, “there have been multiple discussions on offering customized new products for small businesses in the future.” We look forward to seeing what new services and offerings financial institutions and fintechs release in the coming months as they navigate this recession.

It’s also apparent that new kinds of data and signal will play an important role helping institutions navigate this recession. More than half of leaders agreed that their institution would benefit from additional signal to improve visibility into their small business risk exposure. This may in part be due to lagging data from traditional credit bureau providers — as one leader described, “Credit evaluation services have not kept pace with the situation, so that available credit rating information is often out of date or inapplicable.” Much like this recession will spark new product development, we expect innovative evolutions for small business data, specifically business risk and health indicators.

Did any of this post’s ideas or quotes resonate with your current experiences in small business credit and lending? We’re curious to hear from you. Get in touch to let us know what you’re seeing, and let us know if you’d like us to cover any other topics with our lending community in the future.

How Data Decay Can Spoil Your Small Business Database

Kinsey Sullivan — Mon, 17 Aug 2020 00:00:00 GMT

Despite providers’ best efforts, small business data doesn’t stay fresh for very long. Studies indicate that small business data decays as much as 22% every year.

Data decay is the gradual loss in data accuracy, coverage, and reliability. It’s simply the idea that even the best quality data will deteriorate over time.

Data decay can happen when hardware or software is damaged, but it also happens naturally as entities change and the information on record becomes outdated.
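To get a feel for what the 22% annual figure cited above implies, the sketch below compounds it over several years. It assumes a constant decay rate applied to a database that is never refreshed, which is a simplification; the function name is hypothetical.

```python
def share_still_accurate(years, annual_decay=0.22):
    """Fraction of records still accurate after `years` with no refresh.

    Assumes the same decay rate compounds every year (a simplification).
    """
    return (1 - annual_decay) ** years


after_one_year = share_still_accurate(1)     # 78% of records still accurate
after_three_years = share_still_accurate(3)  # a little under half
```

Under this assumption, fewer than half of the records in an unmaintained database would still be accurate after three years.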

For lenders and other small business service providers, this lack of reliable, fresh small business data creates problems for every department, from marketing to compliance. Data decay not only compromises your growth potential, but also creates an environment where risk can flourish.

Data decay affects critical functions across your organization — and ultimately impacts your bottom line.

Marketing vs Data Decay

If your marketing efforts run on poor quality data, it’s impossible for your message to connect with the right audiences at the right time.

Data decay results in wasted resources on outbound messaging that doesn’t get delivered, chronically low response rates, and an inability to optimize your campaigns. With deteriorated data, you could miss a company’s recent layoff events, which would signal that the business is not a good prospect or a good use of marketing spend.

In marketing, it’s easy to see the garbage in, garbage out rule in action. Decayed data means reduced marketing ROI.

Customer Onboarding vs Data Decay

Small business data decay can lead to unnecessary challenges, and even added risk, during SMB customer verification and onboarding.

Data decay might prevent you from learning about a credit applicant’s new risk factors, like a recently expired professional license. At the very least, it might require manual information gathering that slows onboarding and negatively impacts customer experience.

Being confident in the quality of your data ensures you can streamline onboarding, improve customer experience, and reduce risk.

Risk Monitoring vs Data Decay

Your risk and compliance teams want to see new issues as they arise - not months later. The only way to achieve this type of monitoring is with fresh, reliable data.

Outdated or unreliable data creates an incomplete picture of risk. With decaying data, you’ll miss out on time-sensitive risk signals - like bankruptcy alerts and business closures - that enable you to proactively identify and respond to at-risk businesses.

In that way, decaying data is a liability.

Data decay means wasted resources, compromised customer experiences, and elevated risk. This is precisely why Enigma prioritizes freshness and accuracy in our small business data. Our data refreshes regularly (up to bi-weekly) to ensure we’re providing accurate insights about every small business.

If you are curious to learn more about Enigma’s small business data, please reach out for a demo.

How to Find New Small Business Credit Risk Signals

Enigma — Wed, 05 Aug 2020 00:00:00 GMT

It’s news to no one that U.S. small businesses are experiencing tremendous volatility in 2020. An estimated 1,000,000 very small businesses (those with fewer than 9 employees) will eventually close, and 100,000 small businesses have already closed permanently due to COVID-19 related challenges. Small businesses are not alone in experiencing extreme strain; it also affects the financial services providers that extend loans, credit, and other services to these businesses.

Within this dramatically different economic reality, a common theme has emerged amongst financial institutions and fintechs: Traditional credit bureau data does not provide enough risk signal on small businesses.

High-latency credit reports are not delivering timely insight for assessing small business risk exposure. Former Kabbage Chief Risk Officer Kaustav Das described the data realities of the COVID-19 era: “Data freshness becomes increasingly important. You cannot rely only on bureaus, you need more real-time data like bank data, processing data, or other alternative data that captures recency.”

Das isn’t alone in these beliefs about data freshness and diversity. When Enigma polled small business credit risk leaders for our June 2020 State of Small Business Lending Report, 1 in 3 respondents described themselves as actively looking for new credit risk signals.

To help you evaluate which kinds of data are a fit for your organization, we’ve created an overview of some of the newest credit risk signals available.

Small business credit risk signals

Transactions Stability

Looking at the frequency and stability of credit card transactions at a business can provide a strong signal for whether that business is actively operating. An absence of any credit or debit card transactions can signal a disruption in operations.

Bankruptcies, Layoffs (WARN Act Notifications) data

There are few attributes that more definitively signal distress than Business Bankruptcies and Employee Layoffs data. Bankruptcies data is sourced from official court documents, while Employee Layoffs data draws upon WARN Act Notifications, which require businesses to publicly declare layoffs of a certain magnitude.

(Accurate) Industry Classification data

Many credit risk professionals in our June 2020 poll said they needed better industry data to understand how resilient or at-risk their portfolio businesses are. While industry classification data has long been available, the reality is that some industry data can be inaccurate or not specific enough to provide enough insight. Accurate, granular industry data lets you better understand the risk of any business you’re engaging with.

Spending and transactions data

Transaction data provides a window into how a business is operating. Tracking transaction data over time can alert you to abnormal revenue patterns or decreases in business activity, enabling you to predict defaults or distress further in advance.
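As one possible illustration of spotting an abnormal revenue pattern, the sketch below compares the latest month against a business's own recent history using a z-score. The threshold of 2.0, the function name, and the sample figures are assumptions, and real monitoring would also need to account for seasonality.

```python
from statistics import mean, stdev


def is_abnormal(history, latest, z_threshold=2.0):
    """Flag `latest` if it deviates from the historical mean by more than
    `z_threshold` sample standard deviations.

    history: at least two prior monthly revenue figures.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > z_threshold


# Made-up monthly card revenue, in thousands of dollars.
history = [50, 52, 48, 51, 49, 50]

sharp_drop = is_abnormal(history, 22)   # far below the usual range
normal_month = is_abnormal(history, 51)  # within the usual range
```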

Foot traffic data

Pedestrian foot traffic can also provide a window into the health of brick and mortar businesses. You can benchmark recent foot traffic activity against historical patterns to see how a business may be doing — based on traffic to that business or the broader pedestrian activity within a specific geography.

These are just some types of data you can explore and test for signal on small business credit risk. While financial institutions are facing significant challenges alongside their small business customers, there is now an opportunity to evolve new, innovative data strategies. Expanding beyond traditional credit report data will inevitably result in insights that don’t just mitigate risk, but also lead to new products and better customer experiences.

If you are curious to learn more about the types of small business credit risk signal Enigma provides, please reach out for a demo.

Over $1B in PPP Loans Went to Now-Closed Businesses

Madeline Ross — Thu, 30 Jul 2020 00:00:00 GMT

It’s rare to see how our government dollars are working in real time. However, the U.S. Treasury recently released data on loans granted by the recently-established Paycheck Protection Program (PPP), providing unique visibility into how a huge government economic intervention is playing out right now.

After many calls for transparency, the US Treasury released fairly detailed data on the 700,000+ businesses that received $150,000 or more in PPP loans (less detailed data was also released about smaller loan recipients). Large PPP loan recipient businesses account for less than 20% of all loans, but about 75% of allocated PPP loan dollars.

At Enigma, we were interested in learning more about who these 700,000+ businesses were and how they’re faring now. We ingested the PPP large loan recipient dataset and enriched it with our own data about small businesses. To see how many businesses had shut down for good, we joined PPP data with our new Business Closures dataset, a proprietary distress signal that tracks businesses’ operating statuses in near real-time. We did a quick analysis and here’s what we found:

About 3,000 PPP large loan recipients have closed.

Using our Business Closures data, Enigma calculates that nearly 3,000 businesses that took more than $150,000 in PPP loans have since closed. Fewer than 1% of all PPP large loan recipients across the U.S. are now marked as closed in Enigma’s data, with substantial variation from state to state. What remains to be seen is how many businesses stay in operation as the U.S. continues to grapple with COVID-19 and its economic effects.

Nearly 150,000 jobs may have been lost due to closed PPP recipient businesses.

The numbers look a bit more bleak when you consider the implications for jobs. The PPP dataset discloses how many jobs each recipient business hoped to retain, so we can calculate how many jobs were not retained by the roughly 3,000 now-closed businesses. In California alone, almost 20,000 jobs have been lost by closed PPP recipients, followed by 12,000 jobs lost in each of Texas and New York.

Between $1.1 billion and $2.6 billion was granted to now-closed businesses.

We can also examine how much money was loaned to now-closed businesses. Because loan amounts were disclosed only in ranges, we have a range in estimates for money distributed to closed businesses: a minimum of $1B and potentially as much as $2.6B. While this represents a small percentage of the hundreds of billions of dollars distributed in PPP loans, it is still a staggering amount. We’re curious to see how many large loan recipients continue to operate over the next few months, knowing that economic turmoil and contraction are predicted by many experts.
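The bounding logic can be sketched as follows: sum the lower end of each loan's range for the minimum and the upper end for the maximum. The range bands below are our recollection of the bands in the public release and should be treated as an assumption, and the counts are invented for the example.

```python
# Loan range bands as (minimum, maximum) dollars; treat these as assumptions.
LOAN_RANGES = {
    "$150,000-350,000": (150_000, 350_000),
    "$350,000-1 million": (350_000, 1_000_000),
    "$1-2 million": (1_000_000, 2_000_000),
    "$2-5 million": (2_000_000, 5_000_000),
    "$5-10 million": (5_000_000, 10_000_000),
}


def estimate_total_range(closed_counts):
    """Return (low, high) bounds on total dollars loaned to closed businesses.

    closed_counts maps a range label to the number of closed recipients in it.
    """
    low = sum(LOAN_RANGES[label][0] * count for label, count in closed_counts.items())
    high = sum(LOAN_RANGES[label][1] * count for label, count in closed_counts.items())
    return low, high


# Hypothetical counts of closed businesses per band (not the real figures).
closed_counts = {"$150,000-350,000": 3, "$350,000-1 million": 2, "$1-2 million": 1}
low, high = estimate_total_range(closed_counts)
```

Because every loan could sit anywhere inside its band, the true total lies somewhere between the two bounds.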

We refresh our business closures data bi-weekly, so we’ll update you from time to time on how the landscape of large PPP loan recipients changes. Will more businesses close? How many jobs will be impacted by those potential closures? And how much money will ultimately go to now-closed businesses?

If you’re a financial institution that’s interested in our near real-time business closures data, we’d be happy to speak with you. If you’re a media organization that would like to learn more about this data, please get in touch.

What is an Industry Classification Code?

Kinsey Sullivan — Wed, 29 Jul 2020 00:00:00 GMT

Understanding a business’s industry sounds straightforward, but that basic piece of information is surprisingly difficult to define, especially for small businesses.

A company’s industry provides context about nearly every aspect of its operations and how it makes money. It’s a key that helps unlock a wealth of data about a business. An industry classification code is a label or definition for a business’s industry, using a word- or digit-based categorization system.

There are many different approaches to defining a company’s industry, with different industry classification systems that have originated from different eras and organizations. Most organizations rely on one of the following four common systems.

NAICS

The North American Industry Classification System is a 2-to-6-digit code used to classify businesses by industry. This expansive, detailed system is a leading standard for industry classification, mainly in the United States. The NAICS code is especially valuable for businesses seeking government grants and certifications.

However, the expansiveness of NAICS codes can result in complex, broad industry groupings. For example, a single NAICS code stands for “Administrative and Support and Waste Management and Remediation Services.” This includes everything from septic tank cleaning to temp agencies.

For many uses, a NAICS code provides a good starting point but lacks granularity, which can result in increased risk exposure if you don’t have a complete understanding of a business’s activities.

GICS

GICS, or Global Industry Classification Standard, is an industry classification system used by financial institutions and systems around the world. The GICS organizes companies first into one of 11 sectors, then into increasingly detailed industry groups, industries, and sub-industries.

A full GICS classification is an 8-digit code with a text description that reflects this hierarchy of detail. The first two digits refer to a sector; the first four refer to a sector and industry group; the first six digits reflect sector, industry group, and industry; and the full 8-digit code reflects sector, industry group, industry, and sub-industry.
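Because each level is a prefix of the next, the hierarchy can be read straight off the code. A minimal sketch follows; the function name is hypothetical and the sample code is used only to illustrate the slicing.

```python
def gics_levels(code: str) -> dict:
    """Split an 8-digit GICS code into its four nested prefixes."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected an 8-digit GICS code")
    return {
        "sector": code[:2],           # first two digits
        "industry_group": code[:4],   # first four digits
        "industry": code[:6],         # first six digits
        "sub_industry": code,         # full eight digits
    }


print(gics_levels("45102010"))
```

Two companies in the same industry group will share the first four digits even if their industries and sub-industries differ.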

This classification system provides accurate industry definitions, but that accuracy comes at a cost: it sometimes fails to capture the full picture.

SIC

The Standard Industrial Classification is a 4-digit code developed and used primarily by the U.S. government to define industry areas. It has been largely replaced by the NAICS system, although some major agencies, from the U.S. Census Bureau to the SEC, continue to use SIC codes.

This system is older, so it struggles to capture new and emerging fields, and still leans toward manufacturing rather than service industries.

If you’re looking for data on contemporary, technology-based businesses, you may find that the SIC system is limited.

Enigma

Enigma’s industry classification system is designed to address some of the accuracy, coverage, and usability gaps these systems present, but still maps to other common classification systems.

Our system is designed to give reliable insight into business model and risk. Enigma’s industry classification system reflects contemporary business models, such as hybrid online and brick-and-mortar companies, and provides details about how a business operates.

Data about small and mid-sized businesses tends to vary in quality, so this is the segment where industry classification codes are most likely to be inaccurate.

In testing, Enigma’s industry data outperforms incumbent data providers by 2-3X in terms of accuracy and coverage of small businesses. By September, our industry classification coverage will extend to 80% of small businesses.

You can demo Enigma’s Industry Classification data via our API or batch upload tool. Access it here.

Team Spotlight: Applied Technologies at Enigma

Juliana Sullam — Wed, 22 Jul 2020 00:00:00 GMT

Enigma’s Applied Technologies team is responsible for delivering scalable, highly available solutions to our largest customers. We interviewed senior software engineer Clinton Monk about his experiences developing a screening API to prevent money-laundering for some of the largest U.S. financial institutions.

What’s unique about this team?

Broadly speaking, we're a very highly collaborative team — we share all the responsibilities. At the same time, everyone on the team is expected to own at least some features, which means you’re working with Product to understand the requirements, you are writing the engineering designs, and creating the plan for implementation, all while ensuring it meets our customers’ needs.

The stakes are uniquely high for our team and our work. For one thing, we're integrated into our customers’ critical systems. If something goes wrong with our API, then it's a big deal; it’s going to affect our customers’ bottom line. We process PII and need to handle it carefully and responsibly according to proper data handling policies. Lastly, we also have a very high throughput SLA for our API because we need to process many transactions and accounts at the same time. These all present difficult, distinct technical challenges.

Can you speak about the throughput challenge a bit more?

It was about scalability. We needed to meet an SLA of 2000 requests per second.

First, we set up a distributed load testing suite by deploying Locust to an ECS cluster in AWS. We then added additional CloudWatch metrics to measure latency of specific Python modules in our API. This let us measure current throughput as well as measure the effects of changes.
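As an illustration of per-module latency measurement (not the team's actual code), here is a minimal sketch that times a function with a decorator and keeps the samples in memory. The metric name and the `normalize_name` function are hypothetical; in a setup like the one described, the recorded values would be published as custom metrics rather than stored in a dictionary.

```python
import functools
import time
from collections import defaultdict

# In-memory store of latency samples in milliseconds, keyed by metric name.
LATENCIES_MS = defaultdict(list)


def timed(metric_name):
    """Decorator that records how long each call to a function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the sample even if the wrapped function raises.
                elapsed_ms = (time.perf_counter() - start) * 1000
                LATENCIES_MS[metric_name].append(elapsed_ms)
        return wrapper
    return decorator


@timed("screening.normalize_name")
def normalize_name(name):
    # Hypothetical stand-in for one step inside the request handler.
    return " ".join(name.lower().split())


normalize_name("  ACME   Holdings ")
samples = LATENCIES_MS["screening.normalize_name"]
print(f"{len(samples)} sample(s), latest {samples[-1]:.3f} ms")
```

Timing individual steps this way makes it possible to compare the same metric before and after a refactor under identical load.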

Next, we started making changes. Testing with only one API instance, we looked for bottlenecks in the API request handler. We identified a few, refactored those functions, ran acceptance tests to ensure the functionality remained the same, and then ran the load tests to measure the improvements. We applied this same process to the gunicorn application server as well, testing different worker types as well as number of workers and threads. We continued these changes until we had addressed all of the easy or obvious improvements.
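For readers unfamiliar with gunicorn tuning, the knobs mentioned here (worker type, number of workers, number of threads) live in a Python config file. The values below are illustrative starting points only, not the configuration the team settled on.

```python
# gunicorn.conf.py (illustrative values only)
import multiprocessing

bind = "0.0.0.0:8000"

# A common starting point suggested in the gunicorn docs: (2 x cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# "gthread" gives each worker a thread pool; "sync" is the default worker type.
worker_class = "gthread"
threads = 4

# Seconds a worker may stay silent before it is killed and restarted.
timeout = 30
```

Each candidate combination would then be load tested, since the best worker type and counts depend on whether requests are CPU-bound or spend most of their time waiting on I/O.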

We then started scaling out the number of API instances. In doing so, new challenges arose. We needed to scale out other parts of our stack as well, including our message queues and our Elasticsearch cluster.

The challenge came down to a lot of investigation and exploration. We had the freedom to think about the problem and then find and select tools and approaches that helped us solve it.

What’s another technical challenge you’re proud of?

Single sign-on (SSO) and authentication comes to mind. Our customers want to use single sign-on to access our front end. So the challenge was, how do we enable that? I needed to read up on what our options were. This led me to reading a lot about AWS Cognito user pools, identity pools, and different identity provider configurations. We ended up going with SAML for SSO, and designing a standalone UI with separate URLs for each customer.

For API authentication, we needed an authentication scheme more secure than just API keys. We went with a public-secret key pair, where the secret is never sent over the wire. We felt like this was the safest option. We can give a key pair for each integration, which reduces security risk and provides more granular support.
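The interview does not say which scheme was used. One common way to authenticate without sending the secret over the wire is HMAC request signing, sketched below as an assumption: the client sends a public key ID plus a signature, and the server recomputes the signature from its own copy of the secret. The key ID, secret, path, and message layout are all hypothetical.

```python
import hashlib
import hmac
import time


def sign_request(secret: bytes, method: str, path: str, body: bytes, timestamp: str) -> str:
    """Return a hex HMAC-SHA256 signature over the request details."""
    message = b"\n".join([method.encode(), path.encode(), timestamp.encode(), body])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   timestamp: str, signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    expected = sign_request(secret, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)


# Client side: the public key ID travels with the request, the secret does not.
public_key_id = "key_123"      # hypothetical identifier
secret = b"not-a-real-secret"  # hypothetical secret, shared out of band
timestamp = str(int(time.time()))
body = b'{"name": "ACME Holdings"}'
signature = sign_request(secret, "POST", "/screen", body, timestamp)

# Server side: look up the secret by public_key_id, then verify the signature.
is_valid = verify_request(secret, "POST", "/screen", body, timestamp, signature)
print(is_valid)
```

Issuing a separate key pair per integration means a single compromised secret can be revoked without affecting other integrations.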

With the challenges I’ve mentioned, it's not like there were always best practices that you just adopt and follow. There were many options. It was about doing a survey of what is possible — exploring and understanding all of those options. I had to pick the option that best met our needs and then determine the path to implement it.

All of our challenges require a lot of problem solving and an investigatory sensibility. You have to be able to holistically examine problems and options for solutions.

What do you want people to know about the Applied Technologies team?

The work we take on is incredibly diverse. Our team owns the entire stack for our platform. An engineer might spend one day working in the Python application itself. The next day, they might be working on a data workflow in Airflow. The next day, they might be deploying new AWS infrastructure using Terraform and CI. There’s a big range.

We’re not sales engineers. This isn’t a role where you take a pre-built solution and deploy it. We build the system. In my case, we built the API, the platform, and the architecture for our customers. If you join the team, you have ownership over the platform and the roadmap ahead. Likewise, if there are problems, we fix them. This is a team for people who want to be actively building software.

People who enjoy solving problems do well here. We look for people who are not afraid to ask questions, to probe into systems, to test their assumptions. Someone who’s curious to look into things and understand systems well enough to make better recommendations will thrive here. It’s a lot of responsibility for those who want it.

If you’re interested in learning more about career opportunities with the Applied Technologies team, check out our current openings.

Introducing Business Closures Data

Enigma — Wed, 15 Jul 2020 00:00:00 GMT

Update: Since this post's original publication, Enigma has retired the Business Closures attribute. We have since introduced the Transactions Stability attribute, which can also be used to identify closed businesses.

What kind of signal does Enigma Business Closures Data provide?

Business Closures data is a leading indicator of business risk or distress. Since the data reveals whether or not a business is operating week to week, this is going to be an early signal that a business may be at risk.

What makes Business Closures data unique?

There are three key factors:

Timeliness: Credit bureau data can take anywhere from one to three months to show that a business is in distress. This latency increases your risk exposure and prevents any kind of proactive intervention. In contrast, Enigma Business Closures data captures the operating statuses of businesses in near real time, ensuring you’re rapidly alerted to any newly closed businesses.
Diversity of sources: Since data on small businesses isn’t easy to find, Enigma Business Closures spans across a wide range of online sources, including local news websites and online business listings. We also include data from a number of third-party providers.
Transparency: We want to share the sources that indicate closure directly with you so you can verify these predictions. We strive to reduce the black-box nature of machine learning as much as possible.

How does Enigma model this data?

Enigma has trained a number of natural language processing (NLP) models to detect indicators of a business’s permanent closure across online and third-party data sources. Each week we analyze a variety of sources to get the most up-to-date and accurate operating status for each business.

How are organizations using this data?

Business Closures data has proven to be essential for portfolio risk monitoring, particularly during this period of economic volatility caused by COVID-19. Our customers are using this data to flag any businesses within their portfolios that transition to a “closed” operating status. With this data, financial institutions can react to potentially at-risk businesses more quickly and confidently monitor their overall risk exposure.
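The monitoring workflow described above amounts to comparing two snapshots of operating status and flagging the businesses that changed to closed. A minimal sketch, assuming each snapshot is a simple mapping from a business ID to a status string (the IDs, statuses, and function name are hypothetical and do not reflect the actual API response format):

```python
def newly_closed(previous: dict, current: dict) -> list:
    """Return IDs of businesses that are closed now but were not closed before."""
    return sorted(
        business_id
        for business_id, status in current.items()
        if status == "closed" and previous.get(business_id) != "closed"
    )


# Hypothetical weekly snapshots for a three-business portfolio.
last_week = {"biz_1": "open", "biz_2": "open", "biz_3": "closed"}
this_week = {"biz_1": "open", "biz_2": "closed", "biz_3": "closed"}

print(newly_closed(last_week, this_week))
```

Only transitions are flagged, so a business that was already closed in the earlier snapshot does not trigger a second alert.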

Marketing teams have also leveraged Business Closures data to increase campaign ROI by cleaning prospect lists and ensuring they’re only targeting open businesses.

If you’d like to learn more about our Business Closures data, you can try it for free via our API, or get in touch to request a demo.

Enigma Shipping Imports data reveals a 12% drop in imports in 2020

Nick Hershey — Mon, 13 Jul 2020 00:00:00 GMT

As of 2016, every single container that’s shipped to the US is documented in the Automated Manifest System, as mandated by the Importer Security Filing Law.

The law requires that before loading goods onto an ocean ship headed for the United States, the importer and carrier must electronically transmit data on each container to U.S. Customs and Border Protection. Each record is made public immediately and details information about a container being shipped to the US, including shipper and consignee info, as well as a description of the goods in the container.

Enigma makes Shipping Imports data available through our API. The data is useful for understanding how countries are dependent on one another for certain goods, and what businesses or industries might be vulnerable to things like country-wide sanctions, trade wars, or more recently, a global pandemic.

Given the economic effects of COVID-19, we decided to look at how the pandemic has affected U.S. shipping imports. Overall, imported container volumes are down 12% when we compare the first half of 2020 to the first half of 2019. To put it concretely, countries had exported more than 10M shipping containers to the US by June 30, 2019; for the same time period this year that number is less than 9M.

We also looked at individual exporting countries. The US imports more than 4 times as many goods from Mainland China as from any other nation. China was the earliest country to be affected by COVID, and US shipping imports from China have decreased this year by 10%. However, China has actually increased its share of exports to the US by 2% overall, as other countries have experienced large declines in their proportional share of imports.
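It can seem contradictory that a country's volume falls while its share rises. The share is the country's volume divided by the total, so it grows whenever the country declines less than the total does. A minimal sketch using the rounded figures from the text, and assuming the share figure is a relative change (the post does not say whether it is relative or in percentage points):

```python
def relative_share_change(country_change: float, total_change: float) -> float:
    """Relative change in a country's share of total imports.

    Both arguments are fractional year-over-year changes in container
    volume, for example -0.10 for a 10% decline.
    """
    return (1 + country_change) / (1 + total_change) - 1


# Rounded figures from the text: China down 10%, all imports down 12%.
china = relative_share_change(-0.10, -0.12)
print(f"{china:+.1%}")
```

With these inputs the result is a gain of a little over 2%, in line with the figure quoted above.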

If you look at Italy, the 5th largest exporter to the U.S., its share of exports to the U.S. is down 10%, and its shipping container volumes are down 20% this year. Given that all of Italy was shut down for weeks this spring, these numbers are drastic but likely realistic.

Interestingly, Hong Kong’s exports to the US were down by 27%, even though the region never experienced a full shutdown this spring. This decrease is in part due to the fact that Hong Kong’s exports were already declining in January due to a “tough trading environment”. This decline was then further exacerbated by COVID-19. You can see in the chart below that Hong Kong’s share of exports to the US is down by 18%.

It seems unlikely that imports to the US will rebound in 2020 given the global trade climate and a struggling US economy that may have lower appetite for imported goods in the near term.

As Enigma refreshes our data week to week, you can test out our API to track trends and understand the activities of specific businesses. If you come across additional interesting insights during your explorations — or have questions about how this data can be used effectively — we’d love to hear from you.

Industry Data Is Essential and Yet It’s Often Inaccurate. Here’s Why.

Pam Wu — Tue, 09 Jun 2020 00:00:00 GMT

It’s hard to predict a business’s risk without knowing its industry. Consider if a business called Self-Sufficient Coffee Co. applied for group accident insurance: if you were that insurer’s risk officer, you would naturally be concerned whether the applicant was involved in mass producing coffee, serving coffee, or both.

Industry data provides insight into how a business makes money — for example, whether a business sells a good or a service as opposed to manufacturing goods. This data helps financial institutions make smarter decisions about whether to engage with a small business, as well as how to market to, underwrite, and onboard the small businesses they serve.

Although industry classification is essential, accurately defining a business’s industry remains challenging. Why? In general, industry data is complex, and the granularity of available industry data varies significantly. In this post I’ll break down precisely why industry classification has been hard to achieve, and thus resulted in years of inaccurate or vague industry data.

Breaking down the challenges of industry classification

Selecting a business’s industry is surprisingly difficult.

When we ask for industry data, we’re really looking for insight into how the business operates, and how they make money.

Consider our hypothetical coffee shop, Self-Sufficient Coffee Co., a local hangout that sources beans, roasts in-house, and sells coffee both by the cup and in take-home bags. In the morning, the shop bakes its own pastries to accompany the coffee it sells. Depending on the activity you focus on, Self-Sufficient’s industry classification could be a specialty drinks business (NAICS 722515 – Snack and Nonalcoholic Beverage Bars) or a manufacturer of coffee (NAICS 311920 – Coffee and Tea Manufacturing). Which is correct?

One way to define a business’s primary industry is to use a quantitative standard. You could determine the business’s industry by selecting the aspect of the business that generates the highest amount of revenue or profit. Sometimes these can be at odds with each other: the coffee bar could make the highest amount of revenue, but because coffee roasting can produce large volumes of beans that are sold to other stores, coffee roasting could actually make a higher profit than the bar.
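To see how the two quantitative standards can disagree, here is a minimal sketch with made-up annual figures for the hypothetical Self-Sufficient Coffee Co. The numbers and the function name are illustrative assumptions only.

```python
# Made-up annual figures for the hypothetical Self-Sufficient Coffee Co.
activities = {
    "722515 Snack and Nonalcoholic Beverage Bars": {"revenue": 400_000, "profit": 40_000},
    "311920 Coffee and Tea Manufacturing": {"revenue": 250_000, "profit": 75_000},
}


def primary_industry(activities: dict, metric: str) -> str:
    """Pick the activity with the largest value of the chosen metric."""
    return max(activities, key=lambda name: activities[name][metric])


by_revenue = primary_industry(activities, "revenue")
by_profit = primary_industry(activities, "profit")
print("By revenue:", by_revenue)
print("By profit: ", by_profit)
```

With these figures the coffee bar wins on revenue and the roasting operation wins on profit, so the same business lands in two different industries depending on the standard chosen.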

Sometimes these factors change: if the manufacturing side expands, it may result in the coffee shop’s industry changing to manufacturing, even if it’s still better known as a coffee bar. The intangible factor of what a business is known for, or even what the owner intends the business’s industry to be, is disregarded entirely under a quantitative standard. These factors can make it difficult to validate industries from the outside. Even if you were a regular customer of the establishment itself, how would you know that this coffee shop had quietly switched industries unless you had access to its balance sheets?

Classification systems require tough decisions.

NAICS is an industry classification system that uses up to 6 digits to convey taxonomic information about industry. The first two digits identify the sector, such as manufacturing, retail trade, wholesale trade, or construction. Each additional digit splits the previous level into more specific sub-categories: the third digit identifies the subsector, the fourth the industry group, the fifth the NAICS industry, and the sixth the national industry. A 9 at a given level often marks a miscellaneous category (as an example, NAICS 713990 - All Other Amusement and Recreation Industries simultaneously encompasses laser tag, dance halls, and riding stables). This system makes sense on paper, but reality often throws a curveball.

You may have noticed something in the coffee shop example up above: one of the example NAICS codes began with 311 and the other with 722. NAICS 311920 – Coffee and Tea Manufacturing is closer in the taxonomy to NAICS 311811 - Retail Bakeries than to NAICS 722515 – Snack and Nonalcoholic Beverage Bars, where “snack” is an umbrella concept that covers baked goods and “non-alcoholic beverage” is one that covers coffee.
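One rough way to express "closer in the taxonomy" is the number of leading digits two codes share. A minimal sketch follows; the function name is hypothetical, and shared prefix length is only a simple proxy for taxonomic distance.

```python
def shared_prefix_length(code_a: str, code_b: str) -> int:
    """Count the leading digits two NAICS codes have in common."""
    length = 0
    for digit_a, digit_b in zip(code_a, code_b):
        if digit_a != digit_b:
            break
        length += 1
    return length


# Coffee and Tea Manufacturing vs. Retail Bakeries
coffee_vs_bakery = shared_prefix_length("311920", "311811")
# Coffee and Tea Manufacturing vs. Snack and Nonalcoholic Beverage Bars
coffee_vs_snack_bar = shared_prefix_length("311920", "722515")
print(coffee_vs_bakery, coffee_vs_snack_bar)
```

The coffee manufacturer and the retail bakery share their first three digits, while the coffee manufacturer and the snack bar share none, even though a customer would likely see the last two as the more similar businesses.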

Depending on how you think about industry, this may or may not sound intuitive. This is because an industry is defined by a number of factors, including its inputs and outputs (e.g. a bakery takes flour, water, yeast, etc. and converts them to bread and pastries), how the input gets converted to the output (e.g. by machine or by hand), how it gets delivered to the customer (sold over the counter, transported to another location, etc.), and what kind of customer it serves (retail establishments, direct consumers, banks, etc.). There are even factors that most industry classifications leave unaccounted for, such as the target market segment. Each of these factors can connect companies together, but if you want to make a taxonomy, you have to decide which factors will split the categories earlier and which ones will split later.

Data granularity is highly variable.

Every use case for industry data requires a different amount of detail, also known as “granularity.” It’s challenging to provide industry data that provides enough insight for most users, without becoming overwhelming. There are two ways to think about granularity: the detail of industry-level data available and the detail of industry-level data needed for a specific use case.

Again, let’s consider Self-Sufficient Coffee Co. If you’re determining whether to lend to the coffee shop, it may be enough to know that they are in the food services industry because you believe that food services businesses perform well enough to qualify for a loan. However, if you’re determining whether to insure the coffee shop, Self-Sufficient’s in-house roasting activities will likely affect your risk evaluation.

This desired level of industry granularity still has to contend with what data is available, presenting potential trade-offs between accuracy and granularity. If you require a more detailed level of granularity about a business’s industry than a data provider can offer, their accuracy becomes irrelevant. Likewise, a provider’s specificity of industry data doesn’t mean anything if they can’t deliver accuracy at that level of granularity as well.

A new option for industry classification data

We want to change how industry codes are being predicted for small and medium businesses across the U.S. Our classifiers deliver 2-3X better precision than incumbent data providers. Providing this accuracy and high levels of granularity when needed means we’re able to provide more powerful industry data than what’s been available to date.

We’ve detailed our approach to industry classification data in this earlier blog post, but you’re also welcome to explore Enigma’s industry classification data for yourself via our API.

The State of Small Business Lending: June 2020 Industry Poll

Juliana Sullam — Fri, 05 Jun 2020 00:00:00 GMT

Financial institutions that lend to small businesses have become “economic first responders”, as these institutions are working hard to resolve the extreme financial challenges presented by COVID-19. It’s no secret that many small businesses are in distress, and the destabilization they’re experiencing is sending shockwaves back to the institutions that serve them.

We decided to check in with the small business lending community to learn what’s happening on the ground at their institutions. What are these institutions experiencing right now, and what do they predict will happen in the next six months? Over the past week, Enigma polled 30 professionals across risk, credit, underwriting, and SMB business lines to get their takes on what’s happening in the world of small business lending. The poll results are summarized below.

The impact of COVID-19 on SMB lending varies widely.

There was no singular dominant response when we asked participants about how COVID-19 affected their institution’s loan or credit underwriting for small businesses. The responses instead revealed that institutions are responding to COVID-19 differently and with varying degrees of intensity.

Perhaps most heartening was the fact that more than 50% of institutions are maintaining or increasing lending to small businesses, even exclusive of Paycheck Protection Program (PPP) loans. This signifies that many institutions are able to continue to provide small businesses access to credit and capital.

At the same time, more than 40% of participants noted that their institution was moderately or significantly reducing loan or credit offers to small businesses. Thus a good number of institutions are operating more conservatively right now to reduce risk exposure.

Constrained lending behavior may not last long. About ⅓ of participants believe their institution will return to pre-COVID levels of lending activity in six months. This positive outlook is further reinforced by the fact that few participants think their institution will limit or reduce their small business customer portfolio.

While lending activity may bounce back relatively quickly, 40% of participants do believe credit or loan eligibility standards will become more stringent for small businesses in the future.

One participant stated, “The real issue is whether our credit standards will be somewhat tightened given current circumstances, by, for example, requiring lower LTV's, higher minimum debt service coverage, more equity in development projects, and stronger personal guarantees from owners and sponsors.”

Everyone agrees delinquencies will rise.

97% of participants believe small business customer delinquencies will increase. This overwhelming response is unsurprising given the widely-known statistics about small business distress. Our participants are likely watching this data play out in real time amongst their institutions’ small business customer base.

However, the majority of participants (70%) believe delinquencies will only moderately increase, as compared to 27% that said delinquencies will increase significantly. This could reflect that institutions believe small businesses are poised to recover, perhaps thanks to government interventions such as the Paycheck Protection Program.

Note that only 23% of participants believed their institution would incur significant losses due to small business distress or delinquency. We attribute this either to the respondent institutions not being over-leveraged across their small business portfolios, or perhaps feeling that these delinquencies may be resolved as soon as small businesses are able to start operating more normally.

1 in 3 institutions is interested in new data sources for small business credit risk.

When asked about looking into new data sources for small business credit risk, more than 25% of participants said their institutions were currently exploring or planned to explore new sources of credit risk data. Another 20% of respondents said their institutions may look into new risk data sources in the next six months.

Appetites for new data ranged widely, but two distinct trends were apparent. Several participants mentioned that industry data would be valuable for assessing credit risk. One participant noted that “We will seek out additional industry data for at-risk borrowers”, whereas another noted that data covering “Industries that will rebound or be resilient in [the] current environment” would be useful.

The other data trend? Real-time signals, or more timely data about how small businesses are faring given the impact of COVID-19. Participants repeatedly mentioned that more timely insight into a business’s revenue, cash flows, and other shorter-term signals would be valuable. One participant noted a need for “real time data rather than the typical bureau data that lags up to several months.”

Looking ahead

Our poll revealed that financial institutions are experiencing COVID-19 in diverse ways — and that they're applying different strategies to resolving COVID-19 challenges. We’d love to know if the poll results resonated with you and reflected what you’re seeing at your own institution. Please reach out if you have feedback or additional insights to share.

By providing small business data that includes near real-time insights about small business credit risk as well as detailed industry data, Enigma is a resource at a time when better signal on small business risk and resiliency is urgently needed. If you’re interested in learning more about our small business data, feel free to get in touch or create your own account to explore our data right now.

Announcing Enigma’s SOC 2 Certification

Enigma — Fri, 29 May 2020 00:00:00 GMT

Enigma achieves SOC 2 certification

We are excited to announce that Enigma has obtained its SOC 2 Type I certification. Enigma has always taken the security of our customer data seriously, investing in the development of carefully designed cybersecurity and data protection controls for years. Now, with this certification, we’re making it easier for customers to have confidence in our security program.

Not only did we obtain our SOC 2 Type I certification, we did so without a hitch. Our report is a “clean report,” meaning that our auditors did not identify any exceptions or issues relating to our controls.

What is SOC 2?

SOC 2 is an internationally-recognized framework developed by the American Institute of Certified Public Accountants (AICPA). It is increasingly regarded as the gold standard for validating software companies’ security compliance. To obtain a certification under this standard, auditors thoroughly evaluated Enigma’s information security, site availability, software development, HR, legal and finance practices.

Why does SOC 2 matter?

Companies rely on Enigma’s data to build models and make more informed decisions about everything from underwriting to customer acquisition to risk management. We recognize that alternative data is especially important right now, and our SOC 2 certification will make it easier for financial institutions to integrate the data they need from Enigma. We are committed to providing our customers the transparency they need, both about our data attributes and our security processes, to focus on better serving small businesses.

We are happy to share our report with current customers or prospective customers under NDA. If you’d like to obtain a copy, please reach out to security@enigma.com.

How Enigma Develops Small Business Data

Enigma — Thu, 21 May 2020 00:00:00 GMT

“It’s kind of a black box.” We hear this frequently from small business lenders about the data they’re using and what they know — and more often don’t know — about how this data comes to exist.

While opacity may be the norm for incumbent data providers, at Enigma, transparency is an operating principle. We believe you should have access to information about the data you’re evaluating and the processes through which this data was developed. This visibility is essential for understanding the strengths and weaknesses of a data provider and their approach to development.

In the spirit of transparency, here is a step-by-step overview of our process so that you can see everything we do to transform raw source data into data attributes you can access instantly.

Step 1: Define the Data Attribute

In layman’s terms, a data attribute is a piece of information about a company; it’s one piece of the puzzle.

A data attribute is a broad category, and may have multiple data points beneath it. For example, “industry” is a data attribute which includes NAICS codes, an industry text label, and flags for whether the business partakes in certain activities like ecommerce.

Ideas for new attributes often surface from conversations with current customers.

“Customers are key in our process. Ongoing inbound attribute requests from customers help us build & prioritize our roadmap,” said product manager Nick Hershey. “We have a working relationship with every single one of our customers.”

In addition to customer requests, Enigma constantly explores new data sources that will help financial institutions gain a clearer picture of the small businesses they serve.

Developing new data attributes isn’t just about understanding a small business today; it’s also about helping customers predict where the small business will be tomorrow. This is where our data science edge really becomes evident.

“The raw amount of data science horsepower we have here is very special,” fellow product manager Jordan Dominguez stated. “We have people who have PhDs who are focused on attribute forecasting. I think that that's unique to Enigma.”

Step 2: Research Data Source

We always explore a variety of data sources to determine how to deliver the highest quality attribute, unlike traditional data providers.

Pam Wu, Head Data Scientist at Enigma, explained our distinctive approach: “For us, data sourcing is key. The incumbent providers tend to use self-reported data and more standard sources. We’re able to be more creative, and only offer data that we can verify.”

Enigma’s DNA is in public data, so we’re skilled at getting value from complex government sources. We also have an appetite for alternative data, leveraging online data sources.

“Incumbent providers had access to credit histories and utility bills, but in the current climate, those data sources are too out of date for most clients,” Wu added.

Enigma’s approach to data sources allows us to provide fresher, more relevant data than many competitors.

Dominguez adds, “We want to make sure that we're exploring all available resources for an attribute, and that we're vetting that data quality from the very beginning. Sometimes it's not so much a case of finding the one perfect source...you have to triangulate across two or three sources.”

Step 3: Quality Assurance

After the team has identified the best sources for an attribute, we begin a rigorous quality assurance process. The QA process exists to deliver clear, valuable information that customers can use.

To achieve this, Enigma focuses special attention on entity resolution and accuracy.

Entity resolution is all about making sure that the data attributes are connected to the right small businesses. One of Enigma’s key value-adds, according to Hershey, is our ability to tie all the different pieces of data together to a company in a persistent way.

“We persist entities through the entire lifecycle of the business, even as aspects of them (like industry or acquisitions) might change,” Hershey explained.
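
A minimal sketch of what persisting an entity can look like, under loud assumptions: the `EntityRegistry` class, the name-plus-ZIP matching key, and the `ent-` ID format are all hypothetical illustrations, not Enigma's actual method. It shows only the idea that a business can change an attribute such as industry while still resolving to the same persistent ID.

```python
import re
from itertools import count


def match_key(name: str, zip_code: str) -> str:
    """Build a crude matching key: lowercased name without punctuation or legal suffixes, plus ZIP."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = [w for w in cleaned.split() if w not in {"llc", "inc", "co", "corp"}]
    return " ".join(words) + "|" + zip_code


class EntityRegistry:
    """Hypothetical registry that hands out one persistent ID per matching key."""

    def __init__(self) -> None:
        self._ids = {}
        self._attributes = {}
        self._counter = count(1)

    def resolve(self, record: dict) -> str:
        key = match_key(record["name"], record["zip"])
        if key not in self._ids:
            self._ids[key] = f"ent-{next(self._counter):06d}"
        entity_id = self._ids[key]
        # Later records update attributes such as industry; the ID never changes.
        self._attributes.setdefault(entity_id, {}).update(record)
        return entity_id

    def attributes(self, entity_id: str) -> dict:
        return self._attributes[entity_id]


registry = EntityRegistry()
first_id = registry.resolve({"name": "Joe's Pizza, LLC", "zip": "10001", "industry": "restaurant"})
second_id = registry.resolve({"name": "JOES PIZZA", "zip": "10001", "industry": "catering"})
other_id = registry.resolve({"name": "Main Street Books", "zip": "10001", "industry": "retail"})
```

Here the first two records resolve to the same ID even though the spelling and the industry differ, while the third record gets its own ID. A real system would need far more robust matching than a normalized name and ZIP code.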

Accuracy is essential, and we consider accuracy from many different perspectives. For example, how often is the data attribute correct? What aspects of the data might be missing?

“We don’t just rely on self-reported data,” Wu also explained. “We collect our own data, run our own tests, and we are very harsh self-critics. As a result, we see 80 percent or higher precision on our data attributes.”
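
To make the precision figure concrete, here is a small worked sketch. The sample counts are made up for illustration and are not Enigma's results; precision is used in its standard sense, the share of asserted attribute values that turn out to be correct when checked.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of asserted values that were verified as correct."""
    asserted = true_positives + false_positives
    if asserted == 0:
        raise ValueError("precision is undefined when nothing was asserted")
    return true_positives / asserted


# Illustrative numbers only: 200 attribute values checked by hand, 168 correct.
sample_precision = precision(true_positives=168, false_positives=32)
print(f"{sample_precision:.0%}")  # 84%
```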

Step 4: Release

Some data attributes are developed in as few as two weeks, but most are developed in 1-3 months. When compared to the incumbent providers, who can take years to release a new attribute, this is remarkably fast.

When the data attribute is first ready, Enigma releases a public beta version and engages with interested customers to test it.

After launching, the data is refreshed regularly - sometimes every day, depending on the attribute - via our publicly accessible API. Enigma’s API makes it easy to use the data: customers can instantly access the data and seamlessly integrate it into their databases, processes, and statistical modeling without friction.

Each data attribute is ultimately offered a la carte to ensure that customers get the exact attributes they need. As Hershey explained, “Most data providers charge tens of dollars for a report with a lot of information some customers don’t end up using; Enigma charges just cents [per API call] for a data attribute so each customer can get exactly what they want.”

Step 5: Iterate and Expand

At Enigma, developing data attributes is a dynamic, iterative process. This process is designed to not only improve the currently available attributes, but also expand the ways we tell the story of a business, in terms of both quality and breadth of data available.

Our culture of rich customer involvement fuels the improvement of existing data attributes. Feedback from customers enables us to understand how attributes are performing in real-world processes, and how we can make the data more valuable.

The result is fresh, reliable data about small businesses that financial institutions can seamlessly implement into their risk management, monitoring, and readiness processes.

As COVID-19 continues to challenge the small business economy, our ongoing data attribute development, especially our credit risk data, is making it easier for financial institutions to serve small businesses.

“This customer focus, and the speed with which we can spin up data attributes, is more relevant than ever right now with COVID-19’s effect on small businesses,” Hershey said. “Our goal is to build a story around a small business - the full, complete story.”

Introducing Industry Classification Data

Jordan Dominguez — Thu, 14 May 2020 00:00:00 GMT

Understanding what industry a business is in proves to be relevant for nearly every aspect of customer onboarding, underwriting, and monitoring. Without knowing how a business makes its money, it’s difficult to answer any of the following questions:

What is the success/failure rate of similar businesses?
Are the cash flows of this business healthy based on the industry?
Does this company engage in any prohibited or risky activities?
How will this company use my services?
How resilient is this company to economic downturns? How much might it benefit from economic growth?

In short, industry helps put everything else you know about a business in context. It tells you whether a business having a website is common or uncommon, or whether a business is relatively small or large.

The fundamental value of identifying a business’s industry is exactly why Enigma is excited to introduce our new, high-accuracy industry classification data.

Industry classification has failed financial institutions for years.

Right now, nearly every financial institution that serves small businesses cites industry classification as one of its main challenges. When interviewing institutions, we found that they struggled with poor accuracy rates, antiquated taxonomies, insufficient granularity, and low coverage of small businesses.

Rampant inaccuracy

Some financial institutions cited accuracy rates ranging from 25-40%, while others provided examples such as only knowing a business engaged in retail, but not knowing what that business sold. High inaccuracy rates result in financial institutions relying on manual research, creating significant inefficiency.

Systems that don’t reflect today’s businesses

We also learned about key failings in today’s industry taxonomies. NAICS, the standard taxonomy used by most institutions, is expansive and detailed. But it groups businesses in old-fashioned and unintuitive ways, leading to nonsensical groupings of businesses. My personal favorite 2-digit NAICS is “Administrative and Support and Waste Management and Remediation Services”, which covers everything from septic tank cleaning to temp agencies. Other institutions use the GICS industry taxonomy, which provides more common-sense groupings but can lack granularity.

Insufficient granularity

Granularity provides details that are often crucial to understanding if and how you want to work with a business. For example, a construction company can be a one-person plumbing contractor, a home remodeling agency, or a commercial construction company building skyscrapers. Each of these hypothetical businesses presents distinct risks and opportunities, so details that go deeper than “construction company” are essential. 6-digit NAICS codes can provide much-needed detail, but they’re inconsistent in specificity, and the distinctions they draw range from essential to trivial. In other words, among the 1000 6-digit NAICS codes, a single code can be the difference between a parking lot (812930) and a pet care service (812910); but these codes can also distinguish between minutiae such as whether a company engages in Dimension Stone Mining and Quarrying (212311) or Industrial Sand Mining (212322).
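
The granularity trade-off follows from how NAICS codes nest: each shorter prefix of a 6-digit code names a broader grouping (2 digits for sector, 3 for subsector, 4 for industry group, 5 for NAICS industry, 6 for national industry). The sketch below uses two hypothetical helpers, `naics_levels` and `shared_prefix_digits`, to show how far apart the example codes above sit in that hierarchy.

```python
LEVEL_NAMES = {2: "sector", 3: "subsector", 4: "industry group",
               5: "NAICS industry", 6: "national industry"}


def naics_levels(code: str) -> dict:
    """Split a 6-digit NAICS code into its nested prefixes."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit NAICS code, got {code!r}")
    return {name: code[:digits] for digits, name in LEVEL_NAMES.items()}


def shared_prefix_digits(a: str, b: str) -> int:
    """Count how many leading digits two codes have in common."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


# Parking lot vs. pet care: identical down to the 4-digit industry group.
print(shared_prefix_digits("812930", "812910"))  # 4
# The two mining codes also diverge only after the fourth digit.
print(shared_prefix_digits("212311", "212322"))  # 4
```

Both pairs split at the same depth, yet one split separates very different businesses and the other separates near neighbors. That is the inconsistency in specificity described above.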

Low small business coverage

Financial institutions have also told us that they have very low fill rates when it comes to identifying industries for small businesses. This low coverage can again result in more manual research, but also potentially jeopardizes the extent to which you can onboard and service small business customers.

Building accurate, powerful industry data

Overall, financial institutions are in a bind - forced to choose between accurate high-level classifications that lack important details, or less accurate and overly noisy granular classifications. We knew that this was a problem we had to fix, especially because accurate industry data makes our other data attributes even more powerful.

We set out to build our own industry classification system based on the following principles:

Unmatched accuracy
A modern and intuitive way of segmenting companies
Details that added insight instead of noise
High coverage

High accuracy through advanced data science

So far, our industry classification is achieving accuracy rates 2-3X higher than those of incumbent providers. This has major implications for financial institutions: readily available and accurate data about every business’s industry will help them reduce risk exposure and related losses. Accurate industry data also allows institutions to minimize resources spent on manual research, a significant operational inefficiency.

We’ve attained this accuracy by building predictive models that reflect how a human would classify companies into an industry. We heard from many different companies that they spend countless hours and resources plugging business names into search engines and combing through search results and business data aggregators to get detailed information. Based on this, we automated the manual investigation process. By leveraging online and other public information about businesses, along with advanced linguistic models, we’re now able to replicate the human research process and classify industries far more accurately than current providers. Our internal accuracy bar is strict: we don’t release an industry category until it achieves 85% accuracy or higher.
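
The release bar described above amounts to a simple gate. In the sketch below, only the 85% threshold comes from the text; the category names, accuracy figures, and the `releasable` helper are invented for illustration.

```python
RELEASE_THRESHOLD = 0.85  # from the text: categories below 85% accuracy are not released


def releasable(accuracy_by_category: dict, threshold: float = RELEASE_THRESHOLD) -> list:
    """Return the categories whose measured accuracy meets the release bar, sorted by name."""
    return sorted(
        category
        for category, accuracy in accuracy_by_category.items()
        if accuracy >= threshold
    )


# Hypothetical evaluation results, not real measurements.
measured = {"restaurants": 0.91, "construction": 0.85, "retail": 0.79}
print(releasable(measured))  # ['construction', 'restaurants']
```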

Classification that makes sense for modern businesses

Based on what we learned from our customers, we built an industry taxonomy that provides common-sense groupings of companies and reflects modern business models. We’re also integrating operations flags to capture how modern businesses operate (more details below).

Details, not noise

Our industry classification data provides as much granularity as a 4-6 digit NAICS code, while maintaining the high level of accuracy detailed above. We focus on providing detail for industries where it’s beneficial, not where it creates additional noise.

The coverage you’d expect from an SMB data provider

Lastly, our industry coverage extends to even the smallest businesses in ways that incumbent providers cannot. On average our fill rates are 10% higher, and we expect that number to get better as our small business data expands.

Real-life example: Enigma's industry classification vs. 3 other providers

This is just the beginning

In the next 2 weeks, our coverage will expand from 20 industries to roughly 50 industries, covering 60% of US small businesses. Further coverage expansions are planned in the coming months.

We will also be adding further nuance to industries in the form of operations flags that detail specific business activities about how companies offer goods and services. There’s more to come on this front, but we believe operational details will help our customers get an even deeper understanding of businesses and their associated risks and resiliencies.

Getting started

Enigma’s beta industry classification data is available right now via our API. You can test the data for free, and we welcome your feedback. If you have any questions or would like to learn more about our industry classification data, please reach out.

Quick Take: What We’re Learning from Small Business Lenders Right Now

Kinsey Sullivan — Fri, 08 May 2020 00:00:00 GMT

If current pandemic-induced lockdowns continue through June, more than thirty percent of small businesses may be forced to close permanently, according to a study by Main Street America. While small businesses experience tremendous challenges, this is also an unprecedented time for the lenders serving small businesses during COVID-19.

As we’ve spoken with small business lenders nationwide, on- and off-the-record, clear patterns emerged:

PPP fraud risk is a primary concern for lenders.
Lenders may struggle to balance risk mitigation and growth strategy; today, risk monitoring is the priority.
Access to new, alternative forms of data is critical.

Fraud Risk on the Rise

The Paycheck Protection Program provides valuable resources to businesses in need, but lenders are concerned about fraud- and fairness-related risks.

When it comes to PPP, time pressure, lack of processes, and application volume combine to increase risk exposure. In 2019, the SBA issued $28B total. Now, lenders are being asked to issue more than $500B on behalf of the SBA in just 8 weeks. One bank received 53,000 PPP applications on April 3, the day after the program launched.
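
A quick back-of-the-envelope calculation shows the scale of that jump. It uses only the figures above, treats $500B as a floor (the text says "more than"), and spreads the 2019 total evenly across 52 weeks as a simplifying assumption.

```python
SBA_2019_TOTAL_B = 28   # $B issued over all of 2019
PPP_TARGET_B = 500      # $B to issue (a lower bound)
PPP_WEEKS = 8

weekly_2019 = SBA_2019_TOTAL_B / 52
weekly_ppp = PPP_TARGET_B / PPP_WEEKS

print(f"2019 pace: ${weekly_2019:.2f}B per week")        # $0.54B per week
print(f"PPP pace:  ${weekly_ppp:.1f}B per week")         # $62.5B per week
print(f"Ratio: about {weekly_ppp / weekly_2019:.0f}x")   # about 116x
```

In other words, the weekly volume lenders are being asked to handle is on the order of a hundred times the 2019 average.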

Policies and processes are being created and tested in real-time, while existing systems may not support demand. Some lenders have had to rely on manual verification processes, although such efforts are unlikely to effectively manage risk.

Risk concerns have proven to be well-founded. On May 5, news broke of the first PPP fraud-related arrests, while separately, Wells Fargo is now being investigated for its PPP loan management.

To manage risk, one Director of Compliance with 15+ years of experience in BSA/AML and Fair Lending recommended that lenders establish compliance procedures and outline who is responsible for managing PPP compliance.

Some lenders also expressed another fraud-related concern: namely that, in a pinch, consumers will use their higher-line business credit cards instead of their personal cards.

Risk Mitigation Efforts Supersede Growth Goals

With the future of millions of small businesses in jeopardy, lenders aren’t sure what percentage of their portfolio will be impacted - or how much. As a result, many lenders aren’t focusing on attracting new SMB customers. Instead, they’re focused on monitoring their existing portfolio and reducing exposure to risk and loss.

When growth does become a priority, lenders will focus on businesses and industries that have shown themselves to be resilient. Experts also anticipate that the process of approving small business loan applications will transform. In the past, lenders wanted to attract and approve small business loan applicants. Post-COVID-19, lenders will be more selective about approving applications, setting more stringent parameters for loan or credit qualification.

Access to New Data is More Important

For monitoring, underwriting, and recession readiness processes, access to new types of data is now critical.

“Inevitably, banks will have to adjust their data and methodologies to reflect the new normal,” according to a McKinsey article.

Credit bureau reports aren’t enough anymore; lenders will need to use a variety of sources, new data points, and fresher, more frequently updated data to get a clear picture of risk and opportunities.

“Need of alternative data or alternative bureau data, along with smart capabilities, has taken the center stage in this pandemic-driven recession,” said Petal’s Chief Risk Officer Kaustav Das in a recent interview with Enigma.

Das also highlighted the need for data freshness. “You cannot rely only on bureaus, you need more real-time data like bank data, processing data, or other alternative data as that captures recency.”

Lenders shared that this data will be valuable in the longer term as well, not just for COVID-19 recovery.

To our lender community: we’re curious to hear about your experiences in SMB recovery and risk mitigation, and whether the points above resonate. Please get in touch if you have a point of view to share.

Enigma has made resources and new risk signals available to lenders; to learn more, visit our COVID-19 hub.

Introducing Business Bankruptcies Data

Lior Zichron — Fri, 01 May 2020 00:00:00 GMT

We had a surprising realization while developing our COVID-19 risk signals: despite being the most official indicator of business distress, bankruptcy data has not been widely integrated by lenders.

For loss mitigation, bankruptcy data serves as the last line of defense. Even the best underwriting and monitoring programs will inevitably overlook some at-risk businesses. At the same time, some distressed businesses will fail to inform their lenders about bankruptcies. Both scenarios leave lenders exposed to loss. Timely bankruptcy data enables lenders to react while they still can, by surfacing risky businesses, preventing credit line misuse, and minimizing losses.

It’s with all of this in mind that we’re releasing our Business Bankruptcies data attribute today. Enigma’s business bankruptcies data originates from court records and updates daily. The attribute’s high refresh rate enables lenders to track bankruptcies in near real-time.

We know that using this data successfully requires being able to identify your customers within the bankruptcy records. We’ve leveraged our proven entity resolution capabilities to specifically ensure that no bankrupt business in our customers’ portfolio will go unflagged.

Our zero-latency business bankruptcies data is available now and offers national coverage. Sign up for access via our API, or contact us for a free trial.

Discussing SMB Credit Risk and COVID-19 with Kaustav Das, Chief Risk Officer of Petal

Kinsey Sullivan — Wed, 22 Apr 2020 00:00:00 GMT

Kaustav Das is the Chief Risk and Analytics Officer at Petal, an alternative credit card fintech. Previously, he served as the Chief Risk Officer at Kabbage, and spent 15 years in credit and fraud risk at American Express.

We spoke with Kaustav about how risk is evolving during COVID-19, the experiences of small business lenders, and the future of the SMB lending landscape.

How do you think risk will evolve in the aftermath of COVID-19, both in terms of how risk is defined, but also in terms of how policies and practices will change?

The words of a top bank’s CEO summed it up when he said that he wished there was a handbook for a pandemic. Everyone prepared for a recession, but no one prepared for a pandemic, or a recession that's driven by a pandemic. We have recession readiness and then we have recession planning. The pandemic is going to result in a lot of changes in the risk landscape and will change the risk playbook for the entire customer life cycle.

As an example, understanding risk by industry has always been important, but now people will pay special attention to industry subcategories. Knowing whether a business is a restaurant is no longer enough; you need to know if they deliver or if they partner with a delivery service. If a business is a retailer, you need to know if they have a brick and mortar presence or if they’re only online.

Also, data freshness becomes increasingly important. You cannot rely only on bureaus; you need more real-time data like bank data, processing data, or other alternative data, as that captures recency.

I don't think the pandemic is going to cause a big tectonic shift in terms of the way SMB lending is done, but there will be a major shift in the players that will remain in the market.

How do you think the landscape of small business lending institutions (including fintechs and large banks) will change in the months and years to come?

I think we will see consolidation, but remember, consolidation can happen in multiple ways. Consolidation can happen in the form of someone going out of business, or consolidation can happen when a company becomes a very attractive target for acquisition.

Some of these fintechs have great products, great teams. Larger competitors might want to use this opportunity to get a good bargain and acquire them. Each and every one of these bigger financial institutions will be keeping a close watch. The question is how willing are these institutions to part with cash right now?

Also, the PPP (Paycheck Protection Program)* can be meaningful and make the difference in whether a SMB fintech survives or doesn’t. If a fintech does manage to get a chunk of the PPP commission money, it is going to help them survive. If they don't, there could be a high likelihood that they might regress far behind where they were before this recession.

Another variable in terms of a fintech’s survival is funding — when they were funded, who they were funded by, and how well funded they are.

Not to mention, an additional key factor would be the default rates. Not every institution has the same default rates, and the extent of that impact would play a very important part.

Has your thinking around small business risk attributes and data points shifted? If so, how?

My thinking around SMB risk attributes and data points has evolved. Need of alternative data or alternative bureau data, along with smart capabilities, has taken the center stage in this pandemic-driven recession. How does someone assess the extent of revenue loss, or when the PPP check or stimulus is getting deposited? Every institution suddenly has started appreciating the need of having robust KYC/KYB capability for PPP. Document fraud suddenly has come to the forefront.

I'll give you a specific example of a data attribute that’s evolving: Industry. Some financial institutions will have a knee-jerk reaction and will stop lending to restaurants as they are highly impacted. Other institutions that have access to better data would figure that restaurants are impacted, but some restaurants’ business is only down 25-30%, not even 50%. And guess why? The number of online orders has shot up.

So, yes, people are not doing fine dining, but these people - or a portion of them - are ordering in. So as that dine-in portion of the revenue is going down, take out orders have significantly gone up to offset some of the lost revenue.

This just goes to show that even if you've identified an industry, such as the restaurant industry, that categorization alone is not enough. The next level of detailed categorizations, such as does a restaurant deliver or is it enrolled with any delivery partners, becomes important. The data - or more accurately, the granularity of data - becomes important.

What advice would you give to financial institutions thinking about their risk now?

The time to act is now. For existing customers, think of smartly reducing contingent liability. For new customers, weigh risk reward with a new lens. You need to quickly change your risk policies, and procedures. Models that worked perfectly before may not be as accurate or viable. As a new model cannot be changed, tested, approved, and implemented so quickly, you have to resort to smart risk policy and strategy changes.

*Editor's note: As of 4/16, the initial $349B in funding allocated for PPP rescue loans had run out. As of 4/22, another infusion of funding is expected to be approved in the next few days.

Financial Resources to Help Small Businesses Weather COVID-19

Madeline Ross — Mon, 30 Mar 2020 00:00:00 GMT

In the wake of COVID-19, small businesses are facing unprecedented challenges. As part of our effort to support the SMB community, Enigma has aggregated a list of financial resources to help small businesses from the fintech community and beyond.

This list was last updated April 6, 2020 and will be updated as new resources become available.

Small business lender Kabbage has launched a platform where any business can sign up to sell gift certificates online, and anyone can purchase them to support participating small businesses.
Hello Alice is offering $10,000 grants to be distributed immediately to small business owners.
Many states are allowing deferments of tax payments and/or extensions for filing. The AICPA is keeping an updated list of state changes here.
The federal government has announced that individuals who owe $1 million or less and corporations that owe $10 million or less will have an extra 90 days to pay their 2019 taxes.
The SBA is offering low-interest disaster assistance loans for up to $2M for small businesses affected by COVID-19. You can find a guide to their loan resources for small businesses here.
Facebook is rolling out a $100M grant program for small businesses. You can sign up to be notified when applications open here.
Yelp and GoFundMe are partnering to enable local businesses to receive donations directly on their Yelp pages, as well as matching up to $1 Million in donations.
Fundera has put together this helpful list of state-by-state resources to help small businesses affected by the coronavirus.
Square has a resources page that includes suggestions for how small businesses can safely operate in the wake of new restrictions.
Lendio has published a guide to SBA COVID-19 loans.
Toast has created a multi-chapter guide for restaurants to manage during COVID, and plans to release a weekly newsletter including a round-up of relevant articles, tips, and case examples of restaurants that are managing.
Gusto has put together an employer’s guide to navigating coronavirus.
Aspen Tech Policy Hub of the Aspen Institute has announced a challenge grant: "In light of the ongoing COVID-19 pandemic and the need for civic technologists to build new tools and policy solutions to support communities in need, we thought it was imperative to pilot this model and announce one such opportunity now."
Crunchbase has published a list of how sales and marketing teams can use their resources during COVID-19.

Are you a financial institution trying to better serve your small business customers through this period? At Enigma, we’ve built one of the most comprehensive sources of alternative data on small businesses, and the infrastructure to rapidly develop new models and insights.

If you’re looking for alternative data to better serve your small business customers, let us know.

Why You've Never Had Accurate Small Business Data Until Now

Kinsey Sullivan — Thu, 26 Mar 2020 00:00:00 GMT

Until now, most data about small businesses has been of poor quality; the data is typically outdated, inaccurate, and sparse in coverage.

Bad small business data presents a big problem. Without fresh, accurate data, financial service providers can’t understand their customers and thus can’t provide solutions that serve the varying needs of small businesses. This leads to less capital distribution, burdensome customer experiences, and limited product choices for small businesses, which represent close to 50% of US private-sector GDP.

In our data-driven world, why is it still so difficult to get accurate data about small businesses?

There are many reasons why getting accurate data about small businesses has been difficult, but they can be distilled down to three key challenges: data freshness, data accessibility, and the tremendous diversity of small businesses.

Data Freshness

Small business data is often outdated, sometimes by as much as a year. Small businesses are also dynamic -- some grow or fail so quickly that the information they report is not accurate for very long. Small businesses also face fewer regulations, so they have fewer reasons to report data. Data latency is also exacerbated by long lag times from both public and private data providers, who do not always update their data frequently.

Data Accessibility

Accessing small business data presents challenges; reporting requirements vary state-to-state and business-to-business. Generally, small businesses are required to report far less information than larger businesses. Even when reporting is required, public data is not always digitized for easy consumption.

Diversity of small businesses

There is so much diversity within the 30 million small businesses in the US, yet they’re often grouped as one category. Consider how different a new software startup is from a local grocer with 50 employees, and how these differences would be reflected in each company’s respective data footprint. Small business diversity results in significant variation in terms of what data is – or isn’t – reported by each small business.

All of these challenges are particularly acute for new businesses, those with less than $5M in revenue, and sole-proprietorships.

Enigma has developed expertise in addressing the challenges of data about small businesses, and recently released a new reliable and easily accessible small business data resource.

“We recognize that all small businesses are not the same,” said Craig Danton, Chief Data Officer at Enigma. “We’re building a data set that recognizes those distinctions. We focus on finding more sources of data and keeping them fresher, so you can find information about even the smallest businesses.”

To provide fresh, accurate profiles of small businesses, Enigma draws upon thousands of public, online, and private sources – including new, alternative sources that have never been leveraged for small business intelligence before. Our data processing infrastructure minimizes data latency and enables rapid updates to ensure our data stays as close to source truth as possible.

Reliable small business data is possible, and now it’s available to you. Instantly access our small business data by exploring Enigma’s Businesses API, which provides a range of free and pay-per-call data points.

3 Ways to Use Enigma’s Small Business Data

Juliana Sullam — Tue, 24 Mar 2020 00:00:00 GMT

This week, Enigma released our Businesses API. Our new product makes reliable data about the 30M small businesses in the U.S. accessible for the first time.

Through industry-leading data science and proprietary machine learning, we’ve transformed thousands of online and offline data sources into a single point of integration for all of the alternative small business data and insights you need. Our data details everything from contact information to corporate registrations, to industry classification and more, enabling you to deeply understand your small business customers.

At a time when reliable small business data matters more than ever, Enigma’s data is designed to help organizations service and support small businesses more effectively.

Getting started with our Businesses API is simple. Just create an account and you’ll get free access to unlimited use of our standard data.

There are many processes across both growth and risk-oriented initiatives that benefit from our data. To get you started, we’ve outlined three ways you can use our small business data:

Enrich businesses in your database

Enrich incomplete SMB leads within your database or CRM by integrating Enigma’s data to build complete profiles of every small business lead we match in your database.

Segment leads more effectively

Use Enigma’s data to segment and target small business leads using new and more accurate dimensions. Predictive insights also enable you to prequalify leads based on your custom criteria.

Tailor your outreach

Develop more personalized campaigns with relevant offerings and tailored messaging based on all of the new information you have about your small business leads and list segments.

These are just a few ways in which our data can make your small business processes smarter. Interested in learning more about our data? Get instant access today by creating a free account.

Reliable Small Business Data Matters Now More Than Ever

Juliana Sullam — Mon, 23 Mar 2020 00:00:00 GMT

Enigma’s mission is to engineer the most reliable and accessible source of data about small businesses.

Today, we’ve taken a big step towards realizing this mission with our release of the first free, accurate, and easily accessible data source about U.S. small businesses, with a particular focus on those with $5M in annual revenue or less. This data has been in development for some time as we’ve long heard from financial services providers that the lack of reliable small business data has, to date, left SMBs both underserved and undercapitalized.

Small businesses are often cited as the backbone of the American economy, and yet, because of poor, often sparse data, they’ve not been well understood, even by the organizations that serve them. While we most often talk about small businesses as a group, the reality is that each one has its own story – from a local independent bookstore to a tenured manufacturing company to a rapidly scaling venture-backed startup – and many of them have suffered as a result of the significant information gap that’s existed for too long.

Right now, small businesses in the United States face tremendous challenges related to COVID-19. While many things remain unclear, it’s undeniable that small businesses will require more support than ever before. To provide it, the organizations that serve small businesses will need fresh, accurate information to navigate this new economic climate. That information must let them accurately identify small businesses and their industry-specific nuances, so they can rethink what services, products, and capital SMBs really need.

To deliver the most reliable and accessible data on small business, Enigma leverages a vast range of public, online, and private data sources. We’ve honed our data processing infrastructure to minimize latency, ensuring our data updates rapidly, stays as close to source truth as possible, and precisely matches every business to a unique and permanent Enigma ID. Our R&D processes enable our data scientists to acquire new data and extract novel and predictive insights within days. This agility will be essential as financial services providers seek new intelligence to respond to evolving small business needs.

It’s easy to start using our data. Whether it’s to get accurate industry classifications for your entire small business lead base or to verify a business against fraud, you can sign up for an account and start building with our API for free. More advanced data points and predictive insights are available on a pay-per-attribute basis. If you’re curious to see what our data is like, look up your favorite small business right here.

What is Small Business Data?

Kinsey Sullivan — Mon, 09 Mar 2020 00:00:00 GMT

Successful marketers know that high-quality leads or prospect data is the foundation of any effective strategy.

In this blog, you’ll learn what small business data is and why it’s useful for B2B marketing.

But first, what is small business data?

While definitions of small businesses can vary, small business data is typically defined as information about businesses that have less than $50M in sales and fewer than 500 employees. Small business data encompasses a range of data points, all of which help you better understand a given company.

At Enigma, we focus on small businesses with less than $5M in annual sales. Nearly 90% of all small businesses fall into this category, but data about these businesses has historically been sparse and inaccurate.

Small business data includes contact and business information, firmographic data, and registration details. This data is useful throughout the customer lifecycle, from marketing to sales to applicant evaluation to underwriting.

Contact and business information, such as name, phone number, aliases, and website, help marketers connect with the small business and its employees.

Firmographic data focuses on the business itself and is essential for personalization. Examples of firmographic data include a business’s industry, industry NAICS code, years in business or date founded, and corporate structure.

Marketers also benefit from information on registration state(s), registration date(s), and registration file ID to determine if the small business is verifiable and currently in business.

Other types of small business data may appeal to B2B marketers in specific industries.

For example, B2B financial services marketers may use small business tax liens or UCC filings to understand the risk profiles of small businesses. This data may help marketers better qualify and segment their lists based on the prospect’s eligibility for products and services.

Small business data is available through a variety of government sources, such as the US Census, state corporate registrations, and UCC filings, as well as through private providers.

Small business data can be hard to track down, but Enigma has combined thousands of online and offline sources into an accessible and reliable small business database.

See for yourself how easy it is to look up small business data with Enigma - create a free account for API access.

3 Best Practices for Direct Mail Marketing to Small Businesses

Kinsey Sullivan — Tue, 03 Mar 2020 00:00:00 GMT

40 percent of Americans look forward to checking the mail every day, even though nearly 60% of mail the average household receives is marketing mail.

This suggests that Americans like getting the mail in part because of the marketing, rather than in spite of it.

Even B2B buyers respond well to direct mail. Yet many B2B marketers seem to ignore the affinity for direct mail; on average, B2B marketers allocate just 9% of their budget to direct marketing.

These statistics paint a clear picture: B2B direct mail marketing is an underused strategy.

In this blog, you’ll learn three tips for how to do B2B direct mail marketing to small businesses the right way.

To be successful in your next B2B direct mail marketing effort, here are three best practices to keep in mind:

Best Practice 1: Keep It Tailored

In direct mail marketing, personalization is key. It’s attention-grabbing, and more importantly, it helps build trust. Highly personalized direct mail can even outperform targeted digital marketing.

“We've found open rates in excess of 95% for highly personalized snail mail. That sure beats a 15% email open rate,” said Scott Potash, co-founder of leading direct mail company Postable.

Remember, personalization is about more than just the recipient's name or business name.

“When developing personalization for B2B direct mail we try to go beyond the name, company and address,” said Mike Gunderson, President of industry heavyweight Gunderson Direct. "Tailoring copy to a specific industry or job title can make all the difference to get the prospect’s attention.”

Best Practice 2: Make a Splash

Direct mail attracts more attention and results in higher brand recognition than other marketing, but you still only have a few seconds to grab your recipient’s attention.

“People are bombarded with so much digital marketing these days that it takes something unique to cut through the clutter,” said Potash. “That's where we think personalized direct mail comes in - if the piece looks great, speaks to your brand, and is personalized to your customer, you'll get their attention.”

The average response rate for direct mail is between 5 and 9 percent; to compare, the average email response rate is just 1%.

Postcards are a winning tactic. Because they don’t come in an envelope, virtually all postcards are read. Even better, 50% of recipients say they find postcards useful.

Best Practice 3: Use High-Quality Lead/Prospect Data

The data you use in your direct mail campaigns needs to be accurate.

Imagine spending thousands of dollars on a mailer, only to have a third returned because the addresses were wrong. Even worse, misspellings or inaccurate titles can create a bad impression and cause your prospects to lose trust.

On the other hand, fresh, reliable data means you can target your prospects effectively.

“For multi-touch, multi-channel personalized campaigns, it's not uncommon to see response rates in the 8-12% range,” explained Dennis Kelly, CEO at data-focused direct mail company Postalytics. “We're seeing particularly strong performance with campaigns that use automation to trigger direct mail at key steps in the customer journey, including nurture, onboarding and re-engagement campaigns. Properly organized and verified data is where it all starts. Combine that data with the merge tag and dynamic content ability of online direct mail templates, and marketers can deliver highly targeted and relevant messaging via physical media, at scale.”

At Enigma, we understand the importance of quality data for building relationships. That’s why we empower companies to connect with small businesses by providing accurate, up-to-date small business data - for free.

Done right, direct mail is a powerful, modern marketing strategy. The real key to success? Combining powerful, personalized messaging with high-quality data.

Sanctions Screening Benchmarks: Alert Volumes, False Positives, and the Push Toward Machine Learning

Enigma — Sun, 15 Sep 2019 00:00:00 GMT

How does your sanctions screening program compare to your peers? For most compliance leaders, that question goes unanswered. Cross-institutional benchmarks are scarce, and without them it is nearly impossible to know whether your alert volumes are typical, your false positive rates are acceptable, or your roadmap is pointed in the right direction.

This article summarizes findings from Enigma's State of Screening survey, which collected responses from 36 sanctions and AML program leaders at financial institutions ranging from community banks to institutions with tens of millions of customers. The data paints a clear picture: alert volumes are climbing, false positives remain stubbornly high, regulatory scrutiny is intensifying, and machine learning has moved from experiment to mainstream consideration.

Note: The data in this report comes from a fall 2019 survey. The specific percentages reflect conditions at that point in time, but the structural challenges described — high false positive rates, alert volume strain, and regulatory pressure — remain central concerns for compliance teams today.

Who Responded to This Survey

The survey reached 36 compliance leaders across a range of institution sizes and roles.

Institution Size by Customer Base

Customer Base	Number of Respondents
Under 1 million	18
1M – 5M	3
5M – 10M	4
10M – 20M	1
20M – 30M	4
30M – 40M	6

Half of respondent institutions have fewer than 1 million customers. Roughly 28% (10 of 36) have 20 million or more, so the findings span both community institutions and large financial firms.

Respondent Roles

Title	Number of Respondents
Chief Compliance Officer	17
(Global) Head of Financial Crimes Compliance	8
BSA/AML/OFAC Officer	3
Regional Head of Financial Crimes Compliance	2
Chief Risk Officer	1
Other	5

Chief Compliance Officers and Chief Risk Officers together represent 50% of respondents. The rest come from financial crimes compliance and AML leadership roles. Roughly half of all respondents employ more than 100 full-time employees dedicated to sanctions, PEP, and negative news screening and investigations.

Key Findings at a Glance

42% of institutions are experiencing higher sanctions alert volumes than a year ago
76%+ false positive rates reported by more than half of respondents across both transaction and customer screening
70% of institutions agree they have experienced increased regulatory scrutiny of their sanctions compliance program
36% strongly agree their program experienced increased regulatory scrutiny
81% of institutions are exploring, ready to deploy, or already using machine learning for sanctions screening
17% have already built or deployed machine learning capabilities

Alert Volumes Are Getting Worse, Not Better

The most immediate problem compliance programs face is volume. Alert volumes are primarily one-directional: 42% of institutions report higher sanctions alert volumes — covering both customer and transaction screening — compared to one year prior. Only 14% of institutions definitively reduced their sanctions alerts in the same period.

That means institutions have invested in alert reduction, yet nearly half are still worse off year over year.

Transaction screening generates the most strain. Nearly half of institutions report transaction screening alert rates of 11% or higher. Put that in context: 25% of institutions process at least 1 million transactions annually, and 33% process more than 10 million. At those volumes, an 11% alert rate translates to hundreds of thousands — potentially millions — of sanctions alerts requiring review each year.
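To make that arithmetic concrete, here is a minimal sketch. The two volumes are the survey's bucket thresholds used as illustrative inputs, not any institution's actual figures, and `annual_alerts` is a hypothetical helper name.

```python
def annual_alerts(transactions_per_year: int, alert_rate: float) -> int:
    """Estimated number of sanctions alerts needing review in a year."""
    return round(transactions_per_year * alert_rate)


# Illustrative volumes: the 1 million and 10 million thresholds cited above.
for volume in (1_000_000, 10_000_000):
    alerts = annual_alerts(volume, 0.11)
    print(f"{volume:,} transactions at an 11% alert rate: {alerts:,} alerts")
```

At 1 million transactions the review queue is already in the hundreds of thousands; at 10 million it passes one million.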

Transaction Screening Alert Rates

Alert Rate	Number of Respondents
Under 5%	10
5% – 10%	11
11% – 15%	10
16% – 25%	3
Over 25%	2

Customer screening alert rates trend lower, with 44% of respondents reporting alert rates of 3–5% or higher — still substantial given the scale of most customer bases.

False Positive Rates Reveal a Deeper Accuracy Problem

High alert volumes become a crisis when combined with high false positive rates. Across both transaction and customer screening, the data shows false positives are not a marginal issue — they are the norm.

For transaction screening:

53% of respondents report false positive rates of 76% or higher
More than a quarter of respondents (10 of 36) find 91% or more of alerts to be false positives after an initial review

Transaction Screening False Positive Rates

False Positive Rate	Number of Respondents
Under 50%	11
50% – 75%	6
76% – 85%	4
86% – 90%	5
91% – 93%	2
94% – 97%	4
Over 97%	4
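
The headline share can be recomputed directly from the respondent counts in the table. The sketch below is only a worked check of that arithmetic; `share_from` is a hypothetical helper, and the bucket labels use plain hyphens.

```python
# Respondent counts per false positive bucket, copied from the table above.
FP_BUCKETS = [
    ("Under 50%", 11),
    ("50% - 75%", 6),
    ("76% - 85%", 4),
    ("86% - 90%", 5),
    ("91% - 93%", 2),
    ("94% - 97%", 4),
    ("Over 97%", 4),
]


def share_from(buckets, first_label):
    """Share of respondents in the bucket named first_label and all higher buckets."""
    labels = [label for label, _ in buckets]
    start = labels.index(first_label)
    total = sum(count for _, count in buckets)
    return sum(count for _, count in buckets[start:]) / total


# 19 of 36 respondents fall in the 76%-and-above buckets.
print(f"{share_from(FP_BUCKETS, '76% - 85%'):.0%} report false positive rates of 76% or higher")
```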

For customer screening, the picture is comparable: 50% of respondents report false positive rates of 76% or higher.

Half of all respondents find more than 76% of their alerts to be false positive after an initial "level 1" review. These numbers do not indicate edge cases in otherwise functional programs — they reveal deep accuracy challenges baked into current screening processes.

The staffing implications are direct. To manage alert rates, 61% of program leaders plan to hire full-time employees or bring in external contractors in 2020. Alert volume is not just a technology problem; it is driving real cost and headcount decisions.

Regulatory Scrutiny Is Rising Across All Institution Sizes

Alert volumes exist in an environment of sharpening regulatory attention. Nearly 70% of institutions agree they have experienced increased regulatory scrutiny of their sanctions compliance program over the past year, and 36% of all respondents strongly agree.

The breakdown of responses:

Agreement Level	Share of Respondents
Strongly agree	36%
Agree	36%
Somewhat agree	19%
Neither agree nor disagree	6%
Disagree	3%

This pressure is not limited to large institutions. The survey found no correlation between institution size and perceived regulatory scrutiny — all sizes reported feeling this strain. Notably, nearly all of the institutions planning to add full-time employees in the coming year also identified as experiencing greater regulatory pressure.

The regulatory context helps explain the urgency. The Office of Foreign Assets Control issued fines totaling a record $1.3 billion in 2019. As one survey respondent put it: "Compliance functions are certainly undergoing increased scrutiny from regulators, particularly in the light of many scandals at larger financial institutions."

Growing alert volumes may themselves reflect heightened regulatory pressure. Institutions may be widening screening thresholds as a way to demonstrate the rigor of their programs, accepting higher false positive rates as the price of demonstrating thoroughness to examiners.

Machine Learning Has Moved From Fringe to Mainstream Consideration

Program leaders are looking to technology — specifically machine learning — as the primary lever for addressing false positives and alert volume. 81% of respondents are somewhere on the machine learning adoption spectrum.

Machine Learning Adoption Stages

Stage	Share of Respondents
Have deployed home-grown machine learning	8.5%
Have deployed vendor-provided machine learning	8.5%
Ready to build or buy machine learning	8.5%
Educating self, team, and/or organization on ML as a possible solution	47%
Have not considered machine learning	19%

The largest group — 47% of respondents — is actively educating themselves and their organizations about machine learning as a solution. An additional 8.5% say they are ready to build or buy. Combined with the 17% who have already deployed ML capabilities (home-grown or vendor-provided), this represents broad momentum toward adoption.

Multiple program leaders listed machine learning as the single change they would make to their current sanctions compliance program or technology. Several specifically cited the desire to use machine learning in initial screening to reduce false positive rates and lower alert rates.

This appetite has regulatory backing. In December 2018, five regulatory agencies including FinCEN released a joint statement encouraging financial institutions to adopt more advanced technologies: "These innovations and technologies can strengthen BSA/AML compliance approaches, as well as enhance transaction monitoring systems. The Agencies welcome these types of innovative approaches to further efforts to protect the financial system against illicit financial activity."

Despite the clear interest, only 17% of institutions have actually deployed machine learning — meaning most institutions that reported higher alert volumes over the past year have not yet benefited from ML-powered accuracy improvements. Alert volumes and related strain are likely to ease materially as more institutions successfully deploy these capabilities.

What the Benchmarks Tell You

The combined picture from this survey is one of significant strain. Screening program leaders face increasing alert volumes, persistently high false positive rates, and intensifying regulatory attention — all at the same time.

The shift toward machine learning reflects where the industry believes relief will come from. Reducing false positives at the initial screening stage addresses alert volume, staffing cost, and regulatory demonstrability in a single move.

For program leaders, these benchmarks offer a few practical points of reference:

If your transaction screening false positive rate is above 76%, you are in the majority — but that does not mean it is acceptable, and peers are actively pursuing ML to close the gap.
If your alert volumes are rising, you are not alone. Only 14% of institutions reduced alerts year-over-year, even with investment in reduction efforts.
If you are under increased regulatory scrutiny, so are nearly 70% of your peers, regardless of institution size.
If you have not yet deployed machine learning, the 81% of institutions actively exploring or implementing it suggest the window for early-mover advantage is narrowing.

Reduce Sanctions Alert Volume Without Increasing Risk

Enigma's approach to sanctions screening is built around the accuracy problem at the root of these benchmarks. Better data leads to higher-quality matches, which means fewer false positives to review, lower alert volumes, and a more defensible program posture with regulators.

Learn more about how Enigma helps compliance teams improve screening accuracy on our KYB product page, or contact us to discuss your program's specific challenges.

How We Solved Our Airflow I/O Problem By Using A Custom Docker Operator

Shuo Cheng — Tue, 13 Aug 2019 00:00:00 GMT

Airflow is a useful tool for scheduling ETL (Extract, Transform, Load) jobs. Airflow runs DAGs (directed acyclic graphs) composed of tasks. Each task is an instance of a Python class called an operator, which lets users run tasks across different technologies. Airflow ships with a comprehensive suite of standard operators for running Python scripts, executing SQL queries against common databases, and starting Docker containers, among other things. The standard operators can be found here. At Enigma, we use Airflow to run data pipelines supplying data to Enigma Public.

On my team at Enigma, we build and maintain several data pipelines using Airflow DAGs, some of which use DockerOperator to spin up Parsekit (an internal parsing library) containers. In several of these pipelines, we tweaked the Docker Operator to make up for some shortcomings. As a reminder, DockerOperator takes in the image name, volumes, environment variables, Docker url among other arguments, and spins up the specified container. You can think of it as Airflow’s API to running Docker containers as opposed to the CLI. And like the CLI command, there’s no standard method to pass in inputs and extract outputs. This article will show you how to build a custom Airflow Operator to do the following:

Supply JSON input into the Docker Container
Extract file outputs (XLSX, CSV, etc) from within the Docker Container
Operate on multi-worker Airflow deployments

Starting Out

We don’t want to reinvent the wheel here, so we’re going to start our class by inheriting from Airflow’s DockerOperator. DockerOperator takes care of supplying arguments necessary to run the container and starts up the container.

<div class="code-wrap"><code>from airflow.operators.docker_operator import DockerOperator


class JsonIoOperator(DockerOperator):
    def __init__(self, input_task_id, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # task_id of the upstream task whose return value becomes our input
        self.input_task_id = input_task_id</code></div>

Setup

We need a way to pass input into the container. Ideally, the input comes from upstream tasks. In our case, almost all tasks are Python operators. The return value of a PythonOperator is stored in Airflow XCOM by default, so downstream tasks can retrieve it by passing the upstream `task_id` to `xcom_pull`. To get the input, the invoker must pass in the upstream task’s `task_id` when instantiating the JsonIoOperator. To support this, we override DockerOperator’s `__init__` function and accept an additional argument, `input_task_id`.

<div class="code-wrap"><code>from airflow.operators.docker_operator import DockerOperator


class JsonIoOperator(DockerOperator):
    def __init__(self, input_task_id, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.input_task_id = input_task_id

    def execute(self, context):
        # pass input logic goes here

        # setup output logic goes here

        # run the container
        super().execute(context)

        # load output into Airflow logic goes here</code></div>

Overriding Execute

The execute function is where most of our code lives. We will override the default execute function so we can add I/O logic before and after running DockerOperator’s default execute function.

Our input is a small JSON string. If the input is large (> 1 MB), we want to supply a file path instead. In standard deployments of Airflow with multiple worker hosts, the file path must exist on a shared storage location such as NFS or S3; we assume such storage is available. We will use shared storage later to pass outputs from this task to downstream tasks.
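
The size-based choice described above could look roughly like the sketch below. It is an illustration under stated assumptions, not code from our pipelines: `prepare_container_input`, `MAX_INLINE_BYTES`, and the `CONTAINER_INPUT_PATH` variable are hypothetical names, and the written file would still need to be mounted into the container before the path is usable there.

```python
import json
import os
import tempfile

MAX_INLINE_BYTES = 1_000_000  # assumed cutoff, mirroring the ~1 MB rule of thumb


def prepare_container_input(payload, shared_dir_path):
    """Return environment variables that hand `payload` to the container.

    Small payloads travel inline as JSON; large ones are written to shared
    storage and only their path is passed along.
    """
    serialized = json.dumps(payload)
    if len(serialized.encode("utf-8")) <= MAX_INLINE_BYTES:
        return {"CONTAINER_INPUT": serialized}
    # mkstemp creates the file on shared storage and leaves it in place.
    fd, path = tempfile.mkstemp(suffix=".json", dir=shared_dir_path)
    with os.fdopen(fd, "w") as f:
        f.write(serialized)
    return {"CONTAINER_INPUT_PATH": path}  # hypothetical variable name
```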

<div class="code-wrap"><code>import json
import tempfile

from airflow.operators.docker_operator import DockerOperator


class JsonIoOperator(DockerOperator):
    def __init__(self, input_task_id, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.input_task_id = input_task_id

    def execute(self, context):
        # pass input logic goes here
        # pull the upstream task's return value from XCOM
        input_data = self.xcom_pull(task_ids=self.input_task_id, context=context)
        # serialize it into an environment variable the container can read
        self.environment['CONTAINER_INPUT'] = json.dumps(input_data)

        # setup output logic goes here

        # run the container
        super().execute(context)

        # load output into Airflow logic goes here</code></div>

Grabbing Input From Other Tasks

Reading in upstream data is easily done using the operator’s `xcom_pull` method, which DockerOperator inherits from BaseOperator. It takes the task context and the upstream `task_id`.

To pass the JSON, we have two options: environment variables, and volumes. In my use case, because I have low complexity JSON without special characters, I’m going to serialize the JSON into a string and then set it as an environment variable `CONTAINER_INPUT`. The container process is responsible for reading the environment variable and using it. For more complex inputs, we would want to mount the input file (via shared storage) in and point the container to it via environment variables.
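
On the container side, reading the variable takes only a few lines. The sketch below assumes the container's entrypoint is a Python process; `read_container_input` is a hypothetical helper, and only the `CONTAINER_INPUT` name comes from the operator.

```python
import json
import os


def read_container_input(environ=os.environ):
    """Decode the JSON that the operator serialized into CONTAINER_INPUT."""
    raw = environ.get("CONTAINER_INPUT")
    if raw is None:
        raise RuntimeError("CONTAINER_INPUT is not set")
    return json.loads(raw)
```

Accepting `environ` as a parameter keeps the function easy to exercise outside a container, for example `read_container_input({"CONTAINER_INPUT": '{"source": "demo"}'})`.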

Side note: The task instance context dictionary contains several useful functions and attributes. Here’s a gist listing those out.

<div class="code-wrap"><code>import os
import json
import tempfile

from airflow.operators.docker_operator import DockerOperator


class JsonIoOperator(DockerOperator):
    def __init__(
            self,
            input_task_id,
            shared_dir_path,
            output_dir_path='/tmp/output/',
            *args,
            **kwargs):
        super().__init__(*args, **kwargs)
        self.input_task_id = input_task_id
        self.shared_dir_path = shared_dir_path
        self.output_dir_path = output_dir_path

    def execute(self, context):
        # pass input logic goes here
        input_data = self.xcom_pull(task_ids=self.input_task_id, context=context)
        self.environment['CONTAINER_INPUT'] = json.dumps(input_data)

        # setup output logic goes here
        # tell the container where to write its output
        self.environment['OUTPUT_DIR'] = self.output_dir_path
        tmp_dir = tempfile.TemporaryDirectory(dir=self.shared_dir_path)
        tmp_dir_path = tmp_dir.name

        # appending volume: host temp dir mounted read-write at output_dir_path
        volume = "{}:{}:rw".format(tmp_dir_path, self.output_dir_path)
        self.volumes.append(volume)

        # run the container
        super().execute(context)

        # load output into Airflow logic goes here</code></div>

Setting Up Output

To make container output accessible to downstream tasks, we will mount a shared NFS directory from the host into the container. NFS allows all workers to access the same storage. The base path of this directory should be passed in as the `shared_dir_path` argument.

We will create a temporary directory within `shared_dir_path` and mount that folder into the container’s `output_dir_path`. `output_dir_path` can be specified by the user. `OUTPUT_DIR` should be read in by the container’s main process and used to write outputs to.

<div class="code-wrap"><code>import os
import json
import tempfile

from airflow.operators.docker_operator import DockerOperator


class JsonIoOperator(DockerOperator):
    def __init__(
            self,
            input_task_id,
            shared_dir_path,
            output_dir_path='/tmp/output/',
            *args,
            **kwargs):
        super().__init__(*args, **kwargs)
        self.input_task_id = input_task_id
        self.shared_dir_path = shared_dir_path
        self.output_dir_path = output_dir_path

    def execute(self, context):
        # pass input logic goes here
        input_data = self.xcom_pull(task_ids=self.input_task_id, context=context)
        self.environment['CONTAINER_INPUT'] = json.dumps(input_data)

        # setup output logic goes here
        self.environment['OUTPUT_DIR'] = self.output_dir_path
        # mkdtemp (unlike TemporaryDirectory) does not delete the directory
        # when this method returns, so downstream tasks can still read it
        tmp_dir_path = tempfile.mkdtemp(dir=self.shared_dir_path)

        # appending volume
        volume = "{}:{}:rw".format(tmp_dir_path, self.output_dir_path)
        self.volumes.append(volume)

        # run the container
        super().execute(context)

        # load output into Airflow logic goes here
        # returns path where output files are written to
        return tmp_dir_path</code></div>

Loading Output Into Airflow

After output is written to NFS by the container process, we just return the directory path. Downstream tasks will access the files by reading in the directory path.

Temporary directory cleanups: At the end of the DAG, there should be a cleanup task which deletes all temporary output directories created inside the NFS.

<div class="code-wrap"><code>import os
import json
import tempfile

from airflow.operators.docker_operator import DockerOperator


class JsonIoOperator(DockerOperator):
    def __init__(
            self,
            input_task_id,
            shared_dir_path,
            output_dir_path='/tmp/output/',
            *args,
            **kwargs):
        super().__init__(*args, **kwargs)
        self.input_task_id = input_task_id
        self.shared_dir_path = shared_dir_path
        self.output_dir_path = output_dir_path

    def execute(self, context):
        # pass input logic goes here
        input_data = self.xcom_pull(task_ids=self.input_task_id, context=context)
        self.environment['CONTAINER_INPUT'] = json.dumps(input_data)

        # setup output logic goes here
        self.environment['OUTPUT_DIR'] = self.output_dir_path
        files = {}

        # using a context manager to avoid explicit cleanup code;
        # the directory is deleted when the with block exits
        with tempfile.TemporaryDirectory(dir=self.shared_dir_path) as tmp_dir_path:
            # appending volume
            volume = "{}:{}:rw".format(tmp_dir_path, self.output_dir_path)
            self.volumes.append(volume)

            # run the container
            super().execute(context)

            # load output into Airflow logic goes here
            # read every output file into memory before the directory is removed
            for filename in os.listdir(tmp_dir_path):
                filepath = os.path.join(tmp_dir_path, filename)
                with open(filepath, 'rb') as f:
                    files[filename] = f.read()

        # the returned dict is stored in XCOM
        return files</code></div>

Alternative: Use XCOM to Load Output Into Airflow

When the output is small and simple, the following method provides an alternative and loads the output directly into Airflow’s XCOM. This approach is brittle and not recommended, but it is useful in certain scenarios where the output is small. Keep in mind XCOM is a table within Airflow’s database, so all output is stored there. As more and more DAG runs occur, the database will grow in size, necessitating regular cleanup DAGs to remove Airflow metadata; how often they need to run depends on how fast the database fills up.

<div class="code-wrap"><code>from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from operators.json_io_operator import JsonIoOperator

dag = DAG(...)

input_task_id = 'python_task'

input_task = PythonOperator(
    task_id=input_task_id,
    # ... remaining arguments elided
)

dockerTask = JsonIoOperator(
    docker_url='unix:///var/run/docker.sock',
    image=...,
    volumes=...,
    environment=...,
    task_id='docker_task',
    dag=dag,
    output_dir_path=...,  # location within container output will be written to
    shared_dir_path=...,  # your NFS dir or S3 location
    input_task_id=input_task_id,
)</code></div>

Usage

Where to place the operator?

airflow
    dags
        operators
            json_io_operator.py
        Mydag.py

Usage is similar to DockerOperator with the addition of three more arguments: `output_dir_path`, `shared_dir_path` and `input_task_id`. You can check out the DockerOperator docs here.

Mydag.py

Collect Training Data Using Amazon SageMaker Ground Truth & Figure Eight

Ya Zhu — Wed, 07 Aug 2019 00:00:00 GMT

Training data, a.k.a. ground truth data, includes both observations and their corresponding outcomes, and is the prerequisite for building supervised machine learning models. The quality and quantity of the training data often have a great impact on the resulting models, but large-scale, high-quality training data is not always easy to obtain, because it sometimes requires humans to annotate the outcome or label of each data record manually. Things become even harder when the labeling task is not as straightforward as distinguishing dogs from cats.

Enigma has been working on entity resolution using machine learning models. We used to collect the training data for the models by kicking off labeling tasks internally. We did it this way because the data used for general entity resolution problems may not fit our product needs, and people who label the data at Enigma usually have the best domain knowledge of the entities we’re resolving. However, the labeling process costs a lot of time and human effort, and is hard to scale. We recently decided to try out some labeling platforms such as Amazon SageMaker Ground Truth and Figure Eight to help us scale our collection efforts. This post introduces how these platforms work, describes the preparation and post-processing we did to complete the training data collection on them, and shares some tips and takeaways. By the end of this post, you will know:

Which platform is right for you
How to create a successful labeling job

Overview of labeling platforms

The two labeling platforms we work with are Amazon SageMaker Ground Truth and Figure Eight. Both allow users to upload unlabeled datasets along with instructions for how the data should be labeled; the platform then launches the labeling job and distributes it to the human labelers on its network. Users can determine the number of times each record should be labeled (with different prices). For example, if the number is 3, each record in the dataset will be labeled by 3 different labelers. Both platforms can complete human labeling tasks really fast: a job of 10,000 records with 3 to 5 labelers per record can be done within one day. In addition, both platforms can apply built-in models to label the data automatically, reducing the labeling time and the cost of human labelers. Despite the common functionality, each platform has its own characteristics and customizability in terms of user interface, input and output, workflows, human labeling, automatic labeling and pricing.
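
When each record is labeled by several people, the raw annotations have to be combined into one label. The sketch below shows the simplest possible rule, a majority vote. It illustrates the idea only and is not the consolidation algorithm of either platform; `majority_label` is a hypothetical helper.

```python
from collections import Counter


def majority_label(annotations):
    """Return (label, agreement) for one record's raw annotations.

    Ties resolve to the label that appeared first in the list.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


# Three labelers looked at the same pair of records.
label, agreement = majority_label(["match", "match", "no_match"])
print(label, round(agreement, 2))
```

The agreement ratio is worth keeping alongside the label, since records with low agreement are natural candidates for a second look.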

Amazon SageMaker Ground Truth (ASGT)

User Interface: Users can manage the labeling jobs within AWS console as well as other Amazon SageMaker features like training jobs, which means the created training datasets can be easily imported into SageMaker for use in model development and training. Users can monitor the labeling progress in real-time from either the console or the output folder on S3.

Input & output: The input and output of the labeling job must be stored in specified JSON format on S3. It looks like only text and image input data are supported for now. The output folder stores both the raw annotations of labelers and the aggregated annotations.

Built-in workflows: There are four built-in labeling workflows on ASGT: object detection, image classification, text classification, and semantic segmentation. Each workflow has its own labeling tool and annotation consolidation algorithm. Users only need to provide input data in the required format and set up the instructions using the AWS console.

Customized workflow: In addition to the built-in workflows, users can launch a customized labeling workflow by creating the labeling interface and the lambda function for annotation consolidation.

Human labeling & labeler types: There are three types of human labelers: private team, Amazon Mechanical Turk and third-party vendors. The private team only contains labelers within the user’s private organization, e.g., a group of employees at Enigma. Amazon Mechanical Turk refers to public human labelers on Amazon’s workforce network. Third-party vendors are those who specialize in data labeling.

Automatic labeling: ASGT first creates a model based on a small set of labeled data given by the user, and then uses the model to label the input data automatically. Records that the model finds ambiguous are sent to human labelers, and the human-labeled data is then fed back to the model for active learning.

Pricing: ASGT charges per labeling job, based on the number of objects labeled, the type of workflow and the type of labelers.

Figure Eight (FE)

User Interface: FE has an easy-to-use interface for users to upload input data and download output data and reports. The web portal shows real-time progress as well as advanced analytics and plots. Users can determine specific labelers for private tasks and even monitor the progress and performance of each labeler. The labelers can also provide feedback in regards to labeling tasks to the users.

Input & output: FE accepts .csv, .tsv, .xls, .xlsx and .ods formats through the web portal or the RESTful API. Not only text and image, but also videos and audio are supported. The output data also contains both raw annotations and aggregated annotations.

Built-in workflows: FE has built-in templates of many popular tasks such as sentiment analysis, search relevance, data categorization, image annotation, speech recognition, data enrichment and data validation.

Customized workflow: FE allows its users to use its WYSIWYG editor to customize the workflow by creating multiple layers of conditional logic, adding custom JavaScript, sending annotated data to in-house models, and customizing data annotation workflow.

Human labeling & test question: FE has a unique phase at the beginning of the human labeling job: all the labelers participating in the job must first complete the test questions, a small set of ground truth data given by the user. Only those who pass the test questions are allowed to proceed to the labeling task, which helps ensure that the labelers understand the task precisely.

Automatic labeling: FE also has a similar ML-assisted labeling workflow combining model labeling and human labeling. Users can choose from multiple pre-trained models for different types of labeling tasks. More interestingly, FE allows users to create multi-job processes through the UI using logic-based routing rules between models and jobs to generate aggregated results.

Pricing: FE charges a company customer a flat rate per year based on the estimated amount of rows labeled. They may offer other pricing options.

Preparation

Before kicking off a labeling job on these platforms, we need to prepare the dataset to be labeled, the instructions and examples for the job, and set up templates if using customized workflow.

Data to be labeled

The data we use to generate the training data comes from the real-world public data sources we are trying to link. More specifically, the model we are building for our entity resolution framework identifies the relationship between two given company entities based on their common identifying attributes. Because this is the initial training dataset for our model, we want the training data to be evenly distributed in terms of difficulty. We also want to cover as many cases as possible and subsample each case in a well-balanced manner. Therefore, we first randomly generated pairs of entities (in practice we do not compare arbitrary pairs and are only interested in pairs that are likely related, but we also need negative samples to train the model), then sampled multiple subsets based on their pairwise similarities on each identifying attribute, making sure we covered different ranges of similarity and the possible combinations. In the end, we generated several datasets of different sizes from the same distribution. We needed the small datasets to test out the labeling platforms, and we went through some trial and error before we knew how to create a successful labeling job.
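The sampling idea can be sketched as follows. This is a minimal illustration rather than our actual pipeline: it assumes a single similarity score in the range 0 to 1 per pair, whereas the real sampling considered several identifying attributes, and the function name, bucket count, and data are hypothetical.

```python
import random
from collections import defaultdict

def sample_by_similarity(pairs, n_buckets=5, per_bucket=2, seed=0):
    """Group pairs into equal-width similarity buckets and draw up to per_bucket from each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pair_id, similarity in pairs:
        # Clamp so that a similarity of exactly 1.0 lands in the top bucket.
        index = min(int(similarity * n_buckets), n_buckets - 1)
        buckets[index].append(pair_id)
    sample = []
    for index in sorted(buckets):
        members = buckets[index]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample

# Hypothetical (pair_id, similarity) tuples, skewed toward low similarity
# the way randomly generated pairs tend to be.
skew_rng = random.Random(42)
pairs = [(i, skew_rng.random() ** 3) for i in range(1000)]
subset = sample_by_similarity(pairs, n_buckets=5, per_bucket=20)
```

Drawing the same number of pairs from every bucket keeps the rare high-similarity pairs from being swamped by the many low-similarity ones.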

Tips:

I assume you can get enough data for labeling on these platforms. If you can't, then it's probably not necessary to use them.
Make sure the data you are going to expose to the public labelers does not contain sensitive information.
Start with small labeling tasks so you can easily refine your instructions and adjust your dataset based on the resulting labels.
If you are collecting training data to improve an existing model, you may want to oversample the cases in which the model did poorly. For classification specifically, you can oversample the cases around the class boundaries.

Instructions

Because these platforms launch labeling jobs at scale, we need to make sure the labelers understand what we are expecting through the job instructions.

The process of preparing the instructions usually starts with enumerating all the possible cases the dataset might contain. By looking at the concrete data during this process, we realized that our original framing of the problem was unclear. We had planned to build a binary classifier to determine whether two companies are the same, but there are more relationships between two companies that we want to capture. We eventually changed the labeling task to a multi-class classification problem, which made the number of classes hard to determine and the instructions more complicated: more classes make it more difficult to draw the boundaries, while fewer classes can result in more ambiguous cases. We then refined our problem and instructions several times to get the expected labeling results. (See more details in Experiments.)

Since the labelers will only read the instructions for a short period of time, giving concise instructions is essential. ASGT allows users to provide both a short instruction and a full instruction. The short instruction highlights the most important rules in simple words, and the full instruction supplements it with detailed rules and more complex cases.

We highly recommend including examples in the short instructions, because examples communicate better than words, even though we can provide more examples in ASGT’s full instruction or FE’s “test question” phase.

Tips on labeling examples in general:

For categorization labeling tasks, each category should have one or two examples.
The examples should speak for themselves: easy to understand and precisely representative of the cases we want the labelers to deal with.
The examples should include all the edge cases: these are usually the cases where we need help from human intelligence the most.
Again, keep in mind that labelers only spend limited time on reading the instructions, so the examples should be as concise as possible.

Tips on providing examples on FE:

With FE, we can provide more examples outside the instructions and we can add comments for each example to further explain why we give a certain label.
Note that the examples and explanations should be consistent with the instructions.
For categorization labeling tasks, make sure each category is sufficiently represented by the test questions and all categories have balanced distributions.
Keep in mind that the labelers who pass the test questions you generate are supposed to succeed on the larger dataset.

Customized Templates

Our labeling task can use the built-in data categorization template on FE, but there is no suitable template available on ASGT (the closest one is the text classification template, but the built-in labeling tool cannot display our data well), so we have to customize our own workflow on ASGT. There are two parts we need to customize: the front end, which consists of the HTML interface that displays the instructions and a pre-processing Lambda function that passes the data to it, and the post-processing Lambda function that tells ASGT how to consolidate the annotations.

The HTML interface is easy to customize to display the instructions however we want, and it can import the data values that are defined and parsed from the input JSON file in the pre-processing Lambda function.

Figure 1 shows our pre-processing Lambda function template. The input data we provide is in the required JSON format for text data, i.e., each line is a complete and valid JSON object in which the text data to be labeled must be the value of source, and each source value must be a text string or a dumped JSON object.

Our input data looks as follows; the data object contains multiple attributes of the entity.

The Lambda function is used to load and parse the data object for the frontend (see lines 10–19 in Figure 1) so that the HTML can load the specific items of the data object (see lines 10–11 in Figure 2).
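To make the data flow concrete, here is a minimal sketch of a manifest line and a pre-processing handler. It is not the template from Figure 1. It assumes the Ground Truth custom-workflow contract in which the record arrives under event["dataObject"] and the function returns a taskInput object for the HTML template; the attribute names and the build_manifest_line helper are hypothetical.

```python
import json

def build_manifest_line(entity_pair):
    """Serialize one record as a JSON line whose "source" value is a dumped JSON object."""
    return json.dumps({"source": json.dumps(entity_pair)})

def lambda_handler(event, context):
    """Pre-processing step: parse the record and expose its fields to the HTML template."""
    record = json.loads(event["dataObject"]["source"])
    return {"taskInput": record, "isHumanAnnotationRequired": "true"}

# Hypothetical record; the attribute names are illustrative only.
pair = {
    "name_a": "Acme Holdings LLC",
    "name_b": "ACME Holdings",
    "address_a": "1 Main St, Springfield",
    "address_b": "1 Main Street, Springfield",
}
line = build_manifest_line(pair)

# Simplified stand-in for the event the platform sends to the function.
event = {"dataObject": json.loads(line)}
response = lambda_handler(event, None)
```

With a response shaped like this, the HTML template can refer to individual attributes of the record instead of displaying one opaque string.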

Figure 1. Pre-processing Lambda Function Template

Figure 2. HTML Interface Template

In the post-processing function (Figure 3), we can parse the raw human responses and consolidate them with our own algorithm. We introduce some common algorithms in the following sections, but here we used the function from a Demo Template, which simply writes all the available information, such as the labelers’ IDs and their responses for each record, to the output file. The benefit of this function is that it saves the full results during the labeling job, so we can apply any consolidation function offline afterwards.

Figure 3. Post-process Lambda Function Template

Post-processing

After the labeling job is complete, we need to verify the quality of the resulting labels before using them for training. For labels that the platform aggregates with a confidence score attached, we can subsample the results at different confidence levels and manually check the labeling accuracy. Usually the labels with high confidence scores are safe to use, while the ones with low confidence scores may need to be double-checked. Note that these controversial records are worth checking manually, as they are important to model building.

Since the platforms also provide the full annotations of each labeler, we can post-process the full annotations in whatever way we want. The most intuitive way is majority voting. There are also more advanced aggregation approaches, such as the Dawid-Skene algorithm and its variations, which take the prior reliability of each labeler into account. ASGT also ran some experiments comparing Majority Voting with a Modified Dawid-Skene function, and found that Modified Dawid-Skene is more robust than Majority Voting when the number of labelers varies.
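Majority voting over the full annotations can be sketched in a few lines. This is an illustrative version, not the platforms' consolidation code; the label names are hypothetical, and returning no label on a tie is one possible design choice.

```python
from collections import Counter

def majority_vote(annotations):
    """Return (label, agreement) for one record, or (None, agreement) when the top labels tie."""
    counts = Counter(annotations).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(annotations)
    if len(counts) > 1 and counts[1][1] == top_count:
        # No strict majority winner: flag the record for manual review.
        return None, agreement
    return top_label, agreement

label, agreement = majority_vote(["same", "same", "related"])
tied_label, tied_agreement = majority_vote(["same", "different"])
```

The agreement value doubles as a rough confidence score: records with low agreement are the controversial ones worth checking by hand.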

Experiments

We did several labeling experiments using the two platforms with different settings in the following aspects—

The platform
The instructions
The number of categories for the data categorization problem
The size of the dataset
The number of labelers

—to see how the settings of the labeling job will affect the quality of the labeling results.

Figure 4. Results of labeling jobs with different settings

Figure 4 shows the results of the different labeling jobs. For each job, we measure how often the labelers agreed on a record. The blue bar represents the proportion of records that received the same label from all of their labelers, the yellow bar the proportion that received the same label from 4 of 5 labelers, and the red bar the proportion on which only 3 (of 5) or 2 (of 3) labelers agreed. The name of each labeling job reflects its settings. “SM” marks jobs run on SageMaker and “FE” marks jobs run on Figure Eight. The first number in the name is the size of the dataset being labeled, and jobs with the same number used the same dataset. “xCyL” means that the data has x categories to distinguish and y labelers per record. A job whose name ends with “R” used refined instructions.
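The agreement proportions behind a chart like this can be computed from the raw annotations. The following is a minimal sketch with made-up labels, not the code used to produce Figure 4.

```python
from collections import Counter

def agreement_distribution(records):
    """Map (labelers agreeing, labelers total) to the share of records at that level."""
    levels = Counter()
    for annotations in records:
        # Size of the largest group of labelers that chose the same label.
        top_count = Counter(annotations).most_common(1)[0][1]
        levels[(top_count, len(annotations))] += 1
    total = len(records)
    return {level: count / total for level, count in levels.items()}

# Hypothetical annotations for four records with five labelers each.
records = [
    ["same", "same", "same", "same", "same"],
    ["different", "different", "different", "different", "different"],
    ["same", "same", "same", "same", "related"],
    ["same", "same", "same", "related", "different"],
]
distribution = agreement_distribution(records)
```

Here half of the records are unanimous, a quarter have 4 of 5 labelers agreeing, and a quarter have 3 of 5.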

We can learn from the figure that—

• 3-class caused less confusion than 4-class in our problem.

> e.g. SM_1000_3C5L vs. SM_1000_4C5L

• Our refined instructions cleared up some confusion for the labelers.

> e.g. SM_1000_3C5L vs. SM_1000_3C5L_R

• A larger dataset usually has a statistically higher ratio of agreement.

> e.g. SM_1000_3C5L_R vs. SM_10000_3C5L_R vs. SM_100_3C5L_R

> e.g. FE_100_3C3L_R vs. FE_1000_3C3L_R

• 5-labeler does better than 3-labeler on SageMaker.

> e.g. SM_100_3C5L_R vs. SM_100_3C3L_R

• 3-labeler on FE is comparable to 5-labeler on SageMaker, probably thanks to the "test questions" feature of FE.

> e.g. SM_100_3C5L_R vs. FE_100_3C3L_R

> e.g. SM_1000_3C5L_R vs. FE_1000_3C3L_R

Takeaways

Which platform is right for you?

You may want to choose Amazon SageMaker Ground Truth if any of the following is true:

You want to import the data for labeling from other SageMaker or AWS services, or export the labeled data to other SageMaker or AWS services.
You prefer to pay per task
Your labeling task fits one of the built-in workflows on ASGT perfectly
You are looking for specialized human labelers (vendors)
You place more trust in Amazon's tooling, e.g. their built-in active learning model or annotation consolidation algorithms

You may want to try Figure Eight if any of the following is true:

You are not familiar with SageMaker or AWS, and prefer an easy-to-use UI
Your data contains audio or videos
Your task fits one of the built-in workflows on FE perfectly
You prefer to have the “test question” phase in the workflow
You prefer to pay on a yearly basis

What makes a successful labeling job?

Well-defined problem
Well-distributed dataset, depending on the model’s need
Concise instructions
Representative examples, plus well-explained examples on FE
Iterative refinement of all the above based on the analysis of labeling results
Proper workflow, either perfect-fit built-in workflow or proper-customized workflow

AWS Startup Series: Staying Positive in an Unfunded Startup

Enigma — Thu, 27 Jun 2019 00:00:00 GMT

Hicham Oudghiri, Co-founder & CEO of Enigma, sits down with Lindsay Davis, CB Insights, to discuss the financial realities of starting a tech business and how to get funded.

From maxing out credit cards to doing odd jobs, Hicham shares how he and his remote co-founder were able to bootstrap and re-invest the profits from side-gigs to fund their dream business. He covers how they stayed lean and were able to gain enough traction to eventually catch the eye of their first real investor.

Now, with a number of rounds under their belt and $130M in funding, Hicham shares lessons for founders going at it alone. He talks about the importance of remaining positive, setting realistic goals, and how to stay steady until you’re able to go all-in.

Techonomy NYC: Fighting Global Atrocities—with Data

Enigma — Mon, 10 Jun 2019 00:00:00 GMT

Enigma's CEO, Hicham Oudghiri, talks with Dan Costa, Editor-in-Chief of PC Mag at Techonomy NYC, to share the ways in which Enigma is working to create a model of the world and combat global atrocities like human trafficking with data.

TF-IDF for tabular data featurization and classification

Ben Dilday — Thu, 06 Jun 2019 00:00:00 GMT

At Enigma, the daily work is driven by knowledge discovery across thousands of public datasets. These data sets come from a variety of sources and vary in areas such as data quality, completeness, and formatting conventions. Therefore, categorizing that data is a crucial step in the analysis process.

This post describes some work we’ve done at Enigma to classify tabular data using natural-language processing (NLP) techniques. A key feature of the data we’re working with is a high occurrence of placeholder values, which introduces some interesting differences compared to NLP for human-language data.

Data Types

Data in the wild varies in how difficult it is to identify accurately. For example, emails have a strict format, so identifying a piece of data as an email is relatively straightforward. A somewhat more complex data type is a phone number, which has a less uniform structure. On the far end of the complexity scale there are more free-form data types such as organization names and street addresses.

Classification of data

Featurization of Cell Level Values

In our approach toward modeling data-type in a database column, we begin by featurizing the cell-level values. Many of these features are Boolean values that capture our human intuition about what’s important in determining the data type of a set of values. For example, in order to help classify emails we might generate features that say, “true or false - this value contains an @ symbol”, or “true or false - this value matches an email-format regex”. For phone numbers we might have features like “true or false - this value begins with a plus sign”, or “this value contains between 7 and 10 digits”. In practice we have dozens of such features that serve to describe a data point.

Database-Specific Complications

One complication of trying to classify database columns is the presence of placeholder values. For example, a column may be categorized as containing phone numbers, but contain a mix of authentic phone numbers and placeholders that signify missing values. If the placeholders are null or an empty string, then they’re straightforward to identify and handle, but in practice different databases might choose different values, e.g. “—”, “.”, “NONE”, etc. It’s not practical to keep a list of all possible placeholder values; nor would it be useful, since they can vary from system to system.

Useful natural-language processing (NLP) concepts for tabular data classification

Stop Words

Placeholder values in database columns are conceptually similar to stop words in natural language processing (NLP). That is, words such as “the”, “of”, “a”, etc., that are common in documents of any topic and therefore don’t carry any semantic significance. A common practice in NLP is to identify a language- or domain-specific set of “stop words”, and remove them from the data before doing any additional analysis. In the database application, this would correspond to identifying common placeholder values and removing them before analyzing the column. In principle this is a theoretically justified and practically useful step to take, but the complications lie in choosing a threshold of occurrence frequency for defining a stop word, and in generating new stop word lists for each new database.

TF-IDF

A related analysis tool in NLP is term-frequency inverse-document-frequency (TF-IDF) which provides a concise way of representing the content of a document.

TF-IDF counts a term as being important to the identity of a document the more that term appears - this is the term-frequency component. However, the importance of a term is lessened according to the frequency that it occurs throughout the whole document set - this is the inverse-document frequency component, which effectively discounts terms that don’t have specificity. For example, if a document set comprises biographies of famous mathematicians, and cocktail recipes, the term “proof” may appear in all the documents, whereas a term like “calculus” would be more likely to appear in the mathematics articles. The TF-IDF approach can be viewed as a compromise between the extremes of giving all terms the same significance, and removing common terms entirely, as in the stop-words case (although in NLP work, stop-word removal and TF-IDF vectorization are both typically used). For classification work, using TF-IDF is appealing because it doesn’t necessarily require making a hard decision whether a term is in the stop-words list or not. Additionally, if there are class-specific placeholders, for example “1969-12-31” as a date-type placeholder, then the TF-IDF approach can make better use of that information than an approach that removed the class-specific placeholders as stop words.

More generally, the term and inverse-document frequencies can be thought of as local and global frequencies. Accordingly, there are many plausible ways of defining them, and we’ll discuss a few below.

A TF-IDF Experiment for Database Columns

The considerations described above led us to experiment with applying TF-IDF to database columns. As mentioned above, this concisely represents a set of values, and has the advantage of automatically accounting for commonly occurring placeholder values without requiring a custom stop-words list.

TF-IDF was developed to analyze human language, and there are a number of substantive differences when applying it to a database. For one, the term frequency distribution tends to be much more bifurcated. That is, for many data types, a term is either unique, or nearly so (phone numbers, IP addresses, UUIDs, etc.) or exceptionally common (placeholders). The other primary difference is that the percentage of terms in a document that are made up by placeholders can be much higher than the corresponding percentage of stop words in a natural-language document. It’s not unheard of for a database column to contain greater than 99% placeholder values - the natural language equivalent would be if a Wikipedia article about musical instruments said “the the the the… (thousands of times) … the bassoon clarinet drum”. In the case of classifying data type, even a few meaningful values in a column can be distinctive enough to allow classification. As shown below, this suggests a sub-linear term-frequency function would be helpful.

A Working Example

To illustrate this concept further, I’ll generate a couple simulated data sets, and run through what the analysis looks like. A Jupyter notebook accompanies this post that includes all the details.

Simulated data

Simulated Classes

The classes I simulate are:

email
phone number
UUID
float number
binary number

For email, phone number, and UUID, I use the Faker Python library, which provides generators of synthetic data for a variety of data classes. For float numbers, I generated random values from a uniform distribution in the range -180 to 180. For the binary numbers, I generated a sequence of 0s and 1s, each with a probability of 1/2.

The email and UUID classes represent data types that are relatively easy to identify, and the phone number class a data type that’s less easy to identify. The float and binary number classes represent more generic data types.

Simulated Placeholders

As simulated placeholder values, I use “-”, “”, and “0”. The frequency of placeholder values in each column is 90%. Note that “0” is both a placeholder in some contexts, and a valid “binary number” in others.
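A simulated column of this kind can be sketched with the standard library alone. This is an illustration rather than the notebook's code: it covers only the float and binary classes, and it assumes that each placeholder row picks one of the three placeholder values uniformly at random, which the post does not specify.

```python
import random

PLACEHOLDERS = ["-", "", "0"]

def simulate_column(generate_value, n_rows=1000, placeholder_rate=0.9, seed=0):
    """Build a column in which roughly placeholder_rate of the rows are placeholders."""
    rng = random.Random(seed)
    column = []
    for _ in range(n_rows):
        if rng.random() < placeholder_rate:
            column.append(rng.choice(PLACEHOLDERS))
        else:
            column.append(generate_value(rng))
    return column

def random_float(rng):
    """Uniform float in the range -180 to 180, rendered as a string cell."""
    return f"{rng.uniform(-180, 180):.6f}"

def random_binary(rng):
    """A 0 or a 1, each with probability 1/2."""
    return str(rng.randint(0, 1))

float_column = simulate_column(random_float)
binary_column = simulate_column(random_binary, seed=1)
```

In the binary column a “0” cell is ambiguous by construction: it may be a placeholder or a genuine value.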

Featurizers

The Boolean featurizers for this analysis comprise:

an email regex
a check for phone number format
a check for UUID format
a check for integer format
a check for floating point numerical format
a check that the length is exactly 1

The technical definitions for these are provided in the accompanying Jupyter notebook.

As an example, the features generated for the input “-” are:

<div class="code-wrap"><code>{'email': False, 'uuid': False, 'phone_number': False, 'is_float': False, 'is_int': False, 'is_len1': True}</code></div>

and for “18005551212”:

<div class="code-wrap"><code>{'email': False, 'uuid': False, 'phone_number': True, 'is_float': True, 'is_int': True, 'is_len1': False}</code></div>

These sets of binary features are converted to “words”, by casting the Boolean values as 1s or 0s and then forming a string out of them. For example, the “words” defined by the features of “-” and “18005551212” shown above are 000001 and 001110, respectively. The database columns are then interpreted as “documents”, where the terms are these feature “words”.
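The conversion from values to feature “words” can be sketched as below. The regular expressions and the digit-count rule are simplified assumptions of mine, not the definitions from the notebook; they are chosen only so that the two example inputs above produce the same words.

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
PHONE_RE = re.compile(r"\+?[\d\s().-]{7,20}")
INT_RE = re.compile(r"[+-]?\d+")

def is_float(value):
    """True if Python can parse the string as a floating point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def featurize(value):
    """Boolean features for one cell, in a fixed order."""
    digits = sum(ch.isdigit() for ch in value)
    return {
        "email": EMAIL_RE.fullmatch(value) is not None,
        "uuid": UUID_RE.fullmatch(value) is not None,
        "phone_number": PHONE_RE.fullmatch(value) is not None and 7 <= digits <= 15,
        "is_float": is_float(value),
        "is_int": INT_RE.fullmatch(value) is not None,
        "is_len1": len(value) == 1,
    }

def to_word(features):
    """Cast each Boolean to 1 or 0 and join them into a single string."""
    return "".join(str(int(flag)) for flag in features.values())

def column_to_document(column):
    """Interpret a column as a document whose terms are feature words."""
    return [to_word(featurize(value)) for value in column]

document = column_to_document(["-", "18005551212", "", "0"])
```

Every placeholder collapses to one of a handful of words, while authentic values of a given type tend to share a different word, which is what makes the resulting documents comparable across columns.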

Results

After organizing the data into documents as described above, we apply TF-IDF vectorization, using scikit-learn. The TF-IDF model uses a “smooth IDF” formulation, which adds one to both the total document count and the document count of each term, as if one extra document containing every term had been seen; that is, it computes the inverse document frequency as log((1 + N) / (1 + n_t)) + 1, instead of the unsmoothed log(N / n_t) + 1. In the preceding, N is the total number of documents and n_t the number of documents in which a given term t occurs at least once. In practice the next step in a classification analysis would typically be to train a machine learning model (e.g., random forest, neural network) using the components of the TF-IDF vectors as input features. However, that’s an unnecessary step for the purposes of exploring the TF-IDF approach here. Instead, we apply the t-distributed stochastic neighbor embedding (t-SNE) algorithm to visualize the similarities between TF-IDF vectors, within and between data classes. The plots below are in t-SNE coordinates, which have no inherent meaning - what’s important is the relative positioning between points, not the values of their coordinates.

Figure 1 shows the result of applying TF-IDF to the actual terms of the database columns. Not surprisingly, the bifurcated distribution of term frequency prevents the TF-IDF algorithm from making useful distinctions among data classes.

Figure 1

Figure 2 shows the result of applying TF-IDF to documents that are composed of the featurized words, as opposed to the actual terms. We see better separation of the data classes. Note that the clusters of phone number and binary number classes are near one another, indicating that there’s an inherent overlap between integers and phone numbers without string delimiters.

Figure 2

Finally, Figure 3 shows the result of the TF-IDF vectorization of documents of featurized words, with the modification of using a binary measure of local frequency, i.e. a word is counted with weight 1 no matter how many times it occurs in a document. In the database application, this is a way of applying a sub-linear term-frequency, thereby down-weighting frequent terms more strongly than might be necessary in a natural-language setting. We see the best separation of classes with this formulation.
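The effect of a binary local frequency can be seen on a toy example. The sketch below is a pure-Python stand-in for the scikit-learn vectorizer, written under the assumption of a smoothed IDF of log((1 + N) / (1 + n_t)) + 1 and L2-normalized vectors; the two toy columns are made up. In scikit-learn itself, TfidfVectorizer exposes comparable switches through its binary and sublinear_tf parameters.

```python
import math
from collections import Counter

def tfidf(documents, binary=False):
    """TF-IDF vectors with a smoothed IDF and L2 normalization."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        weights = {}
        for term, count in Counter(doc).items():
            # Binary local frequency: a term counts once however often it occurs.
            tf = 1.0 if binary else float(count)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            weights[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Two toy columns as feature words: 98 placeholders plus 2 informative values each.
phone_column = ["000001"] * 98 + ["001110"] * 2
email_column = ["000001"] * 98 + ["100000"] * 2
documents = [phone_column, email_column]

raw = tfidf(documents)
flat = tfidf(documents, binary=True)
raw_similarity = cosine(raw[0], raw[1])
flat_similarity = cosine(flat[0], flat[1])
```

With raw counts the shared placeholder dominates both vectors, so the two columns look almost identical; with the binary measure the informative words carry most of the weight and the columns separate.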

Figure 3

Conclusion

We have described a non-standard application of TF-IDF vectorization to a “non-natural” language, namely, database columns. We see that TF-IDF naturally accounts for placeholder values that can be common in a database environment. However, the distributional properties of terms within database columns are markedly different than those of natural-language. This means that performance is improved if a sub-linear term-frequency function is used. In particular, a binary term-frequency function is suitable for this example. Although this post describes the essential aspects of some experimentation we’ve done, in our day-to-day work at Enigma we extend the concepts presented here in several ways, including using a more complex feature set and applying supervised classification algorithms.

Managing AWS Accounts at Scale

Sean Lingren — Tue, 21 May 2019 00:00:00 GMT

At Enigma we store and process sensitive data for our clients that we're committed to protecting. In order to meet a wide range of client and compliance security requirements we manage more than 30 AWS accounts, each with a different function and purpose. This blog post will outline how the Infrastructure team manages AWS accounts at scale while still providing a simple interface for our developers and clients to interact with the data.

AWS Organizations

We use a single AWS Organization to manage all of our AWS accounts. Within our organization there are three kinds of accounts: management, internal, and client. Management accounts are tightly protected and contain company wide resources like VPN, monitoring, and source control. Internal accounts contain all Enigma infrastructure, such as the resources running Enigma Public and our shared data pipelines. Lastly, client accounts contain all client specific data and resources. These accounts are the most siloed, and access is only granted to those directly involved in the work.

Managing Accounts with Terraform

In order to manage a large and growing list of clients we have developed a controlled workflow with terraform that empowers any developer to request a new account and quickly get one back that meets all of Enigma's strict compliance and security requirements.

All of our accounts and their shared resources are managed in Terraform, in a single repo that is applied with GitLab CI/CD on merge to master. Direct commits to master on this repo are forbidden and merge requests must be approved by a member of the Infrastructure team.

The first step in the new account process is either a ticket or a direct merge request to the subaccounts repo. The terraform for the MR looks something like this—

<div class="code-wrap"><code>##########################
### Client - Example ###
##########################

module "subaccount_client_example" {
  source = "git::https://git.com/terraform-modules/subaccount.git//"
  name   = "client-example"
  email  = "aws+client-example@example.com"
}

module "subaccount_client_example_base" {
  source          = "git::https://git.com/terraform-modules/subaccount-base.git//"
  account_id      = "${ module.subaccount_client_example.id }"
  root_account_id = "${ var.root_account_id }"
  account_alias   = "client-example"
  account_type    = "client"
  tags            = "${ var.tags }"
}</code></div>

You can see that we have two core modules that are consumed by our subaccounts repo. The first is the subaccount module, which creates a new AWS Organization subaccount using the aws_organizations_account resource and outputs a list of account attributes.

The second module, subaccount-base, does the work of provisioning the account with all of Enigma's base account resources using the new OrganizationAccountAccessRole role. Here we define the account alias, IAM roles, SSO provider, KMS keys, AWS Config rules, and other mandatory resources that secure our accounts.

Managing Access to Subaccounts

Once the merge request is approved and merged into master, GitLab will apply the terraform and output any changes. The account is created and compliant with all Enigma policies, but how do we grant access to the account?

At Enigma we use Okta as an SSO solution, and terraform has a third-party module that supports Okta resources. With this combination we can set up an Okta identity provider in every new account, Okta groups that provide access to that account, and Okta rules that add predefined end users to those groups.

After applying subaccount-base, end users can request access to one of 5 roles through Okta: Admin, EnigmaAdmin, PowerUser, ReadOnly, or ViewOnly.

A critical detail here is that no one outside of the Infrastructure team is granted Admin access, and that the EnigmaAdmin role has admin permissions except for an IAM Permissions Boundary that prevents the role from removing any subaccount-base resources (cloudtrail, config) or creating any resources that we already have modules for (vpc).

Because of the fast moving and ad hoc nature of Enigma's client commitments most engineers and data scientists are given PowerUser access, which lets them rapidly prototype with client data while still maintaining a secure AWS environment that separates billing, networking, and API access.

Encouraging Developers to Write Terraform

Terraform is a great tool that can make infrastructure easier to manage and reason about. However, it has a significant learning curve that is often too much for a developer who just wants to create a database and start analyzing data.

At Enigma we want our developers to write their own terraform for any long-lived production infrastructure, so as part of the subaccount-base module we create a new GitLab repo with pre-templated terraform code that bootstraps remote state management and some other non-critical resources in the new AWS account. The new repo is set up just like all of our other terraform repos, with a protected master branch and CI/CD that plans MRs and applies merges to master.

This lets our developers get started with terraform easily: they don't have to worry about local setup or remote state, and they can copy from existing templates in the repo.

Next Step

As Enigma continues to grow, and we move more and more client commitments to the cloud, the infrastructure challenges will only get more complex.

How do we manage access to shared APIs? How do we keep data segregated but still accessible? How can we scale beyond Terraform, to thousands of different customers? The Infrastructure team is responsible for all these challenges. Come join us!

Navigating Directed Graphs

Erick Katzenstein — Mon, 13 May 2019 00:00:00 GMT

Introduction

Navigating a large directed graph is an exercise in untangling a hairball. Take the complexity of a city and remove all of its urban planning — at Enigma, this is often the convolution we’re dealing with in our datasets. In the following paragraphs we explore how to unravel these tangled messes and get to tangible insights.

Force-Directed Graphs

The visualization above represents committee-to-committee transactions for the 2017–2018 election cycle (source: FEC). Looks like a lot to figure out, but here’s the kicker: this graph represents just 2,000 transactions, while the dataset has over 850,000 total committee-to-committee transactions.

This chart is a force-directed graph (FDG), a clever tool for visualizing complex networks with a physics-based simulation. An FDG can be an effective tool for capturing the bird's-eye view of a network topology. One can use it to spot local clusters, densities, and overall distribution.

While an FDG is a good summary of a network, it falls short of comprehensive. The above network is useless to study as an interface — we need to use network algorithms to whittle down to valuable areas of focus, and then use different visualizations for more thorough analysis. The case study below is an example of this process.

Adding Context

Leading into the mid-term elections of 2018, we created a tool to study committee-to-committee transactions called PAC Paths. We first built a directed graph with FEC data that represented all transactions, and then applied Dijkstra’s algorithm to find the shortest weighted path between two committees.
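To make the shortest-path step concrete, here is a minimal sketch of Dijkstra's algorithm over a small directed graph. It is not the PAC Paths implementation: the committee names and edge weights are hypothetical, and how the real tool derives a weight from a transaction is not stated here, so the weights are treated as arbitrary non-negative costs.

```python
import heapq


def shortest_weighted_path(graph, source, target):
    """Dijkstra's algorithm on a directed graph given as {node: {neighbor: weight}}.

    Returns (total_weight, [source, ..., target]), or (float("inf"), []) when
    the target cannot be reached. Weights must be non-negative.
    """
    best = {source: 0.0}      # cheapest known cost to reach each node
    previous = {}             # back-pointers used to rebuild the path
    queue = [(0.0, source)]   # min-heap of (cost so far, node)
    visited = set()
    while queue:
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue          # stale heap entry; a cheaper one was already handled
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


# Hypothetical committees and edge weights, for illustration only.
transactions = {
    "Committee A": {"Committee B": 1.0, "Committee C": 4.0},
    "Committee B": {"Committee C": 1.5, "Committee D": 5.0},
    "Committee C": {"Committee D": 1.0},
    "Committee D": {},
}
cost, path = shortest_weighted_path(transactions, "Committee A", "Committee D")
```

Because the graph is directed, a path from A to D says nothing about a path from D to A, which is why direction matters so much in the interface discussion that follows.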

Using the tool, one can determine if any two committees are connected in election cycles dating back to 1979. We found some interesting cases, like the American Medical Association connected to Big Tobacco and End Citizens United to Ted Cruz for Senate. It’s likely that committee A and committee B are not aware of their connection, but the density of committee-to-committee transactions often leads to counterintuitive results.

A screenshot of the PAC Paths app.

Above is a screenshot of the interface, and you won’t find a force-directed graph in the tool at all. By applying the shortest path algorithm, we simplified the density of the network substantially, enabling the study of the connecting path rather than a tangled mess. Instead of FDG, the interface has a simple linear connection (left-side) coupled with a radial tidy tree for a selected committee (right-side).

Directed Graph Interfaces

The above screenshot is a simplified design for a complex graph. Now we’ll review the hybrid of a force-directed-graph and the PAC Paths interface. If we recall the wild network graph from the first paragraph, the visualization below should be a breath of fresh air.

This is the result of running Dijkstra’s algorithm and limiting the force-directed-graph to the four committees that define the path from End Citizens United to Ted Cruz for Senate. While the force-directed graph above reveals good information (DAYPAC has more connections in common with End Citizens United than with Ted Cruz for Senate), it still doesn’t provide the whole picture. To review a few issues:

Direction

The direction of these transactions is crucial (Big Tobacco giving to the American Medical Association would be a different signifier than the other way around), yet we haven’t represented it in the interface above. We can add arrows to each edge to represent direction, but this could quickly become unwieldy and illegible.

Additional Parameters

Suppose one wanted to take a closer look at a committee’s transaction amount compared to all other transactions made by that committee. How might this be represented? How about committee party affiliation? Or the state in which the committee is headquartered? Additional parameters don’t transfer well to a force-directed-graph.

Radial Tidy Tree

A radial tidy tree remedies some of these issues (at least for the study of a particular committee). We can see a radial tidy tree as the zoomed in version of a particular node in the force-directed-graph. By transforming a point into a circle, multiple dimensions are represented through polar coordinates.

Direction

The left-side represents incoming transactions (only one in the image below) while the right-side represents outgoing transactions.

A radial tidy tree to represent direction of transactions (incoming on the left, outgoing on the right).

Additional Parameters — Bundling

By using bundling, the user can group the incoming and outgoing transactions based on a parameter for study (in this case, amount, state, and party affiliation).

The four diagrams above are representative of the same committee, but the bundling parameter is different in each instance. The radial tidy tree gives multiple tiers of data for free, offering spatial and intuitive categorization.

Connecting the Interface

The gif below represents a tool that transitions from the force-directed graph to a radial tidy tree for study. The network view is a standard force-directed graph. In the path view, the shortest weighted path is represented with a serial-radial tidy tree diagram to give a glimpse of each committee’s connections. These committees can be re-bundled by changing the relevant parameters, and clicking on the central node will zoom in to a specific committee. You can experiment with the live interface here.

Summary

In our process, we’ve observed that analysis of a network graph is a series of incremental decisions: determining when to apply powerful algorithms, when to review and analyze their results, and when to solidify insights. The interface should therefore be tiered in parallel with these decisions. By developing a tool (and a visualization library) that’s intuitive and somewhat playful, the analysis can hold the user’s attention, and ideally enable the digestion of the hairball.

Scaling a Pandas ETL Job to 600GB

Ezzy Sriram — Wed, 08 May 2019 00:00:00 GMT

We all know the convenience that comes with processing a Pandas DataFrame at sweet in-memory speed. Recently, I was working with a small sample of company data in a city. I had written a quick Python script to load raw data from a Postgres table, transform and clean the raw data for some downstream machine learning processes, and write the cleaned data to another Postgres table. The ETL job ran every couple of weeks to add new cities, but then the client asked us to scale the job to every city in the United States. With the current script running locally on my MacBook, this would have taken an impractical 120 hours, or five full days of nonstop runtime.

Given the exploratory nature of our work, it was more than likely that we’d have to tweak our data cleaning / transform process several times in the near future. We needed a sub-24 hour runtime to allow for quick iterations and pivots. Here’s how we scaled our simple Pandas ETL workflow to process 600GB of data.

Our Options

Dask. A framework that allows for easy parallelization of existing NumPy, Pandas, and scikit-learn operations, but we didn’t want to spend the time required to spin up our own cloud cluster using Kubernetes.
Spark. This option would have been less infrastructure heavy considering Enigma’s strong familiarity with Databricks and AWS EMR, but the time and bugs that came with porting our Pandas code over to RDDs forced us to take another look at reusing our existing Pandas workflow.
Scaling our existing Pandas job to process 600GB of data in parallel chunks.

We Chose Option 3

We settled on using the existing Pandas ETL Job. Why? We wanted to avoid the inevitable errors that would come with porting over our Pandas ETL workflow to Spark and we preferred a solution with existing and familiar infrastructure. We knew that we could hit a 2–6x speed increase by adding parallelization to the ETL script and rely on horizontal scaling across EC2 boxes if we needed quicker runtimes.

Architecture Overview

Here’s the breakdown of our approach. We had 600GB of the raw data in a Postgres table. We wanted to allow scaling to any number of “compute nodes” (in our case these were EC2 boxes with our script downloaded). Since the script utilized multiprocessing to operate on multiple chunks of data in parallel, we needed some way to partition the data into evenly sized chunks.

Partitioning and ETL Job Metadata Table

Since we were processing data in the United States, we decided to treat all the data within one zip code as one chunk. There are over 40,000 zip codes, so this created chunks small enough to parallelize while keeping chunk sizes reasonably consistent.

We created a Postgres table to hold the metadata across all the compute nodes’ ETL jobs. Whenever the script was ready to consume another zip code, it queried the table to figure out a zip code that had yet to be consumed by a job.

If an ETL job found Enigma HQ’s zip code (10016) was available, for example, it would set that row’s status to “RUNNING,” and hit the Postgres table with the raw data to find the data located at 10016 zip code.
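The claim step described above can be sketched as follows. This is an illustration under stated assumptions, not our production code: the `job_metadata` table, its `status` column, and the `PENDING` value are hypothetical, and the standard-library `sqlite3` module stands in for Postgres so the example runs on its own. On Postgres, a common way to get the same effect across many nodes is `SELECT ... FOR UPDATE SKIP LOCKED` inside a transaction.

```python
import sqlite3


def claim_next_zipcode(conn):
    """Mark one unclaimed zip code as RUNNING and return it, or None if none remain."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT zipcode FROM job_metadata WHERE status = 'PENDING' "
            "ORDER BY zipcode LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Re-check the status in the UPDATE so a row claimed by another worker
        # between the SELECT and the UPDATE is not claimed twice.
        cursor = conn.execute(
            "UPDATE job_metadata SET status = 'RUNNING' "
            "WHERE zipcode = ? AND status = 'PENDING'",
            (row[0],),
        )
        return row[0] if cursor.rowcount == 1 else None


# Hypothetical metadata table with two unclaimed zip codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_metadata (zipcode TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO job_metadata VALUES (?, 'PENDING')", [("10016",), ("10017",)]
)
conn.commit()

first = claim_next_zipcode(conn)
second = claim_next_zipcode(conn)
third = claim_next_zipcode(conn)  # nothing left to claim
```

A worker that gets `None` back either has nothing left to do or lost a race for the row, so it can simply ask again or exit.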

Performance Enhancements

Our approach was a quick solution to help us scale our Pandas ETL workflow and keep our clients happy, and it provided a number of interesting performance enhancements—

We reduced Pandas DataFrame memory usage by 50% by downcasting the default types. Read more about downcasting here.
We used a Python memory profiler called mprof to measure the memory usage of a data chunk’s DataFrame compared to its raw size in Postgres. We needed to be sure of the DataFrames’ sizes in order to put a ceiling on the number of parallel processes in the script. The last thing we wanted was out-of-memory errors cropping up all over the place. Here is a guide that shows how ridiculously easy it is to set up mprof.
Using psycopg2 rather than Pandas’ .to_sql() function to write DataFrames to the database saved time by an order of magnitude. Why? By default, Pandas’ “to_sql()” function issues a SQL insert statement for each row in the DataFrame, so it is inefficient in terms of both SQL and network I/O. We used psycopg2, a popular PostgreSQL adapter for Python, to leverage Postgres’ efficient COPY command and bulk insert data. Read more about copying data via psycopg2 here. You can find other “bulk” insert approaches here.
The Postgres DB box and EC2 boxes were located in close proximity to avoid unneeded network I/O. This was more of a side effect of our infrastructure at Enigma, but since our Postgres box and EC2 instances were hosted in the same AWS region, we didn’t need to deal with the network latency that would have come with, say, a database hosted elsewhere.
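As an illustration of the downcasting item above, here is a small sketch using `pd.to_numeric` with its `downcast` argument. The column names and data are made up, and the savings you see will depend on your own value ranges, so treat the numbers it produces as an example rather than a reproduction of our 50% figure.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with numeric columns shrunk to the smallest dtype that fits."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="floating").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical data: both columns start out as 64-bit types.
df = pd.DataFrame({
    "employee_count": np.arange(1000, dtype="int64"),
    "revenue": np.linspace(0.0, 1.0, 1000),
})
small = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Downcasting floats trades precision for memory, so it is worth checking that downstream calculations tolerate 32-bit values before applying it everywhere.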

Testing Setup

I removed it from the code below in the interest of offering a clean template to copy, but testing the ETL module was a crucial element for us to move quickly. We created TEST_FLAG and DEBUG_FLAG parameters for the ETL script.

DEBUG_FLAG told the script to avoid writing data to the database. This allowed us to debug parts of the ETL job without worrying about accidental side effects on the real data being cleaned.
TEST_FLAG told the script to use a small Postgres table containing a sample of the actual raw data. It then ran through the script, wrote the cleaned data to a test table, and finally compared the result to another database table holding the expected, source-of-truth output of a correctly functioning ETL job.
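One way to wire up two flags like these is to resolve them once into a small configuration object that the rest of the script consults. The sketch below is hypothetical: the table names and the `RunConfig` fields are invented for illustration and are not taken from our script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    source_table: str       # where raw data is read from
    target_table: str       # where cleaned data would be written
    write_to_db: bool       # False when debugging, so nothing is written
    compare_to_truth: bool  # True when the output should be checked


def resolve_run_config(test_flag=False, debug_flag=False):
    """Translate the two flags into the tables to use and whether to write."""
    if test_flag:
        # Hypothetical sample and test tables
        source, target = "raw_data_sample", "cleaned_data_test"
    else:
        source, target = "raw_data", "cleaned_data"
    return RunConfig(
        source_table=source,
        target_table=target,
        write_to_db=not debug_flag,
        # Comparing against the source of truth only makes sense if we wrote output.
        compare_to_truth=test_flag and not debug_flag,
    )


production = resolve_run_config()
test_run = resolve_run_config(test_flag=True)
debug_run = resolve_run_config(debug_flag=True)
```

Keeping the decision in one place means the write and comparison steps only need to check `write_to_db` and `compare_to_truth` rather than re-deriving them from the flags.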

The Code

Here’s a version of the code that worked on a single compute node, but with slight tweaks will work for the multi-node approach as well.

<div class="code-wrap"><code>import io
import logging
import multiprocessing as mp

import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text

# Placeholders in angle brackets must be filled in before running.
DB_URL = "<Database URL (RFC-1738 format)>"
NUM_PROCESSES = 8
RAW_DATA_TABLE_NAME = "<existing_postgres_raw_data_table>"
CLEANED_DATA_TABLE_NAME = "<postgres_cleaned_data_table_to_create>"
JOB_METADATA_TABLE_NAME = "<postgres_metadata_table_to_create>"
ZIP_CODE_TABLE_NAME = "<existing_postgres_zip_code_table>"
# Existing table whose schema the new cleaned-data table copies
SCHEMA_SOURCE_TABLE_NAME = "<existing_postgres_cleaned_data_table>"


def transform(data_chunk):
    # Stand-in name for <Python function that transforms data>
    raise NotImplementedError


def create_cleaned_data():
    engine = create_engine(DB_URL)
    # engine.begin() commits on success (engine.execute() was removed in SQLAlchemy 2.0)
    with engine.begin() as conn:
        # Hack to create an empty Postgres table with the schema from another table
        conn.execute(text('create table {} as select * from "{}" where 1 = 2'.format(
            CLEANED_DATA_TABLE_NAME, SCHEMA_SOURCE_TABLE_NAME)))
        # Create ETL Job Metadata table
        conn.execute(text(
            "create table {} ("
            "zipcode VARCHAR (50), "
            "status_message VARCHAR (50), "
            "exception_message VARCHAR (1000), "
            "timestamp TIMESTAMP WITH TIME ZONE)".format(JOB_METADATA_TABLE_NAME)))

    # Retrieve the list of zipcodes, then release the engine's connections
    query = "select * from {}".format(ZIP_CODE_TABLE_NAME)
    zipcodes = pd.read_sql(con=engine, sql=query)["ZIP_CODE"].values
    engine.dispose()

    # Parallelize mapping data per zip code (one zip code per task)
    with mp.Pool(processes=NUM_PROCESSES) as pool:
        results = [pool.apply_async(_clean_data, args=(zipcode,)) for zipcode in zipcodes]
        for result in results:
            result.get()  # block until done; re-raises any uncaught worker error


def _clean_data(zipcode):
    # Load data chunk from zip code to transform
    engine = create_engine(DB_URL)
    query = text('select * from {} where "ZIP_CODE" = :zipcode'.format(RAW_DATA_TABLE_NAME))
    data_chunk = pd.read_sql(con=engine, sql=query, params={"zipcode": str(zipcode)})
    engine.dispose()

    _add_audit_row(zipcode, "started")
    db_connection = psycopg2.connect(user="<postgres user>", host="<postgres host>",
                                     dbname="<postgres database name>",
                                     password="<postgres user password>")
    db_cursor = db_connection.cursor()
    str_buffer = io.StringIO()
    try:
        mapped_data_chunk = transform(data_chunk)
        # Write CSV to 'str_buffer' buffer instead of to a file on disk
        mapped_data_chunk.to_csv(str_buffer, sep='\t', header=False, index=False)
        str_buffer.seek(0)
        # psycopg2 cursor's copy_from() uses Postgres' COPY to bulk insert data
        db_cursor.copy_from(str_buffer, CLEANED_DATA_TABLE_NAME, null="")
        db_connection.commit()
        logging.info("cleaned & pushed a data chunk to db")
        _add_audit_row(zipcode, "finished")
    except Exception as e:
        db_connection.rollback()
        # Truncate to fit the VARCHAR (1000) column
        _add_audit_row(zipcode, "error", str(e)[:1000])
    finally:
        str_buffer.close()
        db_cursor.close()
        db_connection.close()


def _add_audit_row(zipcode, status_message="", exception_message=""):
    # Bound parameters keep quotes in exception text from breaking the INSERT
    sql = text(
        "INSERT INTO {} (zipcode, status_message, exception_message, timestamp) "
        "VALUES (:zipcode, :status, :exception, current_timestamp)".format(JOB_METADATA_TABLE_NAME))
    engine = create_engine(DB_URL)
    with engine.begin() as conn:
        conn.execute(sql, {"zipcode": str(zipcode), "status": status_message,
                           "exception": exception_message})
    engine.dispose()


def main():
    create_cleaned_data()


if __name__ == '__main__':
    main()</code></div>

I hope you’ve found our quick Pandas scaling adventure useful. Cheers!

- Ezzy

At Enigma we provide the content, tools, and expertise to empower organizations looking to make sense of the world through data. It’s an ambitious project, so we’re recruiting aggressively to find not only the smartest people in the world, but also those who are passionate about our mission. Join us—we’re hiring.

Containerizing Data Workflows (And How to Have the Best of Both Worlds)

Tian Xie — Wed, 10 Apr 2019 00:00:00 GMT

As a data technology company, Enigma moves around a lot of data, and one of our main differentiators is linking nodes of seemingly unrelated public data together into a cohesive graph. For example, we might link a corporate registration to a government contract, an OSHA violation, a building violation, etc. This means we not only work with lots of data, but lots of different data, where each dataset is a unique snowflake slightly different from the next.

Wrangling high quantities and varieties of data requires the right tools, and we’ve found the best results with Airflow and Docker. In this post, I’ll explain how we’re using these, a few of the problems we’ve run into, and how we came up with Yoshi, our handy workaround tool.

If you work in data technologies, you’ve probably heard of Airflow and Docker, but for those of you who need a quick introduction…

Introducing Airflow

Airflow is designed to simplify running a graph of dependent tasks. Suppose we have a process where:

There exist five tasks: A, B, C, D, and E, and all need to complete successfully.
B, C and E depend on the successful completion of A.
D depends on the successful completion of B and C.

Considering each task as a node and each dependency as an edge forms a directed acyclic graph—or DAG for short.
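To see the ordering this DAG implies, here is a small sketch using Python's standard-library `graphlib` module (Python 3.9+). This is only the graph-ordering piece, not how Airflow itself schedules work; grouping tasks into "waves" that could run in parallel is our own framing for the example.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
dependencies = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "E": {"A"},
    "D": {"B", "C"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark the whole wave as complete

print(waves)  # [['A'], ['B', 'C', 'E'], ['D']]
```

A runs first, then B, C, and E can run side by side, and D waits for B and C.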

If you are familiar with DAGs (or looked them up just now in “Cracking the Coding Interview”), you might think that if a DAG can be reasoned about within the time of a job interview, then it can’t be that complex, right? In production, these systems are much more complex than a single topological sort. Questions such as “how are DAGs started?”, “how is the state of each DAG saved?” and “how is the next node started?” are answered by Airflow, which has led to its widespread adoption.

Scaling Airflow

In order to understand how Docker is used, it’s important to first understand how Airflow scales. The simplest implementation of Airflow could live on a single machine where:

DAGs are expressed as python files stored on the file system.
Storage is written to SQLite.
A webserver process serves a web admin interface.
A scheduler process forks tasks (the nodes in the DAG) as separate worker processes.

Unfortunately, this system can only scale to the size of the machine. Eventually, as DAGs are added and more throughput is needed, the demands on the system will exceed the size of the machine. In this case, Airflow can expand to a distributed system.

The airflow webserver and scheduler continue running on the same master instance where DAG files are stored.
The scheduler connects to a database running on another machine to save state.
The scheduler connects to redis and uses celery to dispatch work to worker instances running on many worker machines.
Each worker machine can also run multiple airflow worker processes.

Now this system can scale to as many machines as you can afford,* solving the scaling problem! Unfortunately, switching to a distributed system generally exchanges scalability for infrastructural complexity—and that’s certainly the case here. Whereas it is easy to deploy code to one machine, it becomes exponentially harder to deploy to many machines (exponentially since that is the number of permutations of configuration that can go wrong).

If a distributed system is necessary, then it’s very likely that not only is the number of workers very high, but also the number of DAGs. A large variety of DAGs means a large variety of different sets of dependencies. Over time, updating every DAG to the latest version will become unmanageable and dependencies will diverge. There are systems for managing dependencies in your language of choice (e.g. virtualenv, rubygems, etc) and even systems for managing multiple versions of that language (e.g. pyenv, rbenv), but what if the dependency is at an even lower level? What if it depends on a different operating system?

Containerizing Workflows

Docker to the rescue!

Unless you have been living in a container (ha-ha) for the last five years, you’ve probably heard of containers. Docker is a system for building lightweight, self-contained environments (“images”) and running processes inside isolated instances of them (“containers”). It solves both of these problems by keeping dependencies in distinct containers and moving dependency installation from a deploy process into the build process for each image.

When the code for a DAG (henceforth, this set of code will be referred to as a “workflow”) is pushed to our remote git host and CI/CD system, it triggers a process to build an image.
An image is built with all of the dependencies for the workflow and pushed to a remote docker repository, making it accessible via URL.
At the same time, the airflow python DAG file is written. Rather than executing from the DAG directly, it specifies a command to execute in the docker image.
At run-time, airflow executes the DAG, thereby running a container for that image. This pulls the image from the docker repository, thereby pulling its dependencies.

Docker is not a perfect technology. It easily leads to docker-in-docker inception-holes and much has been written about its flaws, but nodes in a DAG are an ideal use-case. They are effectively enormous idempotent functions—code with input, output and no side-effects. They do not save state nor maintain long-lived connections to other services—two of the more frequently cited problems with Docker.

A Double-Edged Sword?

Docker exchanges loading dependencies at run-time for loading dependencies at build time. Once an image has been built, the dependencies are frozen. This is necessary to separate dependencies, but becomes an obstacle when multiple DAGs share the same dependency. When the same library upgrade needs to get delivered to multiple images, the only solution is to rebuild each image. Though it may sound far-fetched, this situation comes up all the time:

A change to an external API requires an update in all client applications.
A security flaw in a deeply nested dependency needs a patch.
DRY = “Don’t Repeat Yourself” is one of the central tenets of good software development, which should lead to shared libraries.

Code Injection

The double-edged sword endemic to Docker containers should sound familiar to anyone working with static builds. One common approach to solving this problem is to use plug-ins loaded at run-time. At Enigma, we developed a similar approach for Docker containers that we named Yoshi (hence, the Nintendo theme for this entire blog post).

As previously noted, when a workflow is pushed to our remote git repository and CI/CD system, it triggers an automated process to build an image for that workflow including installing all of its dependencies. Yoshi is a python package that is included as one of these dependencies and gets frozen on the image.
Since different workflows change at different rates, they go through the build process at different times and wind up with different versions. This is the nature of working with docker images.
Yoshi is also directly installed onto the machine where the airflow worker runs. The latest version is always installed on these machines.
At runtime, when the airflow worker executes the docker command, it mounts its local install of Yoshi onto the docker container. This injects the latest Yoshi code into that container, thereby updating Yoshi in the container to the latest version.

By keeping code we suspected might need to be updated frequently in Yoshi, keeping the interface to Yoshi stable and injecting the latest code at run-time, we are able to update code instantly across all workflows.
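The mount step can be pictured as the worker assembling a `docker run` command with a read-only volume that shadows the copy of Yoshi frozen into the image. The sketch below only builds the argument list and does not call Docker; the image URL, workflow command, and both paths are hypothetical and would depend on where the package is installed on the host and in the image.

```python
import shlex


def docker_run_command(image_url, workflow_command, host_yoshi_path, container_yoshi_path):
    """Build a `docker run` argv that mounts the host's Yoshi install over the image's copy."""
    return [
        "docker", "run", "--rm",
        # host_path:container_path:ro mounts the host directory read-only,
        # hiding whatever version was installed at image build time.
        "--volume", f"{host_yoshi_path}:{container_yoshi_path}:ro",
        image_url,
        *workflow_command,
    ]


# Hypothetical values, for illustration only.
command = docker_run_command(
    image_url="registry.example.com/workflows/example:abc123",
    workflow_command=["python", "-m", "workflow.run"],
    host_yoshi_path="/opt/yoshi",
    container_yoshi_path="/usr/lib/python3/site-packages/yoshi",
)
print(shlex.join(command))
```

Because the mount replaces the package directory wholesale, the injected version has to honor every interface that older workflow images still call.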

The Best of Both Worlds?

Injecting code at run-time allowed us to use all of the benefits of Docker containers, but also create exceptions when we needed. At first, this seemed like the best of both worlds, but over time we ran into flaws:

A stable interface and backwards compatibility are absolutely essential for allowing newer versions of a library to overwrite an older version, but that’s easier said than done. Maintaining compatibility across hundreds of workflows with different edge cases is even more challenging. Coming from working with containerized processes also required forming some new habits. No code is one-hundred-percent bug-free, but this led to many more bugs than we anticipated.
The most frequent use-case for Yoshi was for clients to access external resources. When external resources changed, Yoshi changed with them, which meant that older versions no longer worked. An image is expected to work forever, but the absence of the latest version of Yoshi broke that expectation.
Did I say that the most frequent use-case for Yoshi was for clients to access external resources? Turns out that was the only use-case. Initially, we expected to use Yoshi in many different ways, but wound up using it in the same place every time. This meant Yoshi was much larger and more complex than necessary and we only needed it in one node of the DAG.

Yoshi caused more bugs and complexity than we wanted, but by revealing where our code changed most frequently, it also revealed a simpler way to deploy updates across many DAGs.

Image Injection

Heretofore, images were built one-to-one for each DAG, but it does not have to be that way. Each workflow has its own set of dependencies, so an image is built for those dependencies, but each node in the DAG could use a different image. Additionally, Docker images are referenced by URL. The image stored for that URL can change without changing the URL. This means that a DAG node executing the same image URL could execute different images over time.

Eventually, this led us to inject code by inserting updated images in the middle of a DAG.

The Yoshi library remained the same, with all of the same functionality, except now it was also packaged and executable from its own docker image.
Workflows were changed so that individual DAG nodes could use different image URLs. Nodes where our code interacted with external resources now used the Yoshi image instead of the workflow image.
The URL for the Yoshi image was resolved at run-time with environment variables from the machine so that different environments could use different URLs - e.g. staging could use an image tagged as staging, and likewise for production.
When changes to the Yoshi library were pushed to our remote git repository, our CI/CD system built a new image and pushed it to the Docker repository at those URLs.
At run-time, the workflow pulls the latest Yoshi image.
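Resolving the image URL from the environment might look like the sketch below. The variable names `YOSHI_IMAGE_REPOSITORY` and `YOSHI_IMAGE_TAG`, the registry host, and the defaults are all assumptions made for this example, not the names we actually use.

```python
import os


def resolve_yoshi_image(environ=None):
    """Build the Yoshi image URL from environment variables on the worker machine."""
    env = os.environ if environ is None else environ
    # Hypothetical variable names and defaults
    repository = env.get("YOSHI_IMAGE_REPOSITORY", "registry.example.com/yoshi")
    tag = env.get("YOSHI_IMAGE_TAG", "staging")
    return f"{repository}:{tag}"


staging_url = resolve_yoshi_image({"YOSHI_IMAGE_TAG": "staging"})
production_url = resolve_yoshi_image({"YOSHI_IMAGE_TAG": "production"})
```

Since the tag is mutable, the same URL resolves to whatever image CI/CD pushed most recently, which is what lets an update reach every workflow without rebuilding their images.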

Image injection not only allowed us to build workarounds to the double-edged sword of static Docker images—without the compatibility challenges of code injection—but building a Yoshi image also opened new doors to run Yoshi utilities from a command-line anywhere and run a Yoshi service.

It took us a long time to get there, but our final solution allowed us to have the best of both worlds, and then some.

Game Over.

*There is a limit to the number of machines that can connect to the same redis host, but that is most likely a less pressing constraint than budget - especially for a start-up.

P-Hacking Recession Indicators

Caitlin Whalen — Tue, 12 Mar 2019 00:00:00 GMT

Every day in the media we read about an imminent economic downturn in the U.S. Depending on the article and the related data it references, the next recession sounds as though it could be mere months—if not minutes—away. Given this media focus, we went into Enigma’s Hack Week determined to find out whether we could more accurately predict a recession by looking at public data. After all, there were public data signals for the 2008/2009 recession, e.g., the data around unemployment, mortgages, housing prices, and so on. We asked ourselves: What could we be looking for in public data now that might predict the next recession, recognizing that the causes of recessions are not often repeated?

Within minutes it became clear that out-predicting a leading economist or think tank would be an impossible feat, but we decided to see whether we could find any public data that might at least correlate with, if not act as a leading indicator of, GDP contraction. Our intention was to have a little fun, but also illustrate just how easily conclusions can be drawn from spurious leading correlations. We looked at the most commonly tracked indicators (e.g., the yield curve, U.S. stock market performance, housing prices, unemployment rates) and compared those findings with some more, shall we say, obscure public data sources (e.g., avocado prices, cereal production, number of lawyers) to see if we could find a link. Thus began our p-hacking quest.

The Approach

Gather any/all time-series public data over the past 2 recessions (~ last 20 years)
Clean data to uniform format
Run Granger Causality Analysis across all datasets

Our approach was to p-hack our data to try to uncover any correlations between GDP (specifically GDP percentage change per quarter) and random public datasets that would offer any type of indicator. While p-hacking is widely shunned among data scientists, it proved to be an ideal approach to uncovering spurious leading correlations between GDP and random public datasets. Moonlighting as p-hackers over Hack Week illuminated just how easy it is to manipulate data analysis to fit a certain narrative or thesis.

In our case, looking for potential recession signals across a random assortment of both traditional and unconventional data sources yielded some interesting and some obvious findings. While we by no means hold our brief analysis on par with the many economists that spend decades predicting recession activity, our p-hacking revealed that housing sales, retail alcohol sales and lightweight truck sales can be seen as leading indicators for a recession. You can scroll down to view a few spotlight analyses below.

P-Hacking Highlights

Ratio of houses for sale versus number of houses sold (“monthly supply of houses”)

Granger Causality (P value) = nearly 0 (lag of 2,3,4,5)

Our analysis revealed a negative correlation between trends in the ratio of houses for sale to houses sold and GDP, providing a leading indicator for economic growth. Instances of sharp increases in the listed-to-sold ratio correlate negatively with GDP for up to three subsequent quarters. With a p-value of approximately 0.0001, the relationship is reasonably firm, at least by p-hacking standards. The monthly supply of houses has been steadily increasing since November 2017, perhaps a sign that the U.S. economic outlook isn’t great.

Lightweight truck sales (total per quarter)

Granger Causality (P value) = 0.0357 (lag of 3)

We observed a positive correlation between lightweight truck sales (e.g., Ford F-150s) and U.S. GDP. Our analysis indicates a strong association between truck sales and potential recessions, which makes sense as higher truck sales would seem to indicate optimism regarding overall U.S. economic health while slower truck sales might forecast concerns about near-term economic performance. Lightweight truck sales have been increasing annually since 2010, which may be a positive indicator for continued GDP growth.

Alcohol retail sales (beer, wine, liquor, seasonally adjusted total per quarter)

Granger Causality (P value) = 0.0001 (lag of 2)

We observed a significant positive correlation between the level of alcohol retail sales and GDP. Once again, the association seems intuitive, as alcohol is a luxury good for most people and consumption would seem to increase with stronger economic performance. (However, it would be perhaps equally if not more intuitive for alcohol sales to spike ahead of and during an economic downturn...)

Conclusion

Ultimately our P-Hack Week experience taught us that as we hear predictions of a forthcoming recession, all of these analyses should be taken with a grain of salt. It’s very easy to find correlations with GDP, but that doesn’t signify a meaningful connection. In the meantime, we’ll be closely monitoring things like alcohol and Ford F-150 sales :).
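To see how little it takes, here is a minimal Python sketch. It is not the code we ran during Hack Week. It correlates a made-up stand-in for quarterly GDP change with a few hundred series of pure noise, and every name, size, and seed in it is an assumption chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2019)  # fixed seed so the run is repeatable
N_QUARTERS = 40            # hypothetical: ten years of quarterly data
N_CANDIDATES = 200         # hypothetical: number of unrelated datasets tried

# A stand-in for quarterly GDP change, and candidates that are pure noise.
target = [rng.gauss(0, 1) for _ in range(N_QUARTERS)]
candidates = [
    [rng.gauss(0, 1) for _ in range(N_QUARTERS)] for _ in range(N_CANDIDATES)
]

correlations = [pearson(series, target) for series in candidates]
best = max(correlations, key=abs)
# |r| > 0.31 corresponds roughly to p < 0.05 for 40 observations.
hits = sum(1 for r in correlations if abs(r) > 0.31)

print(f"strongest correlation among noise series: {best:+.2f}")
print(f"series that would pass a naive p < 0.05 cutoff: {hits}")
```

None of the candidate series has anything to do with the target, yet with a 5% cutoff roughly one in twenty of them should be expected to clear the bar by chance alone.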

Project Notes

Some data was scraped using Python. Data was cleaned and normalized using Jupyter notebooks.
We used Python to run a Granger causality test for time series correlation.
Visualizations were built in Chart.js.

Exploring Company Footprints

Abe Rubenstein — Tue, 12 Feb 2019 00:00:00 GMT

Enigma has a massive repository of unique U.S. company data. For our internal Hack Week this year, our team decided to explore a new dimension of the data we hold—that is, its temporal dimension.

At Enigma, we have focused on identifying, validating, and presenting data about a company, and then showcasing it as a single cumulative snapshot. By focusing more heavily on the relative timestamps of company events, such as the year in which a company acquired an operating license, passed an inspection, or received funding, a clearer story of the company and the expectations we have of it emerges. One of the most telling measures of a company’s future success and profitability is the company’s evolution in size, measured in headcount.

Our team was excited to spend Hack Week representing company data over time. We also wanted to create a visualization of changes, knowing that a timelapse would be an interesting and effective way for a viewer to understand Company Footprints.

The scope

To narrow the scope of our project, we decided to focus on companies in the San Francisco Bay Area since 2001.

We did this for a few reasons. First, we knew we had extensive annual demographic data on companies in this area and were confident that with some work, we would be able to capture their story in depth. Second, we were interested in what we’d find in this region, knowing the Bay Area to be a highly populated, educated, and wealthy part of the United States—how have Bay Area companies formed and changed?

Finally, we decided to showcase companies in the technology and restaurant industries, specifically. By narrowing down to two industries, we would be able to plot the companies on a San Francisco Bay Area map.

The back-end tech

We used two distinct data sources to capture Company Footprints. One contained yearly snapshots of company properties at different points in time, e.g., “headcount” or “revenue” from 2001 to 2017. The other was our company graph, which contains information from multiple public datasets. References to individual companies have been resolved into a single company entity, with event-based attributes connected to those companies merged accordingly.

Our first step was to link the companies in the annual temporal growth data source to the companies in our company graph. We mapped each company name and location pair to its entity id in the company graph, and queried the graph for the relevant events. We then parsed out all of the annual headcount, annual revenue, and events from both data sources, combined them, and aggregated by company, location, and year.
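The linking and aggregation step can be sketched in Python roughly as follows. This is an illustration only, not our pipeline: the record layouts, company names, entity ids, event names, and the `build_footprints` helper are all hypothetical, and the sketch assumes one location per entity.

```python
from collections import defaultdict

# Hypothetical yearly snapshots: (company name, location, year, headcount, revenue)
snapshots = [
    ("Acme Robotics", "Oakland, CA", 2015, 40, 3_000_000),
    ("Acme Robotics", "Oakland, CA", 2016, 65, 5_500_000),
    ("Luna Cafe", "San Francisco, CA", 2016, 12, 800_000),
]

# Hypothetical lookup from (name, location) to an entity id in the graph.
entity_ids = {
    ("Acme Robotics", "Oakland, CA"): "ent-001",
    ("Luna Cafe", "San Francisco, CA"): "ent-002",
}

# Hypothetical events returned by the graph: (entity id, year, event type)
graph_events = [
    ("ent-001", 2016, "received_funding"),
    ("ent-002", 2016, "passed_inspection"),
    ("ent-002", 2016, "acquired_license"),
]


def build_footprints(snapshots, entity_ids, graph_events):
    """Combine both sources into one record per (entity, location, year)."""
    footprints = defaultdict(
        lambda: {"headcount": None, "revenue": None, "events": []}
    )
    location_of = {}
    for name, location, year, headcount, revenue in snapshots:
        entity = entity_ids.get((name, location))
        if entity is None:
            continue  # no match in the graph; skip rather than guess
        location_of[entity] = location
        record = footprints[(entity, location, year)]
        record["headcount"] = headcount
        record["revenue"] = revenue
    for entity, year, event in graph_events:
        if entity in location_of:
            footprints[(entity, location_of[entity], year)]["events"].append(event)
    return dict(footprints)


footprints = build_footprints(snapshots, entity_ids, graph_events)
```

Keying everything on (entity, location, year) means each frame of a timelapse can be produced by filtering on the year alone.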

Incorporating historic context

Because we were focused on how companies change over time, we thought that placing them in the context of macro socio-political events would further emphasize evolution as the theme of the project. We researched key socio-political events—both specific to the Bay Area as well as the U.S. overall—and included the ones we found to be most relevant in the timelapse alongside company headcount changes.

The events we show range from changes in Health and Safety regulations to Presidential elections to headline-making IPOs. We’re not suggesting these events directly caused any company births, changes, or deaths, but we wanted to offer viewers of our project some interesting reminders of the changes in the outside world.

The front-end tech

The front-end itself is a simple JavaScript application that leverages just two libraries (Mapbox GL JS and Lodash), and the static portions of the front-end were created with HTML and CSS.

Mapbox takes care of the interactive map, and provides the impetus for updating the custom HTML map overlays whenever the map instance fires an event (e.g. click, drag, zoom). Upon page load, two map layers are created for markers and marker labels, respectively. These layers are reused throughout the animation. As the animation plays, new data in the form of static GeoJSON files is loaded into each layer.

The design

To craft an expressive user experience, we highlighted the two distinct industries using a visual design based on blending colors. When viewed together, the map layers for each industry emphasize the interrelatedness of these businesses over the last two decades.

We included a playable timeline, which allows the user to watch the changes unfold on the map, with data displayed on a yearly basis. We also developed a simple interface for filtering businesses by headcount and revenue to give users more control over the information (which is pretty dense). This offers a way to further declutter the map beyond just turning the map layers on and off.

Conclusion

Hack Week is short, and teams are always left with a million ideas on how they might, with more time, continue building their projects. Company Footprints is no exception.

We have plans to add more analysis to spot trends based on external historic context, as well as compare company footprints within industries, or segments like neighborhoods. We’re excited by the possibilities of a tool like this, and what else we might be able to uncover.

Dig Company Footprints and want to work with a data-driven team? Join us—we’re hiring.

Insights from Four Years of the Forbes Fintech 50 List

Madeline Ross — Fri, 08 Feb 2019 00:00:00 GMT

This week, we were honored to be included in the Forbes Fintech 50 list for the second year. We believe that data helps us better understand the world around us, so we decided to take a closer look at the four most recent years of the Forbes Fintech 50 list to see what trends the data reveals about how the fintech space has changed from 2015 to today. Play with the graphic, or read on below, to learn what we discovered.

Note: No list was produced in 2017.

Geography: The Bay Area is on top, but New York’s star is rising

New York is the center of the financial industry, but for Forbes the Bay Area reigns supreme for finance tech. San Francisco and the surrounding area were home to more than half of the companies named in every year except 2018. However, the tide may be shifting: from 2015 to 2019, New York’s share grew from 18% to just under a quarter, mirroring the city’s rise as a major center of innovation.

Company category: A blockchain bubble and waning interest in lending

What Fintech sub-category has remained strong on the list since 2015? Personal Finance and Investing, which includes companies like Acorns, Motif, and Betterment, represented 20% or more of each year’s Fintech 50 class. We can also see the rise and drop of interest in blockchain mirrored in a 2018 spike, and the gradual decline of buzz around lending companies from 2015 to 2019. Companies focused on Wall Street and enterprise technology (like Enigma) represent 20% of the list in 2019 down from 32% in 2015, possibly due to the high level of M&A and consolidation in recent years.

Funding: Up, up and up

From 2015 to 2019, funding received by companies on Forbes’ list has skyrocketed, with the median funding per company more than doubling from $61.5 million to $148 million. Total funding rose to almost $11 billion in 2019, up from a little over $6 billion in 2015. Some of this may be attributed to the rise of more mature companies on the list, as 2019 saw an increasing number of later-stage start-ups.

Looking ahead: Finding value in data

At first, it seemed like an error of omission: there was no “data” category on the 2019 list.

Then we realized that was because across all categories virtually every company on the list is built on data. While data is foundational for success in 2019, it’s only the beginning. We believe that in the future, context will be king.

As a data and technology player, how is Enigma different? We’re connecting data from thousands of sources to build a model of the global economy, transforming data from rows and columns into meaningful intelligence for some of the world’s leading organizations.

Are you interested in solving uniquely challenging and complex problems? Explore career opportunities at Enigma and help us make sense of the world through data.

Government Shutdown 2019

Enigma — Wed, 02 Jan 2019 00:00:00 GMT

During the U.S. government shutdown of 2018-2019, Enigma tracked the impact of the shutdown on federal salaries, employees, and government services.

Total estimated delayed salary

Federal government employees

Furloughed employees by department

Government services affected

SLUSH 2018: Real-World Data Solving Real-World Problems

Enigma — Wed, 05 Dec 2018 00:00:00 GMT

Enigma leverages real-world data to solve big social issues that affect society today. Armed with versatile case examples and more than eight years of experience in data analytics, Enigma CEO Hicham Oudghiri steps on stage with Jason Karaian from Quartz to open up the world of data, for better and for worse.

Why We’re Joining the Fight Against Human Trafficking

Enigma — Tue, 04 Dec 2018 00:00:00 GMT

Our mission is to empower people to interpret and improve the world around them.

We take this to heart—each quarter we dedicate a portion of the company’s time and resources toward projects aimed at making real social impact. We call them Data for Social Good. In the past, we’ve produced a number of DfSG projects that serve the public, such as Smoke Signals—an open source tool that helps communities determine which homes are at the highest risk of not having a smoke alarm.

For the past few months, our focus has been on human trafficking and the ways in which data and technology can help end the global crisis. Public data, in particular, can provide valuable insights into trafficking trends.

Polaris—a nonprofit that works to combat and prevent modern slavery—estimates human trafficking enslaves 40.3 million people worldwide, with more than 10,000 victims last year in the U.S.

It’s a crisis as well as big business, raking in $150 billion worldwide each year for traffickers. This money flows through the global economy, including through American banks. Financial institutions in the U.S. are committed to addressing this crime and have anti-trafficking strategies in place, but cross-institution communication is challenging and it’s difficult to identify the criminals and share knowledge across the industry. As a result, human trafficking continues to thrive.

Enigma is working on a resource to make this kind of collaboration between financial institutions easier. Partnering with ACAMS and Polaris, as well as other organizations in the space, Enigma is founding Standing Together Against Trafficking (STAT). STAT is a technology platform aimed at helping financial institutions stay up-to-date and informed about trafficking indicators. STAT will help facilitate and enrich the sharing of typologies by organizing and mapping them for financial crime professionals. This will be an opportunity for financial institutions to not only be in communication with one another, but also alert each other to discoveries in real time.

We’re excited by the potential of this project to help move the dial in the fight against trafficking.

As we work on the platform, we also wanted to share some of the research and guidelines we’ve come across while preparing for this project. We discovered that many anti-trafficking resources are scattered across organizations and difficult to find in one place. Since STAT will be behind a firewall (in order to keep information secure), we pulled together these resources to make the information we found both easily discoverable and digestible.

Today, we’re excited to launch a resource microsite for financial services professionals fighting human trafficking. We hope this will serve as a central hub for those seeking guidance on anti-trafficking efforts, and move us all one step closer towards ending modern slavery.

Want to use data for good? We’re growing our team—join us.

Data Driven NYC: Leveraging Knowledge Graphs to Understand the World Around Us

Enigma — Tue, 27 Nov 2018 00:00:00 GMT

Enigma co-founder and CEO Hicham Oudghiri speaks at Data Driven NYC in November 2018 on how businesses are underutilizing public data and how leveraging knowledge graphs can provide unparalleled insight.

Things I Wish I'd Known About Spark When I Started (One Year Later Edition)

Jeremy Krinsley — Thu, 08 Nov 2018 00:00:00 GMT

About 12 months ago, we made a decision to move our entity resolution pipeline into the Scala/Spark universe. This was not without its pain points. This was our first major push as a company to productize entity resolution prototypes that had been in development for pretty much as long as the company has existed. It was also the first time our team had worked with either Scala or Spark.

Looking back over the year, there are dozens of "learning moments" that I would love to ship via wormhole to my former self.

In case the opportunity arises, here’s the transmission:

Know What You Shuffle

Shuffle is the transportation of data between workers across a Spark cluster's network. It's central for operations where a reorganization of data is required, referred to as wide dependencies (See Wide vs Narrow Dependencies). This kind of operation can quickly become the bottleneck of your Spark application. To use Spark well, you need to know what you shuffle, and for this it's essential that you know your data.

Skew Causes Bad Shuffles

Skew is an imbalance in the distribution of your data. If you fail to account for how your data is distributed, you may find that Spark naively places an overwhelming majority of rows on one executor, and a fraction on all the rest. This is skew, and it will kill your application, whether by causing out of memory errors, network timeouts, or exponentially long running processes that will never terminate.

Partition on Well-Distributed Columns

A powerful way to control Spark shuffles is to partition your data intelligently. Partitioning on the right column (or set of columns) helps to balance the amount of data that has to be mapped across the cluster network in order to perform actions. Partitioning on a unique ID is generally a good strategy, but don’t partition on sparsely filled columns or columns that over-represent particular values.

Beware the Default Partition

It's absolutely essential to model the number of partitions around the kinds of problems you’re solving, rather than simply accepting Spark's default of 200 shuffle partitions (spark.sql.shuffle.partitions). In the stage of our application where we run parallel transformations on many heterogeneously sized datasets at once, 200 partitions works just about fine.

When we are dealing with billions of pairwise comparisons, we have found that partitions in the range of 4-10k work most efficiently.

Furthermore, if you run tests on a single server (or locally), you may see dramatic speed improvements by repartitioning data down to a single partition. We recently squashed a particularly curious bug where our end-to-end test ran fine on our local 8 or 16 core machines, but would fail to ever complete on the 2-core server on which we run our CI. Combining the data down to 1 partition solved our issue.

Drive Your Jobs Into Overdrive with .par

While you can depend on Spark to do a lot of parallel heavy lifting, you can push your jobs even harder with thoughtful use of Scala's built-in .par functionality, which can operate on iterables. The initial steps of our ER pipeline involve reading in dozens of heterogeneous datasets and applying shared transformation pipelines to each of them. A simple datasets.par.foreach cut our run times in half.

Of course, you can only rely on its usage for aspects of your pipeline that are completely deterministic and provide no risk of a race condition. Overzealous usage of .par can quickly result in mysteriously disappearing or overwritten data.

Joins Are Highly Flammable

Joins are by far the biggest shuffle offender, and the dangers of SQL joins are amplified by the scale Spark enables. Even joining medium-sized data can cause an explosion if there are repeated join values on both sides of your join. This is something that we at Enigma have to be particularly wary of, since 'unique' public data keys may turn a couple-million-row join into a billion-row join!
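One way to see the blow-up coming is to count the output rows before running the join. The following is a plain-Python sketch rather than Spark code; the `inner_join_row_count` helper and the sample keys are hypothetical.

```python
from collections import Counter


def inner_join_row_count(left_keys, right_keys):
    """Rows an inner equi-join would produce, without performing the join."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on left) * (occurrences on right).
    return sum(count * right_counts[key] for key, count in left_counts.items())


# Hypothetical keys: mostly unique, plus one placeholder repeated on both sides.
left = [f"id-{i}" for i in range(1_000)] + ["UNKNOWN"] * 1_000
right = [f"id-{i}" for i in range(1_000)] + ["UNKNOWN"] * 1_000

# 1,000 unique matches + 1,000 * 1,000 placeholder matches = 1,001,000 rows
print(inner_join_row_count(left, right))
```

Two 2,000-row inputs produce just over a million rows, almost all of them from a single repeated key. The same per-key counts can be computed in Spark with a group-by on each side before committing to the join.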

If there is a chance your join columns have null values, you are in danger of massive skew. A great solution to this problem is to "salt" your nulls. This essentially means pre-filling arbitrary values (like uuids) into empty cells prior to running a join.
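The effect of salting can be simulated without a cluster. The sketch below is plain Python, not Spark: `partition_of` is a hypothetical stand-in for a hash partitioner, and the partition count, seed, and key values are assumptions.

```python
import random
import uuid
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of shuffle partitions


def partition_of(key):
    """Stable hash partitioner standing in for Spark's shuffle."""
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_PARTITIONS


def salt_nulls(keys, rng):
    """Replace each None with a unique throwaway value that matches nothing."""
    return [
        key if key is not None else f"salt-{uuid.UUID(int=rng.getrandbits(128))}"
        for key in keys
    ]


rng = random.Random(7)  # seeded so the sketch is repeatable
keys = [f"id-{i}" for i in range(200)] + [None] * 1_000

before = Counter(partition_of(k) for k in keys)
after = Counter(partition_of(k) for k in salt_nulls(keys, rng))

print("largest partition before salting:", max(before.values()))
print("largest partition after salting: ", max(after.values()))
```

Before salting, every null hashes to the same partition; afterwards the rows spread out across all of them. Because each salt value is unique, salted rows still match nothing in an equi-join, which is the same outcome null keys would have had.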

Is Your Data Real Yet?

Operations in Spark are divided between transformations and actions. Transformations are lazy operations that allow Spark to optimize your query under the hood. They set up a DataFrame for changes (like adding a column, or joining it to another) but do not execute those plans until an action is performed. This can lead to surprising results. For instance, it's important to remember that the output of a UDF has no materialized value until an action runs. Imagine creating an id column using Spark's built-in monotonically_increasing_id, and then trying to join on that column. If you do not place an action (such as checkpointing) between the generation of those ids and the join, the values have not been materialized, and the result will be non-deterministic!

Checkpointing Is Your Friend

Checkpointing is basically the process of saving data to disk and reloading it back in, which would be redundant anywhere else besides Spark. It both triggers an action on any waiting transformations and truncates the Spark query plan for that object. Not only will this action show up in your Spark UI (thus indicating where exactly you are in your job), it will also help you avoid re-triggering latent UDF actions in your DAG and conserve resources, since it can potentially allow you to release memory that would otherwise be cached for downstream access. In our experience, checkpointed data is also a valuable source for data-debugging forensics and repurposing. The training data for our pipeline, for instance, is filtered out from a 500 million row table generated halfway through our application.

Sanity Check Your Runtime With Monitoring

The Spark UI is your friend, and so are monitoring tools like Ganglia that let you know how your run is going in real-time. Yarn's depiction of the Spark query plan can instantly communicate whether your intentions align with your execution. Is something that is supposed to be one join actually a cascade of many small joins?

The Spark UI also contains information at the job level, the stage level, and the executor level. This means you can quickly see whether the number/volume of data going to each partition or to each executor makes sense, and whether any part of your job is supposed to be 10% of the data but is taking 90% of the time. Monitoring tools that let you view your total memory and CPU usage across executors are essential for resource planning and autopsies on failed jobs.

When we first started using Spark, we used standalone clusters on Yarn and Amazon's EMRFS. We learned the hard way that gathering Spark logs is a non-trivial task. We are happy to now use Databricks, which handles the essential matter of log aggregation for us, but if you are spinning up your own solution, a log aggregation tool like Kibana is probably essential for introspection sanity.

Error Messages Don't Mean What They Say

It took quite a while to get used to the fact that Spark complains about one thing, when the problem is really somewhere else.

"Connection reset by peer" often implies you have skewed data and one particular worker has run out of memory.
“java.net.SocketTimeoutException: Write timed out” might mean you have set your number of partitions too high, and the filesystem is too slow at handling the number of simultaneous writes Spark is attempting to execute.
"Total size of serialized results... is bigger than spark.driver.maxResultSize" could mean you’ve set your number of partitions too high and results can’t fit onto a particular worker.
“Column x is not a member of table y”: You ran half your pipeline just to discover this SQL join error. Front-load your run-time execution with validation to avoid having to reverse engineer these errors.
Sometimes you will get a real out of memory error, but the forensic work will be to understand why: Yes, you can increase the size of your individual workers to make this problem disappear, but before you do that, you should always ask yourself, "is the data well distributed?"

Scala/Spark CSV Reading Is Brittle

Coming from Python, it was a surprise to learn that naively reading CSVs in Scala/Spark often results in silent escape-character errors. The scenario: You have a CSV and naively read it into Spark:

val df = spark.read.option("header", "true").csv("quote-happy.csv")

Your DataFrame seems happy—no runtime exceptions, and you can execute operations on the DataFrame. But after careful debugging of your columns, you realize that at some point in the data, literally everything has shifted over one or several columns. It turns out that to be safe, you need to include .option("escape", "\"") in your reads.
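The failure mode is easy to reproduce outside Spark. In this Python sketch, a naive comma split stands in for any reader that does not treat a doubled quote (`""`) as an escaped quote, and the standard library's csv module shows the intended parse. The file contents are made up for illustration.

```python
import csv
import io

# Hypothetical file contents: the second field contains an escaped quote ("")
# and an embedded comma.
raw = 'id,comment,state\n1,"He said ""hi"", then left",NY\n'

naive_rows = [line.split(",") for line in raw.splitlines()]
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows[1]), naive_rows[1])    # 4 fields: everything shifted right
print(len(parsed_rows[1]), parsed_rows[1])  # 3 fields, matching the header

# A cheap guard: count rows whose width disagrees with the header.
header_width = len(parsed_rows[0])
shifted = [row for row in naive_rows[1:] if len(row) != header_width]
print("rows with shifted columns:", len(shifted))
```

Comparing each row's field count against the header, as the last lines do, is a quick way to catch shifted columns before they propagate through a pipeline.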

Better suggestion: Use Parquet!

Parquet Is Your Friend

The open-source file format is designed to offer read and write operations an order of magnitude more efficient than uncompressed CSVs.

Parquet is "columnar" in that it is designed to only select data from those columns specified in, say, a Spark SQL query, and skip over those that are not requested. Furthermore, it implements "predicate pushdown" operations on SQL-like filtering operations that efficiently run queries on only relevant subsets of the values in a given column. Switching from uncompressed tabular file formats to Parquet is one of the most fundamental things you can do to improve Spark performance.

If you are responsible for generating Parquet from another format (say you are using PyArrow and Pandas for some large-scale migration), be conscious that simply creating a single Parquet file gives up a major benefit of the format.

Conclusion

And there you have it, a loose assemblage of suggestions, cobbled together from a year of using Spark. Here’s hoping my future self has already found that wormhole and is sending me the year two edition as you’re reading this.

Interested in joining the team? Enigma is hiring!

Money 20/20: Empathy in the Machine Age

Enigma — Tue, 23 Oct 2018 00:00:00 GMT

Enigma CEO and Co-Founder Hicham Oudghiri sat down with Senior Managing Director & Global CMO of BlackRock Frank Cooper III at Money2020 (2018).

Can machines make services more human? Are ethics better captured through algorithms or subjective human behavior? Enigma CEO and co-founder Hicham Oudghiri sat down with Senior Managing Director & Global CMO of BlackRock Frank Cooper III to take a deep dive into the potential of harnessing machines to deliver deeper empathy and creativity in humans. Frank and Hicham discuss how machine learning and advanced data science are key to providing new insights, improving marketing effectiveness and solving the most pressing challenges facing our world today.

Integrating Autogenerated Content Into Your Documentation Site Using Swagger and Jekyll

Peter Henderson — Wed, 17 Oct 2018 00:00:00 GMT

Integrating autogenerated API documentation with code samples, diagrams, links to related content, or contextual information that doesn't fit within a docstring is a challenge. This article describes how we do this at Enigma using Swagger (OpenAPI) and Jekyll.

Jekyll is one of the most popular open source static site generators, but it takes a lot of work to turn an out-of-the-box Jekyll installation into a platform capable of hosting your product and API documentation. At Enigma, we adopted Jekyll over a year ago and have been using it along with Tom Johnson’s Documentation Theme for Jekyll—an open source framework that sits on top of Jekyll and provides many of the features technical writers consider essential, like support for multiple products, swappable navigation sidebars, etc. It has proven to be a capable and adaptable platform that has allowed us to maintain multiple documentation sites, both internal and external, along with version support, all from a single source.

One key extension we've implemented is a way to integrate our autogenerated Swagger (OpenAPI) documentation with our handwritten documentation (setup instructions, tutorials, etc.). This post explains how this integration works and offers a step-by-step implementation guide.

Before we get started, if you'd like to see an example of documentation that combines autogenerated content with handwritten content, here's a page from the Enigma Public API documentation that incorporates parameter and response information pulled from a Swagger file. Additionally, here's a page from the Enigma Public Python SDK documentation that uses the same autogenerated Swagger content with SDK-specific language. A single Jekyll page with some conditional logic generates both of these HTML pages.

About Swagger

Swagger is the de facto way to describe REST APIs, so chances are if you have a REST API then you also have a Swagger spec, or your developers can provide one without too much trouble. The spec defines all the API's endpoints and parameters, and the responses returned by the API. Generating a new Swagger file is typically part of the build process, so when a developer adds a new endpoint or a new query parameter, the Swagger JSON or YAML file is updated automatically.

If the source code includes a description with every endpoint method and parameter, then the Swagger file displayed within a frontend like Swagger UI may be the only API documentation you need. Frequently, though, you want to provide more than just bare bones autogenerated API documentation. If you want to include, say, code samples in multiple languages, diagrams, links to related content, or contextual information that doesn't fit neatly within a docstring, you'll need some way to combine the autogenerated content with handwritten content.

If you're familiar with Swagger, you've probably run into the Petstore example before. I use it here to demonstrate the Swagger-Jekyll integration. It's good because it has a variety of GETs, POSTs, etc. and uses both body and formData parameters on the PUTs and POSTs. The Petstore spec unfortunately doesn't have many descriptions, but that's fine—the point here is to demonstrate how the integration works.

All of the code referenced in this article and a full working demo are available on GitHub:

https://github.com/peterhend/documentation-theme-jekyll is a fork of Tom Johnson's Doc Theme repo to which I've added a complete implementation of the Petstore API docs generated from the spec.
https://peterhend.github.io/documentation-theme-jekyll/api_post_pet.html is the site running on GitHub pages.

Getting started

The key piece of Jekyll functionality you'll use is the Liquid templating language, specifically its ability to read data from JSON and YAML files. These files must be located in your Jekyll project's _data directory. The first step, therefore, is to put your Swagger file in the _data directory, or a subdirectory of it (see petstore.yml).

Next, you need some code to parse the file and extract the information you want. Since you don't want to repeat the same or similar code for each endpoint, I put the code in an "include" file that can be included in the doc pages as needed. I created four separate Swagger parsers in _includes/swagger_parsers: one each for models, parameters, and responses, plus a generic parser you can use to extract other endpoint attributes, like the method description. The last one is the most straightforward, so let's look at an example that uses that one first.

In the Swagger file, the spec for POST /pet looks like this:

<div class="code-wrap"><code>paths:
  /pet:
    post:
      tags:
      - "pet"
      summary: "Add a new pet to the store"
      description: ""
      operationId: "addPet"
      consumes:
      - "application/json"
      - "application/xml"
      produces:
      - "application/xml"
      - "application/json"
      parameters:
      - in: "body"
        name: "body"
        description: "Pet object that needs to be added to the store"
        required: true
        schema:
          $ref: "#/definitions/Pet"
      responses:
        405:
          description: "Invalid input"
      security:
      - petstore_auth:
        - "write:pets"
        - "read:pets"</code></div>

If you want to extract, say, the summary attribute for POST /pet, the full path to that attribute is:

<div class="code-wrap"><code>site.data.swagger.petstore.paths./pet.post.summary</code></div>

The site.data.swagger portion references the Jekyll project's _data/swagger directory, petstore is the name of the Swagger file in that directory, and the remainder is the path through the YAML tree to the attribute you want.

To include this directly in your docs, you need to first assign /pet to a variable, because you can't include the slash (/) character in paths. You can then put the full path inside double braces, referencing the variable:

<div class="code-wrap"><code>{% assign path = "/pet" %} {{ site.data.swagger.petstore.paths[path].post.summary }}</code></div>

In this case, the code resolves to "Add a new pet to the store". However, since we want code that isn't bound to a specific file, endpoint, method, or attribute, we'll do the following:

Define the file, endpoint, and method in the host page’s front matter.
Pass the attribute name as an “include” variable.

The host page Markdown therefore looks like this:

<div class="code-wrap"><code>---
title: POST /pet
sidebar: mydoc_sidebar
permalink: api_post_pet.html
swaggerfile: petstore
swaggerkey: /pet
method: post
---

## Description

{% include swagger_parsers/getattribute.md attribute="summary" %}</code></div>

And the "include" file looks like this:

<div class="code-wrap"><code>{{ site.data.swagger[page.swaggerfile].paths[page.swaggerkey][page.method][include.attribute] }}</code></div>

Where:

page.swaggerfile, page.swaggerkey, and page.method reference the values specified in the page's front matter
include.attribute references the attribute name passed in via the include statement

The finished getattribute.md adds logic to support a type variable you can pass in to indicate if the attribute is a list, rather than a single value (for example, in the consumes and produces sections of the YAML example above).
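For readers less familiar with Liquid, the same lookup can be sketched in Python. This is an analogy, not part of the Jekyll implementation: `site_data`, `page`, and `get_attribute` are hypothetical stand-ins, and detecting lists with isinstance is an assumption that replaces the type variable used by the real include.

```python
# Hypothetical stand-in for Jekyll's site.data: _data/swagger/petstore.yml
# already parsed into nested dictionaries.
site_data = {
    "swagger": {
        "petstore": {
            "paths": {
                "/pet": {
                    "post": {
                        "summary": "Add a new pet to the store",
                        "consumes": ["application/json", "application/xml"],
                    }
                }
            }
        }
    }
}

# Values that would come from the host page's front matter.
page = {"swaggerfile": "petstore", "swaggerkey": "/pet", "method": "post"}


def get_attribute(site_data, page, attribute):
    """Walk file -> paths -> endpoint -> method -> attribute."""
    swagger_file = site_data["swagger"][page["swaggerfile"]]
    operation = swagger_file["paths"][page["swaggerkey"]][page["method"]]
    value = operation.get(attribute, "")
    # Lists such as "consumes" are joined for display; single values pass through.
    return ", ".join(value) if isinstance(value, list) else value


print(get_attribute(site_data, page, "summary"))
print(get_attribute(site_data, page, "consumes"))
```

Because the file, endpoint, and method all come from `page`, the same function (like the same include) serves every endpoint page without modification.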

Reading parameters

Each API endpoint method typically supports some combination of path parameters, query parameters, and body parameters. Where there are multiple parameters of a given type, it's typical to display these in a table format, displaying the name, type, description, and "required" status for each (see the POST /pet/{petId} body parameters for an example).

Here's the parameters section of the POST /pet/{petId} method:

<div class="code-wrap"><code>paths:
  ...
  /pet/{petId}:
    ...
    post:
      ...
      parameters:
      - name: "petId"
        in: "path"
        description: "ID of pet that needs to be updated"
        required: true
        type: "integer"
        format: "int64"
      - name: "name"
        in: "formData"
        description: "Updated name of the pet"
        required: false
        type: "string"
      - name: "status"
        in: "formData"
        description: "Updated status of the pet"
        required: false
        type: "string"</code></div>

It includes one path parameter and two formData parameters, each of which has a name, type, description, and required attribute.

The code to parse this is a little more complicated, as you must loop through the parameters portion of the file and build an HTML table as you go, but it uses the same techniques introduced in the previous section. First, here's the host page Markdown, which "includes" three instances of getparams.md, with a different paramtype for each (you can see the rendered version at POST /pet/{petId}):

<div class="code-wrap"><code>---

title: POST /pet/{petId}

sidebar: mydoc_sidebar

permalink: api_post_pet_id.html

swaggerfile: petstore

swaggerkey: /pet/{petId}

method: post

---

## Path parameters

{% include swagger_parsers/getparams.md paramtype="path" %}

## Query parameters

{% include swagger_parsers/getparams.md paramtype="query" %}

## Body parameters

{% include swagger_parsers/getparams.md paramtype="formData" %}</code></div>

And here's a simplified version of the getparams.md "include" file:

{% assign parameters = site.data.swagger[page.swaggerfile].paths[page.swaggerkey][page.method].parameters %}

<table>

<thead>

<tr>

<th>Name</th><th>Type</th><th>Description</th><th>Required?</th>

</tr>

</thead>

{% for parameter in parameters %}

{% if parameter.in == include.paramtype %}

<tr>

<td><code>{{ parameter.name }}</code></td>

<td><code>{{ parameter.type }}</code></td>

<td>{{ parameter.description }}</td>

<td><code>{{ parameter.required }}</code></td>

</tr>

{% endif %}

{% endfor %}

</table>

As before, page.swaggerfile, page.swaggerkey, and page.method reference the values specified in the page's front matter. This time, include.paramtype references the parameter type passed in as a variable. The code loops through each parameter, checks to see if it's the parameter type we're looking for, and if it is, creates a table row with the details.

The full version of getparams.md includes code specifically to handle body parameters, since these are defined quite differently and reference the resource model definitions. It also includes code to check whether there are parameters of the specified type before it draws the table head, and handles enums by reading and displaying the allowed values.
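For readers more comfortable outside Liquid, the filter-and-render logic can be sketched in Python. This is an illustration under stated assumptions, not the contents of getparams.md: `render_params` is a hypothetical function, the parameter list is the parsed form of the YAML shown earlier, and body-parameter and enum handling are left out.

```python
from html import escape

# Parameters as they would appear after parsing the YAML shown earlier.
parameters = [
    {"name": "petId", "in": "path", "description": "ID of pet that needs to be updated",
     "required": True, "type": "integer"},
    {"name": "name", "in": "formData", "description": "Updated name of the pet",
     "required": False, "type": "string"},
    {"name": "status", "in": "formData", "description": "Updated status of the pet",
     "required": False, "type": "string"},
]


def render_params(parameters, paramtype):
    """Return an HTML table for parameters whose `in` matches paramtype, or ''."""
    rows = [p for p in parameters if p["in"] == paramtype]
    if not rows:
        # Like the full include: draw no table head when nothing matches.
        return ""
    lines = [
        "<table>",
        "<thead><tr><th>Name</th><th>Type</th><th>Description</th><th>Required?</th></tr></thead>",
    ]
    for p in rows:
        lines.append(
            "<tr><td><code>{}</code></td><td><code>{}</code></td>"
            "<td>{}</td><td><code>{}</code></td></tr>".format(
                escape(p["name"]), escape(p["type"]),
                escape(p["description"]), str(p["required"]).lower(),
            )
        )
    lines.append("</table>")
    return "\n".join(lines)


print(render_params(parameters, "formData"))
```

Calling `render_params(parameters, "query")` returns an empty string, so a host page can include the query section unconditionally without producing an empty table.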

Weaving in your own content

The get_pet_findbystatus.md page shows how, with the Swagger integration in place, you can add whatever content you want around it. I added an Examples section with examples for Python, Postman, and cURL, using the tab code provided in the Theme for Jekyll documentation. You can add whatever Markdown, HTML, or JavaScript you want anywhere in the host page. If you're ambitious, you can even add a live "Try it Out" feature, as we did in the Enigma Public API docs.

Conclusion

Swagger is great—there's no better way to autogenerate API docs for your REST APIs, but what you get from the Swagger spec alone is often not enough to fully support users of your API. I've demo'd one way to integrate the autogenerated docs with your handwritten docs, but see Integrating Swagger UI with the rest of your docs for other approaches you might consider.

Enigma Advances Strategy to Connect Enterprise Data to Real-World Data with $95M in Additional Funds

Enigma — Tue, 18 Sep 2018 00:00:00 GMT

Company Raises $95 Million in Funding Led by NEA, Joined by New Investors BB&T, MetLife, Capital One Growth Ventures, Third Point and Glynn Capital

New York, NY – September 18, 2018 – Enigma, a Data-as-a-Service company, today announced $95 million in financing to expand its network and platform that connects real-world and enterprise data to power key workflows. Enigma’s platform enables organizations to both utilize the data they have and leverage signals from real-world data. The funding, raised since Enigma’s last public funding announcement in June 2015, was led by return investor NEA and includes new investments from strategic FinTech investors BB&T, Capital One Growth Ventures, MetLife and Third Point along with leading venture firm Glynn Capital. Early investors Comcast Ventures, Crosslink Capital, Two Sigma Ventures, and the Partnership Fund for NYC also participated.

Enigma was founded on the thesis that incorporating signal from real-world and public data improves how enterprises operate. To enable this, Enigma has built a vast knowledge bank of people, places and companies, grounded in public data, to create a unique map of the world for individuals and enterprises. Building data-rich workflows for financial risk management, compliance, pharmaceuticals and sales and marketing use cases has driven the majority of the company’s efforts to date, while the new funding will accelerate its focus on risk, procurement and fraud. Enigma’s platform helps the company's partners in the Fortune 500, including BB&T, EMD Millipore, Celgene and Merck, better serve and engage their customers through new, contextualized workflows.

“Most of today’s innovative data work is rooted in understanding activity on the internet to support e-commerce and online marketing use cases,” said Enigma co-founder and Chief Executive Officer Hicham Oudghiri. “The biggest impact from data will come from understanding real-world connections and bringing these connections to bear on everyday decisions. This is the work we’re pioneering at Enigma, and this new capital allows us to double down on expanding our collection of real-world data, new technology offerings and a world-class team, which has fueled our customer retention and expansion to date.”

Enigma will use this investment to expand the company’s core offerings, focusing on broadening the reach of its knowledge graph technology. Enigma’s knowledge graphs, powered by public data and machine learning, structure data into connected insights and are the vehicle through which intelligence is delivered to clients. The company also plans to expand its integrated solutions into new verticals and build on its past successes in the Financial Services, Insurance and Life Sciences industries. To do this, Enigma will invest in growth by continuing to hire top-tier talent and opening the company’s first satellite office in order to enhance data acquisition and expand its data integration and delivery capabilities.

The strategic investments from recently launched venture arms at BB&T and MetLife mark inaugural investments by these financial institutions into FinTech, one of Enigma’s core verticals. Enigma’s success in this space is driven by its focus on building a high-quality, organized data collection and rigorous standardization methodology, helping financial institutions validate and continuously enrich the information they hold in order to combat financial crime, reduce low value and wasted work, and consistently put customers first. As financial institutions move faster to meet and exceed client expectations, companies like Enigma play a crucial role in continually enhancing the connections and intelligence these organizations count on when making the daily decisions that matter most.

“We are committed to continuing our digital transformation by investing in companies, like Enigma, that harness the power of advanced technology and top-tier talent,” said BB&T Chief Digital Officer W. Bennett Bradley. “Enigma’s real-world data assets and expertise, along with their robust machine-learning capabilities, made them an obvious choice for our first fintech investment. We believe there is more potential for Enigma to enhance other data-driven processes within the company for the ultimate benefit of our clients and shareholders.”

“As the inaugural investment for MetLife Digital Ventures, our partnership with Enigma marks a major milestone in our effort to promote a culture of innovation at MetLife and further solidifies our commitment to being a leader in digital transformation across the insurance industry,” said Marty Lippert, executive vice president and head of MetLife Global Technology & Operations and MetLife Holdings. “Enigma’s technology and machine learning capabilities have the potential to radically shift the insurance value chain by providing trustworthy intelligence to empower smarter, faster decision making.”

“With its use of real-world data, Enigma's unique and innovative approach to surfacing insights can be invaluable in helping enterprises understand their business processes and their customers more deeply,” said Aman Sharma, Partner at Capital One Growth Ventures. “The breadth of data Enigma brings to bear, coupled with its machine learning prowess, makes it uniquely positioned to uncover new strategic opportunities and create enhanced customer experiences across a range of enterprises and industries.”

“Enigma's expertise in data has helped uncover new ways to combine disparate adverse event data sources for use in our drug safety operations," said Ed Mingle, Executive Director and Head of Global Safety Operations at Celgene. "Enigma's machine-learning approach continues to reveal patterns in how duplicates emerge in Adverse Event reporting systems. Having this capability improves on overall data quality and reduces the potential consequences of working with duplicate data.”

“Our first impression of Enigma was that the company had the potential to rewrite the playbook for working with data—in the enterprise and at large,” said Scott Sandell, Managing General Partner at NEA. “Three years later, we’re seeing that potential realized. We’re extremely excited about Enigma’s position as a global data pioneer, driven by the company’s ability to unify and provide signal from a vast array of data at scale.”

About Enigma Enigma, a New York-based Data-as-a-Service company, transforms disparate, tabular data into rich representations of real-world relationships, providing a trusted source of intelligence about people, places and companies. From evaluating insurance risk to combating money laundering, Enigma connects and enriches clients’ internal data assets to transform their strategies and workflows. Leading organizations, including BB&T, EMD Millipore, Celgene and Merck depend on Enigma to power the daily decisions that matter. Enigma was the winner of TechCrunch Disrupt NY 2013 and a graduate of the 2014 FinTech Innovation Lab.

The Journey Towards a Knowledge Graph of Public Data

Jarrod Parker — Mon, 17 Sep 2018 00:00:00 GMT

Before I joined Enigma, I was skeptical of the company’s mission to make public data more accessible (I’ve since changed my tune). I wasn’t sure I bought the problem they were trying to solve—after all, the very name “public” data makes it sound like it’s free and easy to acquire. I wondered what the core challenge really was.

It wasn’t until I saw some of Enigma’s work (check out this video from when Enigma won TechCrunch Disrupt NY 2013) that I realized what the company was actually building: A way to connect vast amounts of data to provide a more granular picture of how the world operates.

Enigma’s aim to increase the accessibility of public, or real world, data has remained central to the company’s overall mission as we continue to build a unified base of knowledge of people, places, and companies. Enigma Public is the latest evolution of our free public data platform, which brings together thousands of data sources into a single searchable database.

In the seven years since our founding, we’ve continued to promote the accessibility of public data and provide Enigma Public as a resource—but we have also increasingly been hard at work creating capabilities to standardize, link and query data, to find deeper insights.

From Rows to Entities: Answering Increasingly Complex Questions

Enigma Public: Tables and rows, indexed in a curated taxonomy.
Knowledge Graph: Entities and relations defined by an ontology, linked and indexed in a graph database.

We shifted our focus from tables and rows to entities and relations. Why?

We wanted to answer increasingly complex questions like, “how many company references are in a dataset?”

Our initial efforts involved building some column name heuristics to collect all the columns that appeared to be company names and running the equivalent of select count(*) on them. With this method, we missed out on ambiguously named columns like “name,” but at least it gave us an estimate. Even if this naive approach was 100% correct, though, it would still lack the information to answer a simple follow up question: Can we also get a list of all companies and associated locations?
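To make the limitation concrete, here is a rough Python sketch of what such a column name heuristic could look like. The hint list, function names, and sample table are all hypothetical and are not Enigma's implementation:

```python
import re

# Hypothetical name fragments that suggest a company column.
COMPANY_HINTS = re.compile(r"(company|employer|business|vendor|org)", re.IGNORECASE)


def company_columns(columns):
    """Return the column names that look like company names."""
    return [c for c in columns if COMPANY_HINTS.search(c)]


def count_company_references(table):
    """Count non-empty values in company-like columns (a rough select count(*))."""
    cols = company_columns(table["columns"])
    return sum(1 for row in table["rows"] for c in cols if row.get(c))


table = {
    "columns": ["name", "company_name", "city"],
    "rows": [
        {"name": "Acme Corp", "company_name": "Acme Holdings", "city": "Austin"},
        {"name": "Globex", "company_name": "", "city": "Boston"},
    ],
}

print(company_columns(table["columns"]))
print(count_company_references(table))
```

The ambiguously named "name" column is skipped even though it holds company names, and nothing in the count says which city belongs to which company.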

We asked how we could know which locations were associated with a company—what are the relationships between the columns? We realized that knowing only the row co-occurrence of a company and location doesn’t necessarily mean the two are related in a direct way.

We are not the creators of the data we’re using, therefore we have no control over the schema or the degree to which the data is normalized. To help solve this problem, we use an ontology, which creates a shared vocabulary for your data. To read more about ontologies and what they mean for operationalizing data, check out my previous Semantic Data + Ontologies post.

A Disconnected Graph: Establishing Identities

After we sufficiently annotate our data with ontology mappings, we are left with a collection of very disconnected entities. We need to identify which entities are coreferent. This is where entity resolution comes into play.

Notice how the word “entity” is the suffix of “identity”? id + entity.

Entity resolution—regardless of implementation or accuracy—is simply picking the correct ID for entity references in data. Entity resolution and identity are intrinsically tied.

Even when implementing decent approaches to entity disambiguation, there are still edge cases. Purely statistical approaches are often only as good as the term-pair specificity found in their prior. In other words, just because a data point is unique doesn’t mean it’s identifiable. Even if it were, there is sometimes still not enough information to know for sure.

A data scientist knows how rare it is to have 100% confidence, and entity resolution is no exception.

We’ve found that entity references in public data are Pareto distributed. This means that approximately 80% of the records are accounted for by 20% of the entities, which makes sense intuitively: the bigger and more popular a company is, the more datasets it appears in.

Taking this into account, simply defining rules for well-known entities and their associated properties can effectively link more than 80% of the records with relatively minimal scientific effort (without regard to the engineering complexity).
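A rule-based pass of this kind can be sketched in a few lines of Python. The alias table, entity IDs, and sample records below are invented for illustration; a real rule set would be far larger and would also use associated properties such as addresses:

```python
import re

# Hypothetical hand-written rules: normalized alias -> canonical entity ID.
KNOWN_ENTITIES = {
    "tesla": "ENT-TESLA",
    "tesla inc": "ENT-TESLA",
    "tesla motors": "ENT-TESLA",
    "toyota": "ENT-TOYOTA",
    "toyota motor corp": "ENT-TOYOTA",
}


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def resolve(name):
    """Return the canonical ID for a known entity, or None if no rule applies."""
    return KNOWN_ENTITIES.get(normalize(name))


records = ["Tesla, Inc.", "TESLA MOTORS", "Toyota Motor Corp.", "Joe's Garage", "tesla"]
resolved = [resolve(r) for r in records]
coverage = sum(r is not None for r in resolved) / len(records)
print(resolved, coverage)
```

The unresolved remainder (here, the one small business) is the long tail where statistical matching and its edge cases come back into play.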

A Connected Graph: Building Our Map

Resolving entities connects our otherwise disconnected graph into an asset of knowledge that is as rich as our ability to acquire new data and indicate its meaning. If we go back to our initial driving goal of being able to answer progressively complex questions of our data, we have to ask ourselves, does this really help solve that problem?

Let’s walk through an example of a complex question that a knowledge graph can easily answer:

Of the subsidiaries of Tesla, which facilities have OSHA violations and manage the release of a carcinogen? Are any of them within 50 miles of my home town?

If you wanted to answer this question in a bespoke manner, you might find some SEC subsidiary data, realize there are no addresses associated with the names, and then spend time searching and collecting addresses—perhaps using the OSHA establishment search page—and associated people into a spreadsheet.

To figure out which facilities manage the release of a carcinogen, you could then peruse the EPA Toxic Release Inventory to find the latitude and longitude of each location (which you’d then have to look up via something like Google Maps).

If you managed to walk through all of these steps—and didn’t make any mistakes—you may have finally found the answer to your question.

Even if you’re an experienced data sleuth, this process is cumbersome. And, if you then wanted to get the same answer for, say, Toyota, you’d have to do it all again manually.

With a knowledge graph, this question is a relatively simple graph traversal (paraphrasing).

<div class="code-wrap"><code>// Assuming JanusGraph's Geoshape.circle(latitude, longitude, radius in km); 50 miles is about 80.5 km
g.V().has("name", "Tesla").out("subsidiaries").and(
    out("osha_violation"),
    out("toxic_release").has("carcinogen", true)
).out("facility").has("point", geoWithin(Geoshape.circle(40.7128, -74.0060, 80.5)))</code></div>
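For readers who don't use a graph database, the shape of that traversal can be approximated in plain Python over a toy in-memory graph. Everything here is hypothetical: the subsidiaries, facilities, and coordinates are invented, and for simplicity the violation and release flags are attached directly to facilities.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical toy graph: company -> subsidiaries, subsidiary -> facilities.
SUBSIDIARIES = {"Tesla": ["Sub A", "Sub B"]}
FACILITIES = {
    "Sub A": [{"name": "Plant 1", "lat": 40.73, "lon": -74.17,
               "osha_violation": True, "carcinogen_release": True}],
    "Sub B": [{"name": "Plant 2", "lat": 37.55, "lon": -121.99,
               "osha_violation": True, "carcinogen_release": True},
              {"name": "Plant 3", "lat": 40.70, "lon": -74.00,
               "osha_violation": False, "carcinogen_release": True}],
}


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))


def risky_facilities_near(company, lat, lon, radius_miles=50):
    """Facilities of the company's subsidiaries with both flags set, within the radius."""
    return [
        f["name"]
        for sub in SUBSIDIARIES.get(company, [])
        for f in FACILITIES.get(sub, [])
        if f["osha_violation"] and f["carcinogen_release"]
        and miles_between(lat, lon, f["lat"], f["lon"]) <= radius_miles
    ]


print(risky_facilities_near("Tesla", 40.7128, -74.0060))
print(risky_facilities_near("Toyota", 40.7128, -74.0060))
```

Once the data is linked, asking the same question about a different company is a change of argument rather than a repeat of the manual search.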

At Enigma, what we’re doing is several orders of magnitude greater than this pattern of work. We’ve created a data linking workflow that results in a graph of knowledge. This is the combination of a few key components:

A data asset containing broad and deep information with hard-to-find and hard-to-maintain public data.
A pipeline process that simplifies the work of acquiring, standardizing, and linking new data.
An ontology mapping layer that stores our interpretation of the data.
A generic entity resolution process that increases in accuracy as more data is added.
Search and discovery capabilities using match prediction and a graph database.

By creating a knowledge graph of public data, we’re increasing the likelihood of providing new insight. We’re reducing the time it takes to answer complex questions: those that once took hours, days, or weeks to answer through complicated, non-reproducible data operations can now be answered in mere moments. We’re also making it possible to answer questions that may not have been considered answerable before. The opportunities, it seems, are endless.

Enigma’s Garden Model for ETL Tooling

Alden Golab — Mon, 13 Aug 2018 00:00:00 GMT

As a company rooted in public data, Enigma’s data engineering use case is the collection of small, heterogeneous, messy datasets rather than streaming real-time data. Instead of asking how to scale vertically to handle large data volume, our central question is one of horizontal scale: how can we acquire more public datasets, of various sizes, quickly, accurately, and stably?

We believe generalized and reliable tooling for data ingestion is the answer; however, the process of getting to our current solution has been one of fits and starts. Over several iterations, we have arrived at a system that is:

Applied: rooted in real-world application;
Collaborative: open to internal contributions from anyone at Enigma; and
Agnostic: flexible enough to encompass multiple approaches and open source systems.

How we got here

Enigma has experimented with different models for how to do data ingestion since our founding in 2011. About four years ago, our team developed Parsekit, a proprietary Python toolkit built by data engineers for fast ETL pipeline authoring. Designed to be simple to use and self-documenting, Parsekit employed a YAML configuration file that, once fed into the system, generated a fully operational pipeline easily scheduled to run on a server resourced with several workers. It also had some added flexibility, allowing users to insert ‘custom steps’ into the processor for edge cases.

By the time I joined Enigma a year ago, however, two problems with Parsekit had become clear.

First, authors were generally using only a handful of the standard Parsekit steps provided by the library; instead, all except the most basic users wound up shoehorning ‘custom code’ into the Parsekit pipeline at some point in the ingestion work. As the amount of custom code grew, maintenance became a nightmare.

Second, the Parsekit platform was meant to be like a tank: slow, steady, difficult to break, and highly abstracted from the well-engineered internals. But when it broke, it really broke.

Normally this would be fine, perhaps even desirable, but a contribution to the codebase was often a slog, both process-wise and because of the system’s engineered complexity. It was easier to just work around the problem at hand rather than try to solve the underlying issue, particularly when operating under tight timelines and client agreements for the delivery of data. Moreover, because of the way it was engineered, the system operated slowly: after a million rows, Parsekit would take hours, sometimes even days, to process a dataset.

Parsekit, despite coming from a desire to make ETL faster and more maintainable, wound up with an over-engineered and inaccessible codebase; it was too complex and too abstract, with code development too far from day-to-day use. It became a bad fit for the data problems we face, but we had locked ourselves out of using alternative methods.

Kirby: Planting a Garden

We have two big needs for the tooling that will replace Parsekit:

Users must be able to contribute to the system while simultaneously delivering data under time constraints; and
Users should be able to use the libraries and techniques best suited for the task at hand, including any number of open source technologies, rather than fitting into a single mode of work.

These aren’t entirely engineering concerns; they are, essentially, process problems. We require tooling that reflects the unique process needs of our organization. It’s not enough to provide maintenance, efficiency, and stability guarantees.

From these two requirements, it became clear that the solution was to fully decouple the underlying orchestration from the code being executed. This freed us up to make our library less prescriptive and more of a garden of implementations, if you will: a user can go in and pick the functional implementations they desire, group them however they wish, and place them into (separately provided) cookware for execution. More on this in another blog post.

Thus, Kirby was born. Composed of well-tested functions, not steps or pipelines, Kirby operates as a kind of ETL buffet for users with clear contracts and small, totally orthogonal pieces, making contribution easy.
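To give a feel for what "functions, not steps or pipelines" can mean in practice, here is a small Python sketch. These functions are hypothetical examples written for this post and are not Kirby's API; the point is only that each one has a narrow contract and no knowledge of the others or of any orchestrator.

```python
# Hypothetical examples of small, orthogonal functions; these are not Kirby's API.

def strip_whitespace(rows):
    """Trim leading and trailing whitespace from every string value."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows]


def drop_empty_rows(rows):
    """Remove rows in which every value is empty or None."""
    return [row for row in rows if any(v not in ("", None) for v in row.values())]


def rename_columns(rows, mapping):
    """Rename columns according to mapping; other columns pass through unchanged."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


# The pipeline author picks and orders the functions; no framework is implied.
raw = [
    {"org_nm": "  Acme Corp ", "st": "TX"},
    {"org_nm": "", "st": ""},
]
clean = rename_columns(drop_empty_rows(strip_whitespace(raw)), {"org_nm": "company_name"})
print(clean)
```

Because each function takes rows in and returns rows out, a contributor can add or test one without touching the rest.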

Kirby has three major qualities that we think will make it a long-lasting solution to our ETL use case:

It is Applied: contributions are only made on an as-required basis; implementations must be directly tied to an engineer’s needs to be accepted. No what-if’s, just what’s needed. This quickly yields a set of commonly needed functional themes.

It is also Collaborative: anyone in the organization can contribute directly to the code base if they wish to. By ‘enigma-sourcing’ the toolkit, we prevent the code from being inaccessible to new engineers while simultaneously reducing the desire to over-engineer the system. It also forces good documentation habits.

It is Agnostic: any library and technique, so long as it can be implemented via Python, is acceptable and will work. This allows us to take advantage of a variety of open source systems, from pandas to PySpark to dask and others, depending on the needs of the dataset being ingested.

Future State: Distributed Gardening

Since we began development, Kirby has vastly accelerated our team’s pace. In tying the development directly to the needs of our data ingestion backlog, we’ve quickly arrived at a fairly stable set of common, highly orthogonal implementations that are used across pipelines to acquire vastly different datasets.

Nevertheless, we’ve found that by improving our tooling we can only achieve linear acceleration with our small team of data engineers. We might get three or even five times our current speed of ingestion, but there is no ‘hockey stick’ growth here. No matter how much of the ETL work we abstract out into shared implementations, our data are just too messy and too unpredictable for us to achieve the kind of horizontal scale we want.

Since Kirby’s genesis was process driven, it is fitting then that the next step for us is to further adapt our processes after bringing our development approach in line. In the past few months, we’ve created the foundations necessary to be able to grow our data engineering organization horizontally to ingest all the data we need: creating a remote-friendly work culture and exploring the possibility of tapping additional markets outside of New York for talent. Call it ‘distributed gardening’.

Data 101: Semantic Data + Ontologies

Enigma — Thu, 26 Jul 2018 00:00:00 GMT

What is an ontology?

An ontology is a way in which to describe the world. From one perspective, language is an ontology; a set of labels to give meaning to real world things.

But if you don't speak the same language as another person, your communication will be reduced to less descriptive forms, like “talking with your hands.” You might be able to convey simple ideas, but as tasks become more complex, ambiguities become more common. Is that hand signal the number two, a rabbit, or the peace sign?

These ambiguities are a major part of why we find it amusing to play games like Pictionary or charades. We interpret the information given and fill in the gaps using context clues or our sense of humor and imagination. In a gameplay setting, it may be amusing to misinterpret that silly pose of a friend, or a poorly drawn horse. However, when collaborating to solve a complex problem, these constraints wreak havoc on efficient operations, especially when there is little coordination between parties. The path towards many failures is paved with ambiguities, misunderstandings, and inconsistent representations of data.

An ontology solves this problem by creating a shared vocabulary through which you can describe the semantics of your data and build applications.₁ By making your applications depend on an ontology as opposed to raw data columns, you are creating an abstraction that enables the flexible re-use of your applications and your data across different data sources and use cases.

When it comes to data, why do ontologies matter?

We are living in a world of information overload, and it's easier than ever to create information (sometimes its creation is even mandated by law). How do we best make use of this information? If I'm searching through multiple data sources, each from different creators, how can I be sure that columns in one dataset correspond to columns in another? You could make a standards guide to ensure everyone is creating data with consistent descriptive metadata, but people are still prone to typos and other errors.

The dutiful secretary or analyst—logging data in a spreadsheet—is likely to name columns in ways they find meaningful to others. However, they may also try to save themselves a few keystrokes and drop letters from column names. This could lead to something like "org_nm", which really means "organization_name" to them—or was it "organism_name"? Is that organization like a company or a chess club? How do I ensure that when one spreadsheet has a column named "org_nm" it means the same thing as another spreadsheet's "company_name"? Are those names accurate?

This matters significantly when you’re trying to make use of multiple datasets to piece together a more complete picture of the world. It may not seem like a big deal on a handful of datasets, but when it takes 50 to 100 or more datasets to get a complete picture and the datasets change drastically over time, it demands a more robust solution than solely a human in the loop.

So, how can businesses make use of ontologies?

Entity Disambiguation:

When you are sitting atop thousands of datasets from many different sources—like Enigma is—you have to start to ask yourself questions like:

Where are all the companies in the data?
Of those companies, what are their addresses?
Is that the mailing address or the headquarters address?
How do we know this, and how can we know this automatically?

By simply labelling the columns in the datasets and their explicit relations to other columns, we can take a—not perfect, but still epic—leap into answering these more semantically rich queries that span N datasets and disambiguating references to entities that are the same type.
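As a minimal sketch of what labelling buys you, the Python below maps columns from two differently named datasets onto shared ontology terms and then queries by term rather than by column. The term names, datasets, and `values_for` helper are hypothetical:

```python
# Hypothetical ontology labels: (dataset, column) -> ontology term.
COLUMN_LABELS = {
    ("licenses", "org_nm"): "Company.name",
    ("licenses", "mail_addr"): "Company.mailing_address",
    ("inspections", "company_name"): "Company.name",
    ("inspections", "hq_address"): "Company.headquarters_address",
}

DATASETS = {
    "licenses": [{"org_nm": "Acme Corp", "mail_addr": "PO Box 1, Austin, TX"}],
    "inspections": [{"company_name": "Globex", "hq_address": "5 Main St, Boston, MA"}],
}


def values_for(term):
    """Collect every value labeled with the given ontology term, across datasets."""
    found = []
    for (dataset, column), label in COLUMN_LABELS.items():
        if label == term:
            found.extend((dataset, row[column]) for row in DATASETS[dataset] if column in row)
    return found


print(values_for("Company.name"))
print(values_for("Company.headquarters_address"))
```

The query never mentions "org_nm" or "company_name", and the distinction between mailing and headquarters addresses lives in the labels instead of in someone's memory.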

Pharmacovigilance:

When one organization is publishing results or services that are expected to be used by others, it is important that others know precisely what is meant when a service refers to Sleep Disorders, and which disorders that includes. If an individual reports they’ve experienced a rash on their foot as a result of a new medication and another individual reports they’ve experienced excessive dry skin, how does a rash relate to dry skin? What is the classification of the drug? Is it a Foot Cream or is it a Proton Pump Inhibitor?

When a regulation is issued for a specific category of drugs, how do we know my company’s drug is actually under regulation now? As far as I know, this is only feasible through consistent classification of drugs, side effects, medical procedures and other entities. Luckily, there has been a working group maintaining a freely accessible medical ontology called MedDRA since around 1999.

Synthetic Bad Data:

When an organization needs to build software that operates on data it might not be able to view, there needs to be a way to “mock” the data. One could write out a bunch of fake datasets and run tests to ensure the application works as expected amidst this particular fake data. This works, but when you’re building many applications that each require some degree of fake data and all your applications need to handle a certain category of bad data, it becomes cumbersome to maintain.

One way to alleviate this is to use your ontology to generate datasets in a semantically consistent manner, abiding by the relationships and types defined in your ontology. You can mimic the kinds of issues you might see in the data by defining a few different categories of bad data as a kind of noise function in your data generating process. What happens to my application when there are 10% null values? What about when there are non alpha-numerics in weird places? What about different representations of the same company name? When we start throwing things like “company” into the mix, or any indication of something that’s not just a transformation on a primitive data type, we need to know what we’re talking about.

Additionally, the way a company string can be bad is very different than how a location string can be bad. By using an ontology to link your applications, you are able to have multiple processes independently contributing new capabilities on a per entity basis. As folks discover new ways that raw data can be dirty, those learnings can help make a synthetic bad data generating system more ontologically aware, further empowering any applications that would rely on that entity’s cleanliness.₂
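The noise-function idea can be sketched in Python as follows. The function names, the specific kinds of noise, and the 10% rate are illustrative assumptions; an ontology-aware generator would choose the noise per entity type, as with the company-specific variants here.

```python
import random


def with_nulls(values, rate, rng):
    """Replace roughly `rate` of the values with None."""
    return [None if rng.random() < rate else v for v in values]


def company_variants(name):
    """Hypothetical company-specific noise: different spellings of the same name."""
    base = name.replace(" Inc", "")
    return [name, name.upper(), base + ", Inc.", base + " Incorporated"]


def with_stray_symbols(value, rng):
    """Insert one non-alphanumeric character at a random position."""
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice("#@*~") + value[pos:]


rng = random.Random(0)  # seeded so a failing test can be reproduced
names = ["Acme Inc"] * 1000
noisy = with_nulls(names, 0.10, rng)
print(sum(v is None for v in noisy) / len(noisy))
print(company_variants("Acme Inc"))
print(with_stray_symbols("Acme Inc", rng))
```

A location type would get its own variants function (abbreviated street types, missing postal codes, and so on) while sharing the generic null and symbol noise.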

Big picture: what do ontologies/ontology management mean for operationalizing data?

“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.” - Sir Tim Berners-Lee

When creating or using data in an organization, enforcing a consistent vocabulary allows for serendipitous innovations to occur that may not have been fathomable before the data was linked by its vocabulary. You can start to ask questions of the data that were previously not answerable, and the time it takes to answer these questions reduces significantly.

Creating ontologies, mapping them to datasets and building ontology-driven applications are ways to prevent miscommunications stemming from schema inconsistencies. They also allow for re-usable applications that operate on entities and their relations, instead of specific rows and columns. Your ontology mapping and definitions become queryable metadata that allow for enterprise-wide inventories of applications and which entities they depend on.

Making sense of data is more important today than ever. Keep an eye on our blog for the next post in this series on how to make an ontology based application.

Endnotes:

1. Semantic Web:

Sir Tim Berners Lee, the inventor of the World Wide Web, mentioned Linked Open Data as the next frontier of the web. Known as the Semantic Web movement, it has been gaining momentum since the early days of the Internet. Like an ontology, the Internet was started by using a consistent vocabulary—in this case, a protocol—which allowed for a web browser on your computer to render web pages and know what to do when you click a link.

Here are some useful links to learn more about the semantic web and how ontology-linked data is already being used on The Internet:

2. Synthetic Data:

There are also open source tools that can help with generating fake data today; however, they do not include the ability to fake bad data. You can also go a surprisingly long way with a recurrent neural network:

An Interview with Gil Shklarski, CTO of Flatiron Health

Kelvin Chan — Thu, 21 Jun 2018 00:00:00 GMT

Building an Effective Engineering Culture, From Start up to Acquisition

Enigma: We’re excited to speak with Gil Shklarski today to learn a bit about what makes him tick. To start off, how did you get to where you are today?

Gil: I joined Flatiron Health in 2012 as the second engineer. Literally, the first interviews I had conducted with other [prospective] engineers were done while I was still at Facebook, even running them on my Facebook laptop after hours in my home in Seattle.

We grew very slowly in the beginning--it was just me and a very small tech team. I was setting up security and IT, which a CTO does at a small company. As we continued growing, and after we acquired another company [Altos Solutions], I started focusing on building an effective engineering team and culture--building the organization. It became less about me contributing to the tech.

Previously, I worked at Facebook for a couple of years, where everything is high-scale; I worked on data pipelines for fraud detection and for identifying “bad actors” on the Facebook website. When was the last time you clicked on a bad link in Facebook that led you to malware? Probably never. That’s due to the data collection, and automatic analysis and response mechanisms, that my former team was building.

My first role in the U.S. was at Microsoft, which I joined after working on my PhD in Israel. The funny thing about my PhD was realizing that as my work became more specialized, it became interesting to fewer and fewer people. Whereas at Microsoft, I would write a small piece of code that would go out to 25 million people instantaneously. So that was an exciting transition for me.

Lastly, before my PhD, I spent 10 years in government with the Israeli Defense Forces (IDF).

Enigma: You have an interesting background. You’ve been vocal in the past about growing engineering and, several years ago, gave a talk on “Engineering Ladders as a Cultural Manifesto”. Can you describe the title’s meaning, and whether your thinking has changed as you scale and grow your team?

Gil: Initially, my goal was that engineering would never be the bottleneck for creativity in how we build a business.

But before I get into what I mean by that, let me define “Engineering Culture,” which I took from Kevin Scott [CTO of Microsoft]. It can be defined along three axes: 1) “how we build technology,” 2) “how we operate technology,” and 3) “how we function as a team.”

When I think of using ladders as that manifesto for culture, I think of it as how we ultimately encode our craft. How do we signal what we care about? How do we define what “good” is? How do we measure engineering excellence?

An effective ladder can formalize behaviors that leadership wants to promote and be accountable for. So we asked ourselves, “How do we encode how we want engineers to solve problems?” And this is not something that I think is specific to engineers, but rather something important to all major functions.

Enigma: Flatiron has grown considerably and you’ve had the opportunity to flex those concepts as you’ve scaled the engineering department. Has Flatiron’s manifesto changed as it’s grown?

Gil: Our core values have not changed; we still care about providing feedback, being kind, and getting our hands dirty. But the way we demonstrate those values did change.

When we had fewer than 30 engineers, it was all about optimizing for iterating towards an MVP (minimum viable product). At 130 engineers, we need to think about scaling platforms. At both stages, there are different behaviors we want to promote.

For our more junior engineers, the ladder change was fairly limited. The majority of changes were with our mid-to-senior engineers. What does it mean to be a senior engineer? What's the role of a tech lead and manager? What does a director mean? How do our tenured engineers help guide and mentor our more junior engineers?

Enigma: One of my favorite dimensions of your engineering ladder is your “GSD” (“Get Shit Done”) metric. How does this metric drive a successful engineering culture?

Gil: We actually took the GSD metric from engineering ladders at Google, Facebook, and RentTheRunway.

Four years ago, GSD was pure and simple. It wasn’t about sustainable execution. It was about iteration. My focus as CTO was purely on short term business impact, iterating quickly on MVPs in many of our business lines and being able to test ideas quickly.

As we grew, we needed to be a bit more nuanced about GSD - it became more about ensuring the quality of execution--strong technical execution. It now explicitly encapsulates technical product management skills (embodying our “work on problems that matter” value) and stakeholder management (ensuring others can actually work off of what you’ve built).

Enigma: When did the need for this shift in GSD from MVP iteration to scalability become apparent?

Gil: The shift from startup to scaling was apparent from our business requirements given the increased scope and complexity of our products. The change in the ladder was a bit latent. We had multiple signals that called for this shift. At the 1:1 level, it was our engineers saying that they need guidance for their career - How can they grow? Where should they grow?

Another signal was in our performance review process; we have committees to calibrate reviews and promotions. We realized many of those conversations were becoming ambiguous. Why should someone be promoted over another person?

This encouraged us to re-evaluate how we encode behaviors. What does GSD mean today at our stage? We actually call it now “Technical Product Delivery”...

Enigma: As a result, has morale or quality of happiness changed? How do people feel about the shift?

Gil: It’s recent, so we’ll see. Ultimately, it's about giving engineers and the team their due credit.

Enigma: Looking back, what do you feel has made Flatiron uniquely successful in implementing this culture?

Gil: Ultimately, there’s this recognition that we engineer towards business results. And startups take such huge business risks, that we need to build technology to support those risks. As we continue to move from startup to maturity, we can take more technical risks.

Looking back, we were thoughtful about our mentoring capacity. We were thoughtful about our team building. We were thoughtful about how we'd hire and bring more engineers in. That's the bottom line.

Engineering was the earliest, most established team at Flatiron. We brought in people that came from Google, Amazon, AppNexus, etc. We were standing on the shoulders of giants in learning from their cultures. So engineering became the incubator for much of our company culture today. We were the first to do performance reviews, first to pilot career conversations, manager trainings. Even our interview structure was based on those at Microsoft and Facebook, and soon became the basis of how interviews are done across the company.

The other realization, which is not a natural tendency for those outside of engineering, is that our goal is basically…to be opportunistically lazy. You want to automate yourself out. It is a key for being more efficient and it is the key to continuing to work on new, more difficult problems. This is how you advance your career. So we hired and helped craft our culture to promote this “make yourself redundant” approach.

Enigma: You started your career in the IDF, then Microsoft, then Facebook. How have each of those experiences impacted how you manage Flatiron today?

Gil: At the IDF, I learned a lot about entrepreneurship. Elite tech units in the IDF are much more like startups than big companies. The atmosphere there is very entrepreneurial. And while there, I learned not to be afraid to attack problems I don’t feel trained for. I learned to not be afraid to learn a new domain quickly if you need to solve a problem.

It gave me the confidence to join this startup. It gave me the confidence early on when we had setbacks. It felt like I was an officer in the army. One of the trained traits of an Israeli officer is being able to solve hard problems by utilizing specialists on your team. How do you utilize the main expertise of others (when you are not that expert yourself)?

Later, my IDF experience reminded me to never be afraid to hire people who are smarter than you. Rather, do the opposite. Talent was enormously important in the IDF—elite units at the IDF are often the top one percent of the talent in Israel. The bar is insanely high. One talented team member allows you to make a dent in hard problems quickly.

Mission was my second lesson. I was in active service on September 11th, 2001, though not on anything specifically related to counter-terror--just working in a defense-related role in a western democracy. I felt less helpless than I could have felt working in the private sector. I felt that in a way my daily contributions were making a difference. At Flatiron, our mission feels just as valuable. When you think about it, cancer kills more people than Al-Qaeda. With this mission, you feel less helpless when cancer strikes close to you.

Lastly, at Facebook, I learned how to build towards excellence. We adopted high standards for code development from day one. Literally during the first week here, just us two engineers, we installed Phabricator (Facebook’s code review tool). I learned about structured interviews, engineering ladders, and how to grow and evolve our own culture from my time at Facebook.

Enigma: Gil, thank you so much for your time.

The Secret World of Newline Characters

Yang Yang — Tue, 19 Jun 2018 00:00:00 GMT

While fixing a recent regression in Enigma Public's CSV ingestion (a few perfectly fine CSVs were now being rejected), I stumbled upon some curious discrepancies among Python idioms for handling newline characters. This led me down a rabbit hole of computing history and a world of exotic newline specimina so riveting that, at my colleague Eve’s suggestion, I figured it'd be worth sharing them.

The first thing to know about newlines is that, even in quotidian computing, they have many character representations. Each of the three traditional operating systems uses a different one:

\n Unix and Linux style,

\r\n Microsoft Windows style, and

\r the somewhat rarer MacOS classic style,

where \n and \r are conventional escape sequences for the ASCII characters Line Feed (LF) and Carriage Return (CR), respectively. Already, one should question why the Windows style employs two characters whereas the others get by with just one. This in fact harkens back to typewriter convention, in which a newline involves two actions: returning the carriage to the left-hand side, and advancing the paper by one line.

These three ASCII sequences account for newlines in nearly all plain-text documents, certainly amongst those you might email around or download from the web. So despite the occasional hiccup, modern cross-platform software has a pretty good handle on newlines, and Python is no exception. It has a concept of universal newlines which treats all variants with egality. Furthermore, the documentation for Python's CSV reader recommends a single preferred way of dealing with universal newlines.
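For reference, the approach the csv module documentation recommends is to open the file with newline='' and let the reader deal with line endings itself. The sketch below writes a small temporary file purely for illustration; the file contents and column names are made up.

```python
import csv
import os
import tempfile

# Write a tiny CSV with Windows-style row endings and a quoted field
# that contains an embedded Unix-style newline.
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as f:
    f.write('id,note\r\n1,"line one\nline two"\r\n')
    path = f.name

# newline="" disables newline translation, so the csv reader sees the raw
# line endings and can tell row terminators from newlines inside quotes.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

os.remove(path)
print(rows)
```

The embedded newline survives inside the second row's field instead of being mistaken for a row boundary.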

Sometimes Python tries to be extra helpful. Suppose you have a multi-line string that needs to be split, wherever newlines occur, into multiple lines. Fortunately there's an aptly-named str.splitlines function to do exactly that, which you invoke and lo and behold, everything just works. So you send the strings to the CSV reader that prefers to receive individual lines one by one, and everything just works. And by the way, Requests (easily in the top-five most widely used Python libraries) also calls this function when you ask it for lines, and everything just works.

Then one day everything doesn't work, at which point you double-check the docs and realize str.splitlines has its own ideas about what a newline can be:

<div class="code-wrap"><code>\n Line Feed (LF)

\r Carriage Return (CR)

\r\n Carriage Return + Line Feed (CR+LF)

\x0b Line Tabulation (VT)

\x0c Form Feed (FF)

\x1c File Separator (FS)

\x1d Group Separator (GS)

\x1e Record Separator (RS)

\x85 Next Line (NEL)

\u2028 Line Separator (LS)

\u2029 Paragraph Separator (PS)</code></div>

Look, three may or may not be an acceptable number of newline variants, but eleven is definitely, unequivocally too many. Why would anyone need such multiplicity? Anyway, if you just want to keep the CSV reader happy, you'd find a way to just write the code to split lines without calling str.splitlines and get on with your day. But if you're me, you end up trawling the internet for the origin story behind every side character on this list (and then writing that code).
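A minimal sketch of what such code might look like, assuming you only want to break on the three traditional newlines. The function name split_basic_lines is made up for this example, and it is not necessarily the code that ended up in Enigma Public.

```python
import re

# Match CR+LF first so it is treated as one newline rather than two.
_BASIC_NEWLINES = re.compile(r"\r\n|\r|\n")

def split_basic_lines(text):
    """Split on LF, CR+LF, and CR only, leaving other control characters alone."""
    lines = _BASIC_NEWLINES.split(text)
    # Mirror str.splitlines: a trailing newline does not produce an empty last line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

sample = "a\x0bb\r\nc\rd\ne\n"
print(split_basic_lines(sample))  # the vertical tab stays inside the first line
print(sample.splitlines())        # str.splitlines also breaks at the vertical tab
```

One caveat: splitting text into lines before handing it to a CSV parser will still break quoted fields that legitimately contain a newline, so this only helps when the parser is able to stitch those lines back together.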

So here's the result of that trawl.

__________________________________________________

In 1963, the ASCII standard defined a character encoding for teleprinters, based on existing telegraph codes. The aforementioned LF and CR are part of the set of ASCII control characters, and among str.splitlines()'s list of newlines, five other control characters hail from this same set. Mr. Lammert Bies provides elucidating descriptions for them:

Line Tab a.k.a. Vertical Tab

"The vertical tab is like the horizontal tab defined to reduce the amount of work for creating layouts, and also reduce the amount of storage space for formatted text pages. The VT control code is used to jump to the next marked line."

In the world of typewriters, a vertical tab typically moved a distance of 6 lines, the same way a horizontal tab would typically move a distance of 8 spaces. In old printers, the vertical tab would also speed up vertical movement by indicating a jump to the next spot on a special tab belt, which was helpful for aligning content on forms.

Form Feed

"The form feed code FF was designed to control the behaviour of printers. When receiving this code the printer moves to the next sheet of paper."

File Separator

"The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose."

Nowadays we still need a way to delimit files within a serialized stream, for example when uploading photos on a website. But how do we get around the fact that each file, especially a non-text image file, could itself contain the FS character? The MIME spec calls for a custom-defined boundary, and suggests using an improbable string of gibberish:

<div class="code-wrap"><code>Content-Type: multipart/mixed;

boundary=gc0p4Jq0M2Yt08jU534c0p</code></div>

Group Separator

"Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group."

Record Separator

"Within a group (or table) the records are separated with RS or record separator."

We occasionally see CSV-ish files that use RS to separate records, which at first sounds defensible but honestly doesn't really help, because CSV authors just want to hit the enter key. And now your CSV parser has to support yet another newline.

__________________________________________________

In the late 1970s, ASCII was extended by the ANSI standard to include additional control characters; to differentiate, the former are called C0 controls, the latter C1 controls. Using these, the new-fangled computer terminals of the day (such as 1978's VT100) could draw primitive graphics at arbitrary cursor locations. Aivosto Oy takes us on a helpful tour of these:

"According to ANSI, the C1 controls were intended for input/output control of two-dimensional character-imaging devices, including interactive terminals of both the cathode ray tube and printer types, as well as output to microfilm printers."

Evidently, the authors could not resist adding in a new-fangled newline amongst this fresh batch of characters.

C1 Next Line

"LF, having two alternative functions, has been a major source of confusion. While LF was initially defined as a "move down" operator, standards began to allow LF as a newline too. As a result, operating systems differ in their definition of a newline. A newline is LF on Unix. Operating systems using CR LF include CP/M, DOS, OS/2 and Windows. Naturally, this caused an incompatibility. To solve the problem, control characters IND and NEL were added to the C1 area. This did not solve the issue, resulting in IND being removed later.

Note: NEL maps to the control character NL (New Line) in the EBCDIC character set used on IBM mainframes."

EBCDIC is an encoding descended from punched cards and the six-bit binary-coded decimal (BCD) code used with most IBMs of the late 1950s and early 1960s. Wikipedia has a great picture of such a punch card.

__________________________________________________

Finally, in the early 1990s when it was becoming increasingly obvious that the Internet, and soon the burgeoning World Wide Web in particular, would require a character set that supported all multilingual text, Unicode was born. By the time Unicode hit version 1.1 in 1993, it included the majority of common European- and Asian-based characters as well as—surprise, surprise—a few new control characters of course:

"A paragraph separator--independent of how it is encoded--is used to indicate a separation between paragraphs. A line separator indicates where a line break alone should occur, typically within a paragraph. For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>).

The Unicode Standard defines two unambiguous separator characters: U+2029 (PS) and U+2028 (LS). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous."

Yes, this surely made everything better.

__________________________________________________

Given the reality of reading CSVs, at best a loose convention with more interpretations and incarnations than even the newline, the most sanity-preserving path is usually to stick to the basic newlines (LF, CR+LF, CR) and call it a day, if you can get away with it.

But if one day you encounter a VT masquerading as a space in the text editor, or rescue some long-siloed database that was instructed by its departed master to delimit records with RS, perhaps you'll recall the enigmatic history of these dust-gathering control characters.

Collision 2018: Does AI Have More Potential for Good than Harm?

Enigma — Wed, 02 May 2018 00:00:00 GMT

Hicham Oudghiri, CEO and co-founder of Enigma, joins Sairah Ashman, CEO of Wolff Olins, and John Avlon, Editor-in-Chief and Managing Director of The Daily Beast, on the Center Stage of Collision Conference 2018 to discuss the impact AI may have on society and shed light on how AI will incentivize a world in which cooperation, not competition, will drive future economic value.

Improving Entity Resolution with the Soft TF-IDF Algorithm

Nick Becker — Tue, 17 Apr 2018 00:00:00 GMT

Here at Enigma, we extract signal from public data by linking together many datasets. Often, the datasets are published by different groups, so linking records across them is more difficult than a simple join or merge. This means that our engineering and data science team thinks a lot about how to improve our methods for data linkage.

Recently, we wanted to link the Open Payments dataset, which lists payments that drug and device companies made to doctors, with other healthcare datasets to explore how receiving these payments affects doctors’ behavior (e.g., prescribing, services, and referrals). The datasets describing doctors’ behavior are easy to link with one another because they include the unique National Provider Identifier (NPI) for each doctor. Unfortunately, the Open Payments dataset does not contain the NPI, so we need to link the Open Payments dataset with the Healthcare Licenses dataset to append each doctor’s NPI to the payments information. Once we add the NPI to the Open Payments dataset, we can easily find links between payments and behaviors.

To connect the Open Payments data with the NPI information, we need to match records using the names and addresses of doctors. If the names and addresses are identical across datasets, matching is straightforward -- just apply some simple data cleaning and search for exact matches. In our dataset, this naive method matches just over 50% of payments with physicians. Not bad, but looking at some of the non-matched pairs makes it clear that this inflexible matching method overlooks a lot of matches. For instance, do these records describe the same doctor?

SENTHIL K NATARAJAN 1870 WINTON RD S, SUITE 1, ROCHESTER, NY

SENTHILRAJAN KASIRAJAN NATARAJAN 1870 WINTON RD S, STE 1, ROCHESTER, NY

These probably refer to the same person, but in order to link non-exact matches we need to use fuzzy matching to compare text across datasets. A standard fuzzy-matching technique called “Jaro Similarity” measures how closely two strings match on a scale of 0 (nothing in common) to 1 (they are exactly the same), based on how many characters the strings share and how many of those shared characters appear out of order. The Jaro similarity for these two names is 0.78, which sounds pretty good. But if we decided that everything at 0.78 or higher was a match, we’d also end up matching names like MICHELLE RIMPENS with MICHAEL ROBINSON.
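To make the measure concrete, here is a small, unoptimized sketch of the textbook Jaro similarity. It is not the implementation used in our pipeline, and the function name is our own; it simply counts characters that match within a sliding window and penalizes matched characters that appear in a different order.

```python
def jaro_similarity(s, t):
    """Textbook Jaro similarity between two strings, from 0.0 to 1.0."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0

    # Characters only count as matching if they are at most `window` positions apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0

    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = True
                t_matched[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Transpositions: matched characters that line up in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2

    return (
        matches / len(s)
        + matches / len(t)
        + (matches - transpositions) / matches
    ) / 3

print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # a classic textbook pair
```

Scores for full names depend on details such as whether you compare the whole string or token by token, so this sketch is not guaranteed to reproduce the exact figures quoted above.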

Recently, we tested a method that can improve the results of fuzzy-matching. This method, called Soft TF-IDF¹, builds on existing fuzzy-matching methods by considering how frequently various letter combinations appear in the data. There are many common suffixes for surnames (e.g., -son, -poulos, -ski) or terms in addresses (e.g., ave, st, suite) that we should downplay (we wouldn’t want to say that two addresses are similar just because they both mention being on an avenue), so we should focus on the things that make names and addresses unique when measuring their similarity. Researchers have found this effective for linking records across datasets, so we were optimistic that the Soft TF-IDF method would improve our ability to connect more doctors with their NPIs.

In the next few sections, we detail how we built a scalable pipeline for linking datasets using the Soft TF-IDF algorithm.

Designing a Scalable, Nuanced Approach

Deciding which observations from each dataset refer to the same doctor is more generally known as entity resolution (ER). Procedures for entity resolution are usually based on advanced string comparisons, such as the Soft TF-IDF, to compensate for noise and imprecision in records across datasets. However, these methods are computationally expensive and would prevent us from scaling up to large datasets if we needed to compare every pair of records across datasets. To help ER scale, we use blocking to more efficiently decide which pairs of records need the more demanding string comparisons and which pairs can be safely ignored.

In the blocking stage, we group similar records together into a “block." In the disambiguation stage, we perform the computationally expensive operations within each block to determine which records match. For example, you could parse an address and only perform entity resolution on records that come from the same city because we would be confident that records describing entities in different cities are not the same. This allows you to use powerful (but demanding) matching techniques, like Jaro Similarity or Soft TF-IDF, on large datasets without wasting time comparing records that are clearly non-matches.
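The city example can be sketched in a few lines. The record layout (dictionaries with a "city" field) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

def candidate_pairs(left, right, key):
    """Yield only the cross-dataset pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for record in right:
        blocks[key(record)].append(record)
    for record in left:
        # Records whose key never appears on the other side are skipped entirely.
        for other in blocks.get(key(record), []):
            yield record, other

payments = [
    {"name": "SENTHIL K NATARAJAN", "city": "ROCHESTER"},
    {"name": "MICHELLE RIMPENS", "city": "BROOKLYN"},
]
licenses = [
    {"name": "SENTHILRAJAN KASIRAJAN NATARAJAN", "city": "ROCHESTER"},
    {"name": "MICHAEL ROBINSON", "city": "BUFFALO"},
    {"name": "MICHELLE RIMPENS", "city": "BROOKLYN"},
]

pairs = list(candidate_pairs(payments, licenses, key=lambda r: r["city"]))
print(len(pairs), "pairs to compare instead of", len(payments) * len(licenses))
```

Only the pairs that survive blocking are passed on to the expensive string comparison in the disambiguation stage.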

Blocking

For our current use case, we perform blocking based on the geographic location of each doctor. We geocoded each address from the Open Payments and NPI datasets to get the latitude and longitude of each location. Using geocoded addresses lets us capture the fact that 4802 10TH AVE, BROOKLYN, NY is actually the same building as 948 48TH ST, BROOKLYN, NY, despite the addresses looking quite different.

We store the latitudes and longitudes in a tree-based data structure to make it easy to find which addresses are near one another. For two-dimensional data, like latitudes and longitudes, a standard KD-tree organizes the data so it's easy to find other nearby points. It does this by repeatedly splitting and regrouping data into sets with similar values -- it first finds the median latitude and divides the data into two groups, then it finds the median longitude and splits in half again to create four groups. We apply this split-and-regroup process over and over until each group has no more than 16 points. By constructing the tree in such a way, we can find nearby points far more efficiently than if we had to look through the entire dataset.

Finding all the points within a set distance of a query point is called a “range query.” Picture a circle of that radius drawn around the query point:

Any point within the circle can be considered part of the entity resolution block for that central point. Tree indexing lets us quickly find these points. Once we find that a node doesn't contain overlap with the specified range, we know that none of the child nodes can either. By starting at the highest regions, we can quickly eliminate most of the data for being too far away.

For the blocking phase in our ER pipeline, we use the range query to find all the potential matches for a doctor’s address in the Open Payments dataset by looking at the addresses of providers with NPIs that are within 100 meters (approximately the size of a city block). Then, disambiguation proceeds on this small group of potential matches.
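The tree construction and range query described above can be sketched in pure Python. This is a simplified illustration rather than our production code: it assumes the coordinates have already been projected to planar (x, y) values in meters, and the function names and dictionary-based node layout are our own.

```python
import math

LEAF_SIZE = 16  # stop splitting once a group has at most this many points

def build_kdtree(points, depth=0):
    """Build a KD-tree from (x, y) tuples, alternating the split axis by depth."""
    if len(points) <= LEAF_SIZE:
        return {"leaf": True, "points": list(points)}
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # split at the median along the current axis
    return {
        "leaf": False,
        "axis": axis,
        "split": points[mid][axis],
        "left": build_kdtree(points[:mid], depth + 1),   # values <= split
        "right": build_kdtree(points[mid:], depth + 1),  # values >= split
    }

def range_query(node, center, radius):
    """Return every point within `radius` of `center`."""
    if node["leaf"]:
        return [
            p for p in node["points"]
            if math.hypot(p[0] - center[0], p[1] - center[1]) <= radius
        ]
    found = []
    axis, split = node["axis"], node["split"]
    # Only descend into a side if the query circle can reach it;
    # otherwise none of its child nodes can contain a match.
    if center[axis] - radius <= split:
        found += range_query(node["left"], center, radius)
    if center[axis] + radius >= split:
        found += range_query(node["right"], center, radius)
    return found

# A made-up grid of addresses spaced 50 meters apart.
grid = [(float(x), float(y)) for x in range(0, 1000, 50) for y in range(0, 1000, 50)]
tree = build_kdtree(grid)
block = range_query(tree, center=(500.0, 500.0), radius=100.0)
print(len(block), "candidate addresses within 100 meters")
```

In practice, a library implementation of a KD-tree or ball tree, combined with a proper great-circle distance, would be a more robust choice than this hand-rolled version.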

This hyperlocal blocking is fast to implement, given the tree structures and geocode-able data, and greatly reduces the number of string comparisons needed during disambiguation. If we didn’t use any blocking, we would need to perform billions of computationally expensive string comparisons on doctors. But this blocking strategy filters out most of those comparisons. Billions of potential string comparisons are reduced to only about 260,000 comparisons, a reduction of over 99.999%.

Disambiguation

After the blocking method identifies potential matches based on doctors’ addresses, we do pairwise comparisons of doctors’ names to determine which potential matches are actual matches. We compared two similarity measures in the disambiguation phase: the standard Jaro Similarity, and the Jaro Similarity plus the Soft TF-IDF algorithm mentioned above.

Regardless of which similarity metric you use, you will need to decide on a threshold for determining how similar entries need to be in order to be designated as a match. This decision always comes with a tradeoff. A low threshold will find more matches, but also more false positives, while a high threshold will result in the opposite. For our case, we want to minimize the number of false-positives to be certain we are accurately connecting doctors and payments, so we set relatively high thresholds.

Results

We decided to focus on doctors in New York, Enigma’s home state, to compare ER pipelines using Jaro Similarity alone and including both Jaro Similarity and Soft TF-IDF for disambiguation. The Healthcare Licenses dataset has 385,573 New York healthcare professionals and Open Payments lists 49,033 unique New York doctors.

Using standard Jaro similarity as a baseline, we matched 37,406 doctors with an NPI. Using both Jaro similarity and the Soft TF-IDF algorithm increases this number to 38,640, which means we found matches for ~1,200 more doctors than before! We manually reviewed a sample of the matches for each method and found very few false-positives in either (fewer than 1% of matches). So, incorporating the Soft TF-IDF method was the clear winner because it was able to find more matches without increasing our false positive rate. We were able to match about 80% of payments to physician NPIs, a huge step up from the 50% we were able to match with naive exact matching.

How Soft TF-IDF Similarity Works

Now that we’ve demonstrated the benefits of using the Soft TF-IDF method in comparison to standard string similarity metrics, let’s explore what’s happening under the hood. Below, we’ll walk through the mathematical explanation of the Soft TF-IDF algorithm. If you’re less interested in the mechanics, feel free to skip this section.

Soft TF-IDF is very similar to the standard TF-IDF algorithm, which can be used to evaluate the similarity of two records by considering how frequently their tokens occur in the data. The difference is that we now also let tokens that almost match count. It was introduced in Cohen et al. (2003), and there are a few ways to implement it. We'll walk through one version, shown below.
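For readers without the original figure, the formula can be written out as follows. This is a reconstruction in the style of Cohen et al. (2003), adapted to the symbols used in this post; the names CLOSE, N, and sim' are notation assumed here for the set of closely matched tokens, the best-matching token in the other string, and the “first level” similarity, respectively.

```latex
\mathrm{SoftTFIDF}(s, t) =
  \sum_{w \,\in\, \mathrm{CLOSE}(\theta, s, t)}
  V(w, s) \cdot V\big(N(w, t),\, t\big) \cdot D(w, t)

V(w, s) = \frac{V'(w, s)}{\sqrt{\sum_{w' \in s} V'(w', s)^{2}}},
\qquad
D(w, t) = \max_{v \in t} \mathrm{sim}'(w, v),
\qquad
N(w, t) = \arg\max_{v \in t} \mathrm{sim}'(w, v)
```

Here CLOSE(θ, s, t) is the set of tokens w in s for which some token v in t satisfies sim'(w, v) > θ.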

The formula looks intimidating, but looking at each component independently makes it significantly more clear. Let's dive in.

We want to compare two strings, s and t, that come from corpuses S and T, respectively.

If we’re comparing two strings, and we see a token is in s but not t, we probably don’t want that token to increase our similarity score. But, we might want to use it if it almost matches another word, depending on how close the match is. So, we can build our similarity measure by only considering the subset of tokens that match something in the other set above a threshold θ. Typically, we would use a standard similarity measure (such as the Jaro Similarity) to make these “first level” string comparisons.

V’(w,s) is defined as the TF-IDF weight of token w in string s, computed based on the entire corpus S. We then normalize by the square root of the sum of all the squared V’s (one for each token) in the set.

The second factor is the same normalized weight, but for the matched token from string t, computed over corpus T.

The final term, D(w,t), is the measure of similarity (using the “first level” similarity measure) between the token w and the token it matched to in the other set. As a result, D(w,t) is essentially a normalizing coefficient, which dampens the impact on the Soft TF-IDF (“second level”) similarity from words that matched with low “first level” similarities. For example, if two tokens matched with a score of 0.9, D would be set to 0.9.

By putting this all together again, it's clear that the total similarity is just the combined similarity scores of all the matching tokens. And that's all there is to it. This is Soft TF-IDF, the nuanced method we need to better compare two strings from different corpuses. With standard approaches and Soft TF-IDF, we've got exactly what we need for enhanced linking.
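To make the pieces concrete, here is a minimal Python sketch of the computation described above. It is an illustration under stated assumptions, not the implementation used at Enigma: `token_sim` uses `difflib.SequenceMatcher` as a stand-in for Jaro similarity, the TF-IDF weight is taken to be log(tf + 1) · log(N / df), tokens are split on whitespace, and each string is assumed to be a member of its own corpus. All function names are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def token_sim(a, b):
    # Stand-in "first level" similarity; the post uses Jaro similarity here.
    return SequenceMatcher(None, a, b).ratio()


def doc_freq(corpus):
    # Number of strings in the corpus that contain each token.
    return Counter(w for doc in corpus for w in set(doc.split()))


def weights(tokens, df, n_docs):
    # V'(w, s) = log(tf + 1) * log(N / df), then L2-normalized to give V(w, s).
    tf = Counter(tokens)
    raw = {w: math.log(c + 1) * math.log(n_docs / df[w]) for w, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in raw.values())) or 1.0
    return {w: x / norm for w, x in raw.items()}


def soft_tfidf(s, t, corpus_s, corpus_t, theta=0.9):
    v_s = weights(s.split(), doc_freq(corpus_s), len(corpus_s))
    v_t = weights(t.split(), doc_freq(corpus_t), len(corpus_t))
    score = 0.0
    for w, weight_w in v_s.items():
        # D(w, t): similarity between w and its closest token in t.
        best, d = max(
            ((v, token_sim(w, v)) for v in v_t),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if d > theta:  # only tokens that match above the threshold count
            score += weight_w * v_t[best] * d
    return score


corpus = ["john smith md", "jane doe md", "john doe dds"]
same = soft_tfidf(corpus[0], corpus[0], corpus, corpus)
partial = soft_tfidf(corpus[0], corpus[1], corpus, corpus)
```

Comparing a string with itself gives a score of about 1, while two strings that share only the common token "md" score much lower, because that token carries little TF-IDF weight.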

Because this approach uses the TF-IDF weighting of tokens in the strings, it can match records that standard Jaro-Similarity matching would miss without introducing false positives, such as our original example:

SENTHIL K NATARAJAN 1870 WINTON RD S, SUITE 1, ROCHESTER, NY

SENTHILRAJAN KASIRAJAN NATARAJAN 1870 WINTON RD S, STE 1, ROCHESTER, NY

Dr. Natarajan is just one of the many additional matches we can now make.

Final Thoughts

Implementing Soft TF-IDF to augment standard metrics like Jaro Similarity helped us uncover ~1,200 new matches in public data without overwhelming us with false matches. By effectively linking data, it's possible to better investigate important questions such as whether payments affect doctor behavior or whether a business is riskier than it seems on first glance.

We’re constantly improving our linking capabilities and empowering our clients to make more intelligent decisions by seeing the whole picture. If you think working on that sounds cool, we're always searching for talented Data Scientists and Software Engineers.

Parental Leave at Enigma

Rebecca Price — Wed, 14 Mar 2018 00:00:00 GMT

At Enigma, we pride ourselves on being open, generous, and supportive of one another. It’s baked into who we are and who we hire. As Enigma’s Head of People, my role is to propose, implement and manage policies that support our employees so they can thrive both inside and outside the office. To me, a central tenet of leadership is to lead from the heart. My heart changed fundamentally after becoming a mother (twice), and this in turn has fundamentally changed how I think and operate in my role at Enigma.

I have navigated the uphill terrain of transitioning into parenthood twice, including: working full-time up to my delivery date; receiving six weeks of paid parental leave (which felt all too short); taking extended unpaid leave for another six weeks, despite it being a financial sacrifice for my family; and returning (tearfully) to work full time, leaving my infant daughters in daycare.

I understand the financial, emotional, and physical impact of these milestones. Whatever your gender, age, socioeconomic status, sexual orientation, or race, becoming a parent is a life altering experience. At work, I believe everyone should feel supported through this transition, especially where, as an employee, one may question their ability to keep pace with colleagues and expectations.

Part of what drew me to Enigma was its values, which include generosity, and in this case, putting people first. I feel fortunate to work for a company where the leadership team is supportive of diversity, culture and people. Another draw was the potential to create HR policies that would make these values a reality. We believe when we lead with generosity, we build good will and increased discretionary effort with our employees. We also increase the likelihood that new parents choose to remain within a rigorous and engaging startup career.

To that end, I am proud to share highlights from Enigma’s recently updated parental leave program, which includes the following for all employees who have been with the company for at least 6 months:

20 consecutive weeks of paid leave for primary caregivers; 10 consecutive weeks of paid leave for secondary caregivers, taken in the first six months
Up to 32 (primary) or 42 (secondary) additional consecutive weeks of unpaid leave, totaling up to a year of time off
Continued provision of full health benefits during the entirety of leave, paid and unpaid
$1,000 bonus for expenses after baby arrives
Continued vesting of stock options during paid leave, and paused vesting of stock options during unpaid leave

We updated this policy because we are committed to building a culture and environment where not only our business grows, but our people do as well. I am passionate about the improvements we’ve made to our parental leave policy. I want to be part of building a great company, and this is how I believe we do that: by building a great culture, turning our core values into action, and understanding that Enigma is not only a great place to come and build a career, but also a place to lay the foundation for your future.

Updated July 09, 2020:

As our society and culture progress, so too, should the language that is used for parental leave. With this in mind, we have updated our Parental Leave Policy this year to better reflect fairness and equality, while keeping the same generous spirit and even offering additional leave to new parents.

Our previous policy allowed for 20 weeks of paid leave for “primary caregivers”, and 10 weeks of paid leave for “secondary caregivers”. We no longer classify parents into primary and secondary statuses, but rather we will give all parents 12 weeks of paid leave. In addition, if you go through childbirth you are eligible for an additional 8 weeks of paid parental leave (still totaling 20 weeks altogether).

Likewise, all new parents will continue to receive:

Up to 40 or 36 consecutive weeks of unpaid leave, totaling up to a year of time off
Continued provision of full health benefits during the entirety of leave, paid and unpaid
$1,000 bonus for expenses after the baby arrives
Continued vesting of stock options during paid leave, and paused vesting of stock options during unpaid leave

As Enigma continues to grow and mature, we are proud to keep our most important policies, the ones that lay the foundation for our employees' futures, progressive, relevant, and up-to-date.

What is Public Data?

Enigma — Tue, 20 Feb 2018 00:00:00 GMT

The Federal Communications Commission's net neutrality repeal has resulted in heated debate on the future of a free and open Internet. While a pertinent conversation, it often fails to address broader questions on the exact nature of open information:

What does it mean for information to be open?

What is open data?

How does open data differ from public data?

The answers become more nuanced when we consider factors such as access, redistribution, maintenance and structure.

According to the Open Knowledge Foundation definition, “Open data and content can be freely used, modified, and shared by anyone and for any purpose.” While this provides some helpful insight, it does little to hold open data to a technical standard. For that, we turn to the inventor of the World Wide Web, Tim Berners-Lee, who developed a 5-star scale for the quality of open data. His scale is as follows:

Make data available online and under an open license
Make it available in a structured format (e.g., Excel)
Make it available in an open structured format (e.g., CSV)
Use URIs for denotation
Link data to other data to offer context

The Open Data Institute adds further color by providing an open data certificate to verify a data publisher uses best practices to uphold data dependability. These practices include timely data updates, the presence of a data maintainer who provides metadata on changes, and the availability of historical data.

Today, there is an implied standard to open data: often structured, machine readable, open licensed and well maintained. Additionally, open data is free. The same does not necessarily hold true for public data.

Public data can be defined as all information in the public domain, encompassing anything from a monthly updating dataset on a government data portal to PDF files that are only accessible via Freedom of Information requests (and everything in between).

Open data is, by definition, easy to access. Public data on the other hand can be trickier, sometimes requiring a Freedom of Information Act (FOIA) request. For those unfamiliar, submitting a FOIA request to a government agency can be a real test of patience, taking months to receive a response and sometimes costing a fair amount of money.

Datasets that otherwise do not require a FOIA but are purchased from government agencies may also be public data. However, they are certainly not open, as they are not free. In one case, an open data activist in Virginia purchased the state’s corporate registration data for two years before turning around and publishing it for free. His pressure to make this information more widely available resulted in the state eventually publishing the data for free and for everyone.

Enigma Public makes a continuous effort to FOIA for politically relevant or otherwise interesting datasets. We offer all our datasets in machine readable format (downloadable as a CSV or accessible via our API), even when data at the source is anything but.

Conversation on what data transparency means and its pertinence to public knowledge goes beyond the Enigma offices. As the Inter-Parliamentary Union prepares for its 2018 World e-Parliament Report, we look forward to changing legislation as governments strive to increase the standard of their public data.

CXO Talk: Healthcare Innovation with Data and AI

Enigma — Fri, 02 Feb 2018 00:00:00 GMT

Data, artificial intelligence and machine learning are having a profound influence on healthcare, drug discovery, and personalized medicine. Enigma CEO Hicham Oudghiri joins CXOTalk host, Michael Krigsman, and Milind Kamkolkar, CDO of Sanofi, to discuss how data is changing the healthcare industry.

Web Summit 2017: Open Source and Data Opportunities

Enigma — Wed, 08 Nov 2017 00:00:00 GMT

Hicham Oudghiri, co-founder and CEO of Enigma, joins PJ Hagerty and Kris Borchers at Web Summit 2017 to explore the impact of open source on data and coding communities, and to discuss the challenges and opportunities it affords.

The Challenge of Big Data (McGowan Forum on Ethics)

Enigma — Thu, 26 Oct 2017 00:00:00 GMT

Marc DaCosta, co-founder and Chairman of Enigma, joins a panel of writers, corporate leaders and government officials at the National Archives in Washington, DC, to examine the ethical responsibility of those who compile and track citizens’ personal data.

Angel Nguyen Swift, Former American Express Executive, Joins Enigma

Enigma — Thu, 28 Sep 2017 00:00:00 GMT

Enigma Taps Industry Expert from American Express to Lead Financial Services Compliance Solutions

Angel Nguyen Swift brings more than 17 years of experience combating financial crime and building compliance strategies in the Fortune 500 to the fast-growing technology company.

New York, NY — Enigma, an operational data management and intelligence company that helps global organizations bridge the gap between data and smarter workflows, has hired industry veteran Angel Nguyen Swift to serve as Vice President of Compliance and Financial Crimes Solutions.

Angel will spearhead Enigma’s initiative to aid leading financial institutions in transforming their compliance efforts into intelligence assets. This move furthers her passion for bringing best in class tools and technology to the virtual desks of compliance units, enabling personnel at all levels to maximize capabilities and gain meaningful intelligence in their areas of expertise. Enigma’s compliance solution brings intelligent automation and linked entity assets to workflows, minimizing risk, increasing efficiency and reducing the astronomical investigative burden on compliance teams.

Angel joins Enigma from American Express, where she served as Vice President of the Global Financial Crimes Compliance – Financial Intelligence Unit (FIU). There, she played an active role in re-envisioning and rebuilding the FIU from the ground up, ultimately creating a global team of over 300 people in more than 40 countries. This centralized team is now responsible for the end-to-end SAR process (monitoring, investigations and reporting), Sanctions screening, PEP/EDD reviews, and Anti-Corruption payments monitoring across the enterprise. Prior to American Express, Angel served as a New York County Assistant District Attorney, where she prosecuted a plethora of cases, including street-level violent crimes and sex crimes, and led long-term identity theft and cyber crime investigations.

"We're honored to have Angel leading our rapidly-expanding financial services compliance vertical,” said Hicham Oudghiri, CEO of Enigma. “Her wealth of experience and unwavering commitment to bringing unparalleled tools and technology to the field of compliance is a huge asset to both our team and the industry more broadly."

“Throughout my career, I've been given amazing opportunities to realize and live my passions every day — to learn and to find creative, responsible and impactful answers to challenging questions,” said Swift. “Joining Enigma allows me to harness these passions and focus on industry-wide collaboration. I'm incredibly excited to be part of a team dedicated to building effective and innovative approaches to the pressing demands facing the AML and Compliance industry.”

About Enigma

Enigma, a New York-based operational data management and intelligence company, bridges the gap between data and smarter workflows to streamline operations and drive intelligent decision-making at scale. From building systematic compliance programs that help prevent financial crime to tracking adverse effects of pharmaceuticals to enhance patient safety, Enigma connects and enriches clients’ internal data assets to maximize value and efficiency. Leading Fortune 500 companies, including American Express, ADP and Merck, depend on Enigma's unique ecosystem of modular technology, vast library of public data and advanced data-linking capabilities.

Data in the Real World: Fast Forward

Enigma — Thu, 28 Sep 2017 00:00:00 GMT

In this episode of Fast Forward, Dan Costa talks to Hicham Oudghiri, co-founder and CEO of Enigma, about connecting public data to the real world and making it open and actionable.

Moving to Parquet Files as a System-of-Record

Jeff Knupp — Mon, 28 Aug 2017 00:00:00 GMT

Enigma is home to the world's largest repository of public data. Organizing, updating, maintaining, and indexing all of that data is no small feat. To do so, we were using a combination of technologies to power various parts of the system:

CSV files on Amazon's S3 as the primary entry point and format for data
Elasticsearch with a heavily customized document structure and dynamic index creation algorithm to allow full text queries over our sparse, heterogeneous data sets (something Elasticsearch is notoriously bad at)
Postgres as a resilient backing store for all data

During the evolution of what would become Assembly, Enigma’s platform for searching, storing, and enriching data, the system queried Postgres directly less and less often (as Elasticsearch queries were typically much faster than the associated Postgres queries). By the time our community data platform Enigma Public went live, the system only queried Postgres when a user wanted to export an entire data set as a CSV—and for a few very client-specific use cases. In reality, we were barely using Postgres, but it was a large line-item in our monthly AWS bill. However, to trust Elasticsearch as the sole source of data was, ahem, risky at best. Postgres's reputation for resiliency and reliability let us sleep easier at night (while costing us a fortune, of course). We knew Postgres could be replaced, but there are a thousand different ways to solve the "I need a single, canonical source for all my data" problem. Without any constraints to guide our decision, we may as well have thrown darts at a very boring dartboard. For more senior engineers, your Spidey-sense should be tingling: When you have too many possible solutions that seem equally good, you need to further constrain the problem. Novice engineers think of constraints as, well, constraining. But in truth, they act as pruning shears for the large branching tree of possible solutions.

Enter the Dragon... er, Product Managers

Every engineer's worst nightmare is to hear "so Product wants us to ship feature X by Y", where X may or may not have anything to do with your product, and Y is negative. Of course, as an engineer I think I know exactly what the customer wants; as a senior engineer, I know that's BS. Product Managers are the stewards of the product and meant to represent the voice of the actual customers (i.e. not the imaginary ones for whom you've already decided what they want).

On the flip-side, once every thousand years or so the engineering gods (on whose deaf ears every engineer’s prayer falls) take a break from writing the next volume of The Art of Computer Programming and perform a single miracle: Product's ideal direction for the product and Engineering's ideal direction for the product just happen to align perfectly. When you see it happen in the wild (and I've been privy to an actual honest-to-god bike-shedding argument, so my Software Safari creds are solid) everyone just kind of looks at each other a bit frightened. Surely this didn't just happen by chance?

Such was the case for Enigma Public and the foundation on which it runs, Enigma Assembly. Product wanted to take the next logical step in surfacing and organizing useful data sets: create derivative data sets by joining, filtering, and/or enriching existing data sets. The implementation, of course, is left to engineering, but this step represents a non-trivial shift in the way we think about and work with data at Enigma.

Meanwhile, Engineering (and SysOps) were looking for a less expensive and, more importantly, horizontally scalable solution to the System-of-Record issue. CSV files on S3, while overly simple, were actually much closer to the kind of data storage solution we wanted to use than Postgres. One big drawback of CSVs (among many, many other drawbacks) is their lack of schema information, and indeed of metadata in general, aside from column names. Column type-inference libraries for CSVs are actually pretty good, but we already knew the schema for new data sets before creating the CSVs and wanted to be sure the resulting schema always matched our definition.

Mean-meanwhile, the Enigma Data team was ramping up their use of Spark for various super-secret (and super-cool) machine learning projects. Hitting the Assembly API to download the CSVs of thousands of datasets, load them onto HDFS, and deserialize them with Spark proved to be an enormous bottleneck. They wanted the data in a format supported by Spark that took less time to deserialize than CSVs. They also wanted a simple SQL layer to be able to query the raw data from.

What's the simplest thing that could possibly work?

Before you go and call every vendor of systems even tangentially related to the problem you're working on just for the free dinners and helicopter rides (as I am wont to do), a useful exercise is to take your list of requirements and, to shamelessly steal a phrase from the Test Driven Development folks, ask "What's the simplest thing that could possibly work?" Put another way, what might an ideal solution look like if one removes all non-Essential Complexity?

To recap, here are the goals of our storage system:

Distributed, and accessible simultaneously from other distributed systems
Compact wire-format, as these data sets will likely be transferred quite often
Support for, at a minimum, Spark and Python
Efficient serialization and deserialization of data sets across supported systems (serialization and deserialization is a common bottleneck in many "Big Data" applications)
Capacity to make common, SQL-like operations (join, filter, add new columns) on existing data sets without requiring heroic data manipulation/transformations
Ability to power a SQL interface, either directly (if it's a DBMS) or indirectly (e.g. Amazon Athena on S3 files)

In which I make you do work

Let's actually work through this thought exercise. When it comes to data storage, the simplest and most fundamental building blocks are files. Our goal is to use a simple, file-based design for our system. And because we're going to be accessing these files using multiple distributed systems, we'll need some kind of distributed storage service to hold the data. Note, it needn't be a fully distributed file system like HDFS; all we require is a system that can map a file name to its contents—essentially an Object store.

For the files themselves, choosing the right format will be the key. The CSV format is a decent start (it is certainly simple), but we know that it is not able to encode schema information in the file itself (nor is there any standard way to encode it elsewhere). The wire-format is also about as un-compact as it gets. While we could compress CSVs before sending, that's true of any file, and thus not a "real" solution to the "compact wire-format" requirement. While we're at it, the CSV format is about the worst format one could create for efficient serialization and deserialization.

Ideally, our file format would be self-describing, giving us the freedom to use a "schema-on-read" approach where we simply dump the files somewhere (without first specifying their schema, as would be required in a system like Postgres) and decode the schema only when accessing them. That would allow us to tick the last requirement, powering an SQL interface, as many systems support creating SQL interfaces over file formats of this type. It's also much simpler than requiring a separate metadata store (a la Hive) with "table" definitions.

Perhaps the most restrictive (and thus most useful) requirement is the ability to make joining data sets and adding new columns to existing data sets "easy". Since most file formats store data row-by-row, this seems like a non-starter. After all, how would we add a new column to an existing data set? Short of essentially reading the data, jamming the new column's value in row-by-row, and then writing it out to a new file, there is no obvious simple solution.

So we know the kind of system we'd want, but are a bit stuck on the file format. Luckily, through the use of two new-ish Apache projects and Amazon S3, we can build our "simplest possible system" rather easily.

The perfect storm (but not "Storm" the streaming processing system from Twitter)

In describing the needs of the various teams earlier, I left out one small detail. The Enigma Data team didn't just ask for any old Spark-compatible data format that could be efficiently deserialized, they asked for a specific format. What they actually said was "and could you store the files in Parquet format on S3?"

If you're like me, you probably would have responded to that request in a manner similar to "you want me to put what, where?" After the Data team showed me “Google” and how to use it to search the entire Internet (what a time to be alive!), I came across Apache Parquet. Parquet is, wait for it... a file format. But not just any file format! It's a columnar format. In a columnar storage format, rather than storing data essentially as a list of independent rows, each file contains the values in one or more columns of data (the previous link has a nice, straightforward example). Parquet, in particular, also includes the schema of the data alongside the data itself at the end of the file (why the end rather than the beginning is left as an exercise for the reader). Columnar formats and systems based on them are rather new, so don’t worry if this is your first exposure to them.

By now, you’re probably sick of saying the word "columnar" in your head and are wary of the benefits over "row-ular" ™️ data. Let's discuss a few of those benefits:

Data for a single column is stored contiguously and all values share the same datatype, allowing you to compress the bejeezus out of the data using simple and well-known compression tricks. In addition to these tricks, Parquet supports using actual compression algorithms on the data—and even different algorithms for different columns of the same table.
When doing analysis on large data sets, it turns out "apply the following function to every value in this row" is not the most common data access pattern. Rather, it is much more likely that some subset of columns are needed at a given time (Pandas users, back me up). Arranging data by column means that columns unused in a given query never need to be read from disk—a huge performance boost for common operations on large-ish datasets.
In the brave new world ushered in by Big Data, a full data set rarely fits in memory. Therefore, disk access patterns have become an extremely important differentiator of storage systems. Serialization and deserialization of data written in a columnar format is usually much faster due to the fact that a given column's data is stored contiguously. That has locality (e.g. referential, temporal) wins written all over it.
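The difference between the two layouts is easy to see in a toy example. The sketch below uses plain Python structures rather than Parquet itself, with made-up data and hypothetical helper names, to show why a columnar layout makes single-column reads, new columns, and simple run-length compression cheap.

```python
from itertools import groupby

# Row-oriented: one record per row, as in a CSV.
rows = [
    {"state": "NY", "year": 2017, "amount": 10.5},
    {"state": "NY", "year": 2017, "amount": 3.0},
    {"state": "NY", "year": 2016, "amount": 7.25},
    {"state": "CA", "year": 2016, "amount": 1.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


columns = to_columns(rows)

# A query over one column touches only that column's values.
total = sum(columns["amount"])

# Adding a column is a single new entry; existing columns are untouched.
columns["amount_cents"] = [int(round(a * 100)) for a in columns["amount"]]

# Values of one type stored together compress well with simple tricks.
encoded_state = run_length_encode(columns["state"])
```

In the row-oriented list, each of those three operations would have to visit every record.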

So "Parquet files on S3" actually seems to satisfy most of our requirements:

Its columnar format makes adding new columns to existing data not excruciatingly painful
Files are compressed by the encoding scheme resulting in hilariously small Parquet files compared to the same data as a CSV file
All major systems that provide "a SQL interface over HDFS files" support Parquet as a file format (and in some it is the default)
Spark natively supports Parquet
S3 handles all the distributed system-y requirements

This should be simple! Or, Why I've been writing C and become a contributor to three open source projects

In fact, there is only one hard requirement missing from "Parquet on S3" (I tried to shorten that, but could only come up with "PoS" and "PS3"): Python compatibility. At the time, Parquet existed as both a spec and a reference implementation in Java. Only Java. This is to be expected, though, as Parquet is based on the Google paper describing Dremel and, as we all know, every technology described in a Google paper is quickly followed by an Apache project implementing the technology in Java. Alas, even my witty observations could not help us. With an entire backend written in Python, adding Java to the mix for such a small task was unpalatable.

I put my new-found Google skills to work and came across two tightly coupled projects: parquet-cpp and Arrow. The former is a C++ implementation of the Parquet format and the latter is interesting enough to deserve its own sentence. Arrow is a close analogue to Parquet, only the storage medium is memory (RAM) rather than disk. That is, Arrow is a columnar in-memory data format and series of libraries. At the risk of oversimplification, "Arrow : Memory :: Parquet : Disk". It also provides libraries for a growing number of programming languages.

One might ask why we are even discussing Arrow. After all, we should be able to generate Python bindings using parquet-cpp, right? Well, Arrow takes care of that, as well as the part we haven't given much thought to yet: if we want to use Parquet as the output format, what intermediate formats does it support? For Python, the answer is "Arrow", in the form of the pyarrow package.

pyarrow is a first class citizen in the Arrow project: a good deal of time and effort has been spent implementing the features on the Arrow roadmap. And since Arrow is so closely related to parquet-cpp, support for Parquet output (again, from Python) is baked-in. Of course, this is starting to sound like turtles-all-the-way-down. We've now shifted the question "what intermediate formats does Parquet support" to "what intermediate formats does Arrow support?" or, "How does one construct an Arrow Table?". The answer, interestingly enough (you’ll see why I say that in a bit), is to use Pandas.

Now, given that we already know we have, or can create, CSV representations of data sets, the sequence of steps to get to "Parquet on S3" should be clear:

Download and read a CSV file into a Pandas DataFrame
Convert the DataFrame into a pyarrow.Table via Table.from_pandas()
Output the Table as a Parquet file using pyarrow.parquet.write_table(our_table, some_filename)

This should be a piece of cake!

Spoiler alert: there is no cake

While both Arrow and parquet-cpp were still pre-1.0, there were/are a number of companies using both successfully in production. Few, however, it seemed, were working with CSV files of the magnitude we were used to (up to tens of GB). In addition, some Parquet implementations (cough Spark cough) had made some rather odd implementation choices.

The one that affected PoS (I've given up, let's get the giggles out now) directly was Spark's use of the int96 type to represent DATETIMEs. Now, in their defense, when they were implementing Parquet support there was only one other system that could actually output Parquet, and that was Impala. And Impala used int96 because <insert plausible explanation here>, so no one is actually to blame. Of course, once other systems started supporting Parquet output, Spark faced pressure to adopt the more "conventional" int64 type to represent DATETIMEs. Cue lots of Jira tickets, GitHub issues, Slack discussions, and email threads.

I wouldn't become aware of this fact until a bit later, as when I started work on PoS parquet-cpp didn't support DATETIMEs full-stop. Once support was added, I was happily generating Parquet versions of every data set in Enigma's public data repository—and the Data team was happily loading some percentage of those successfully into Spark. The rest were flat-out rejected due to a type mismatch, which is how I became aware of the int96 issue.

No matter! I would simply coerce DATE and DATETIME fields into Python/Pandas/numpy strings. This was fine (for a while, anyway) with our Data team as they didn't need to do any analysis on date data at the moment (though of course they needed to be able to load datasets with date data). And so I happily re-generated Parquet versions of every data set in Enigma's public data repository.

May you live in interesting times… and debug interesting bugs

During said regeneration, I noticed something curious. About 90% of the CSV to Parquet transformations worked just fine. For the remaining 10%, Pandas complained that the CSV had columns of mixed type. Knowing that this data already existed in Postgres with a set schema, that error message was a bit surprising.

A little digging revealed that the default behavior of the Pandas CSV parser is to operate over large files in chunks rather than reading the entire file into memory all at once. This can, in some cases (see the low_memory parameter), cause the column type inference code to be unable to determine a column's type. If all of the data is read at once, there is no such issue. This makes intuitive sense; if you can see all the data at once, you can definitively say if it's all one type or not. When you're operating over chunks of data, however, if any of the types inferred for each chunk doesn't seem to match the others, you can't make the same assertion.
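A toy version of the problem, in plain Python, looks like this. It is a simplified illustration of per-chunk inference disagreeing with whole-file inference, not how the Pandas parser is actually implemented; the function names and data are hypothetical.

```python
def infer_type(values):
    """Return int if every value parses as an integer, otherwise str."""
    try:
        for value in values:
            int(value)
        return int
    except ValueError:
        return str


def infer_by_chunk(values, chunk_size):
    """Infer a type separately for each fixed-size chunk of the column."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [infer_type(chunk) for chunk in chunks]


column = ["1", "2", "3", "4", "N/A", "6"]

whole_file = infer_type(column)        # sees "N/A", so the column is str
per_chunk = infer_by_chunk(column, 3)  # first chunk looks like int, second like str
```

Seen as a whole, the column has one definite type; seen chunk by chunk, the parser gets two conflicting answers and has to report a mixed-type column.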

No problem! As I said, I already had the schema of each of the CSVs and Pandas supports explicitly specifying the dtype of each column. And if for some reason that doesn't work, I could always read the entire CSV into memory (the file-generation process was running on a machine with 64 GB of RAM) and all the column types should be inferred properly. Both are parameters of pandas.read_csv(): dtype=<dictionary mapping column name to numpy type> for the former solution and low_memory=False for the latter.
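Here is what the two options look like in practice. This is a small sketch that assumes pandas is installed and uses made-up data and a hypothetical schema; note that the keyword in pandas.read_csv() is spelled dtype.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column should stay a string.
csv_text = "zip,amount\n01234,10.5\n98765,3.0\n"

# Option 1: pass the known schema explicitly.
schema = {"zip": str, "amount": "float64"}
explicit = pd.read_csv(io.StringIO(csv_text), dtype=schema)

# Option 2: read everything in one pass and let Pandas infer the types.
inferred = pd.read_csv(io.StringIO(csv_text), low_memory=False)

explicit_zips = explicit["zip"].tolist()  # strings, leading zero preserved
inferred_zips = inferred["zip"].tolist()  # parsed as integers
```

Even when inference succeeds, it can pick a type other than the one in your schema, which is another reason to pass the schema explicitly when you already have it.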

Surely at least one of those methods worked…

Spoiler alert: neither method worked. In fact, both methods uncovered bugs, though the bugs were distributed across three open source projects. When specifying dtypes, the interpreter core dumped within Arrow with the following stack trace:

<div class="code-wrap"><code>#0 __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
#1 0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char const*, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#2 0x00007fbaa5c0ce97 in parquet::PlainEncoder >::Put(parquet::ByteArray const*, int) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#3 0x00007fbaa5c18855 in parquet::TypedColumnWriter >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#4 0x00007fbaa5c189d5 in parquet::TypedColumnWriter >::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#5 0x00007fbaa5be0900 in arrow::Status parquet::arrow::FileWriter::Impl::TypedWriteBatch, arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr const&, long, short const*, short const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#6 0x00007fbaa5be171d in parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#7 0x00007fbaa5be1dad in parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#8 0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#9 0x00007fbaa51e1f53 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
#10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
...
#34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
#35 Py_Main () at ../Modules/main.c:768
#36 0x00000000004cfe41 in main () at ../Programs/python.c:65
#37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 , argc=2, argv=0x7ffe6510b1c8, init=, fini=, rtld_fini=, stack_end=0x7ffe6510b1b8) at ../csu/libc-start.c:291
#38 0x00000000005d5f29 in _start ()</code></div>

This actually turned out to be the manifestation of three issues. The first was that the pandas_type in the pyarrow.Table's schema was mixed rather than string in some cases, which isn't a valid type for pyarrow. The second issue, and cause of the crash, was an integer overflow in one of the various offsets stored in the BinaryArray type, the type used for strings. The last issue was the absence of bounds checks for overflow that would have otherwise prevented this.

Door number two, please

OK, so that's unfortunate. But remember, we still have another option! We can let Pandas read the entire CSV into memory in one go and infer the column types. Since none of the individual data sets are larger than the 64 GB of RAM the machine has, this shouldn't be a problem.

It was. And it looked eerily similar to the Arrow issue. Pandas crashed while trying to allocate memory due to an integer overflow. The overflow occurred in an offset value tracking the current capacity of the buffer the CSV is being read into. But before I could even get to that bug, another bug was causing the Python interpreter to core dump while trying to raise an Exception to tell me, incorrectly, that it was “out of memory”.

All told, the situation was... not ideal.

Patches on patches on patches

Clearly, I had to get at least one approach working (because: job). I submitted issues to Arrow and Pandas and created a reproducible example for each. For those who don't know, Wes McKinney just happens to be both a PMC member of Arrow and Parquet and the creator of Pandas (this is why Arrow is so tightly integrated with Pandas). I discussed the issues with him on Slack (I had already contributed to Arrow before and was already on their Slack). Once it became clear the Arrow issue would require multiple changes from multiple people, I set to work on the Pandas issue.

The Pandas issue was like an onion (with, uh, two layers): one had to peel back and fix the first issue before the second would reveal itself. Also, debugging was a lot more time consuming due to the majority of code being written in Cython. Cython code looks like Python got bit by a radioactive K&R book and mutated into some weird hybrid. It's eventually compiled into highly optimized C code, so you have to have a pretty good handle on C to do anything non-trivial in it. That said, Python can make use of the resulting compiled library as if it were any other C library (i.e. seamlessly) and it can provide massive performance gains for some types of workloads.

Anyway, the first issue was that Pandas was raising an Exception with the message "out of memory" and then immediately core dumping. This was due to a pointer (meant to point to the address of the actual error message in memory) being dereferenced before memory for the error message was allocated. That's a complicated way of saying it was something like the following (in pseudo-C):

<div class="code-wrap"><code>struct buffer {
    ...
    char* error_msg;
};

int parse_buffer(...) {
    ...
    if (make_stream_space(self, ex_fields - fields) < 0) {
        /* unlike the other error paths, no malloc() precedes this assignment */
        self->error_msg = "out of memory";
        return -1;
    }
}</code></div>

Every other time error_msg was set, it was preceded by self->error_msg = (char *)malloc(bufsize);, so this is just a case of someone forgetting to allocate memory for the error message before using it. Of course, it would be easy for someone to forget (or not know it was required at all) to allocate the memory for the error message before setting it. I have another GitHub issue open to pre-allocate the error_msg buffer, but for the moment just added in the missing allocation so I could continue.

/ragequit

Now I could successfully get Pandas to raise the "out of memory" exception without crashing. Of course, it should never have been raised in the first place, so it was time to fix the "real" integer overflow issue. The CSV tokenizer had an in-memory buffer to hold the data being parsed, and the implementation was pretty straightforward. In C, you typically create a dynamically-sized array-like container using a structure that stores a pointer to the start of the memory buffer allocated for the array contents.

Since the length of the container is dynamic, it is initialized with a default size and grows the underlying buffer as necessary. To know when to resize (and what new size to request), you keep track of the current size (how much data has been added) and current total capacity. When data is appended and would cause size > capacity, it’s time to grow (resize) the buffer!
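To make the bookkeeping concrete, here is a minimal sketch of a growable buffer in Python. It is not the Pandas tokenizer; the class name `GrowableBuffer` and its methods are made up for illustration, and because Python integers never overflow, it shows only the size/capacity logic rather than the bug itself.

```python
class GrowableBuffer:
    """Illustrative growable byte buffer (hypothetical, not Pandas code)."""

    def __init__(self, initial_capacity=4):
        if initial_capacity <= 0:
            raise ValueError("initial_capacity must be positive")
        self.capacity = initial_capacity   # total bytes allocated
        self.size = 0                      # bytes actually in use
        self._data = bytearray(initial_capacity)

    def append(self, chunk):
        needed = self.size + len(chunk)
        if needed > self.capacity:
            # Double the capacity until the new data fits, then "realloc".
            new_capacity = self.capacity
            while new_capacity < needed:
                new_capacity *= 2
            self._grow(new_capacity)
        self._data[self.size:needed] = chunk
        self.size = needed

    def _grow(self, new_capacity):
        # Stand-in for realloc: allocate a bigger block and copy the contents.
        bigger = bytearray(new_capacity)
        bigger[:self.size] = self._data[:self.size]
        self._data = bigger
        self.capacity = new_capacity

    def contents(self):
        return bytes(self._data[:self.size])


buf = GrowableBuffer(initial_capacity=4)
buf.append(b"abc")      # fits: size 3, capacity 4
buf.append(b"defgh")    # needs 8 bytes, so capacity doubles to 8
buf.append(b"i")        # needs 9 bytes, so capacity doubles to 16
```

In C, `capacity` and `size` would be fixed-width integers, which is exactly where the trouble described next comes from.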

The buffer is resized using realloc(3), which takes two arguments: a void* pointing to the currently allocated buffer and a size_t value representing the desired capacity. Like malloc(3), it returns a void* to the newly allocated region or a null pointer in the case of failure. All of this is rather straightforward. For a buffer that doesn't ever grow past a certain size, everything works fine. However, the offsets mentioned earlier were stored as plain old ints. This proved to be problematic.

Two’s complement, not “two complements”

To understand why, recall that a 32-bit (signed) integer has a maximum value of 2^31 - 1, or about 2.1 billion. When talking about a byte array, that equates to 2GB. When the CSV tokenizer’s buffer needed to grow, the current int capacity would be doubled and passed as the desired buffer size to realloc(3). But as we saw, realloc(3) expects the second argument (the desired size) to be of type size_t, which is guaranteed to be unsigned (and, on most modern platforms, 64 bits wide).

Most modern systems represent signed integers using a method called “two’s complement”. Adding 1 to a 32-bit signed integer whose current value is 2^31 - 1 causes the new value to be negative 2^31 (not positive 2^31, as one might expect), and the integer is said to have "overflowed". A 32-bit unsigned integer has a maximum value of 2^32 - 1 rather than the 2^31 - 1 maximum of its signed counterpart. While signed integers designate the first bit as the "sign bit" to indicate if the following 31 bits should be interpreted as positive or negative, unsigned integers are free to make use of all 32 glorious bits. When the signed integer overflows, the leading bit changes from 0 (positive) to 1 (negative). When that negative value is converted to an unsigned type like size_t (which is stored using 64 bits rather than 32 on most modern platforms), the result is very, very large.
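The arithmetic can be checked from Python with `ctypes`, whose fixed-width integer types wrap silently rather than raising. This is only a sketch of the wraparound, not the tokenizer code; the 1.5 GB starting capacity is an arbitrary value chosen for illustration.

```python
import ctypes

INT32_MAX = 2**31 - 1


def as_int32(value):
    """Reinterpret a Python int as a 32-bit signed C int (wraps on overflow)."""
    return ctypes.c_int32(value).value


def as_uint64(value):
    """Reinterpret a Python int as a 64-bit unsigned C integer."""
    return ctypes.c_uint64(value).value


# One past the largest 32-bit signed value wraps to the most negative one.
wrapped = as_int32(INT32_MAX + 1)

# Doubling a 1.5 GB capacity held in a 32-bit int goes negative...
capacity = 1_500_000_000
doubled = as_int32(capacity * 2)

# ...and converting that negative value to a 64-bit unsigned size is enormous.
requested = as_uint64(doubled)

print(wrapped)     # -2147483648
print(doubled)     # -1294967296
print(requested)   # 18446744072414584320
```

That last number is roughly 16 exbibytes, which no allocator is going to hand over.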

So now we know why the error only occurred on large CSV files. Allocations that would grow the buffer > 2 GB would effectively be asking for an enormous amount of memory (which triggered the "out of memory" message). I fixed both issues and a few other minor things, submitted the PR, and was good to go. At the same time, Wes and company were finishing up the fixes on the Arrow side. The fixes were released as part of Arrow 0.5.0. The Pandas fixes would be part of the next release. And they lived happily ever after...

No, they didn't.

Arrow 0.5.0 was a "curious" release, especially from a Python perspective. After Wes published pyarrow 0.5.0 to PyPI, I immediately downloaded it and tested the new Arrow implementation with a patched version of Pandas that included my fixes. There were no longer any error messages about running out of memory. In fact, there were no messages at all! Nothing seemed to be happening with the process despite it reporting 100% CPU utilization.

I used gdb on the running Python interpreter and discovered it was stuck in the jemalloc library. In Arrow 0.5.0, jemalloc became the default memory allocator due to much better performance than the ol' libc allocator. But it looked like this change was causing issues.

Specifically, the code was stuck in a spinlock. Spinlocks are a low-level concept programs use when "I have to acquire a mutex but I expect it to be almost always available or held for a very short amount of time". Rather than trying to acquire the mutex and sleep()ing as is normally done, a spinlock just "spins" through attempts to acquire the mutex without pausing, hence the 100% CPU utilization.
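As a rough illustration of the spinning behavior (not jemalloc's implementation, which is written in C), the sketch below busy-waits on a Python `threading.Lock` with non-blocking acquire attempts. The function name `spin_acquire` is made up for this example.

```python
import threading


def spin_acquire(lock):
    """Busy-wait until the lock is acquired; return the number of failed tries."""
    failed_attempts = 0
    # A non-blocking acquire returns immediately, so this loop never sleeps.
    while not lock.acquire(blocking=False):
        failed_attempts += 1
    return failed_attempts


lock = threading.Lock()
lock.acquire()                                # simulate another holder
threading.Timer(0.05, lock.release).start()   # the holder lets go after 50 ms
attempts = spin_acquire(lock)                 # burns CPU until then
print(f"acquired after {attempts} failed attempts")
lock.release()
```

If the holder never releases the lock, the loop never exits and the core stays pegged, which matches the symptom described above.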

It seems that using jemalloc for pyarrow while Python used the regular libc allocator caused issues between the two. Specifically, it looks like some of jemalloc's internal data structures were being corrupted. pyarrow 0.5.0 was immediately removed from PyPI and we worked on a short-term and a longer-term fix. For the next release, jemalloc would not be used as the default allocator (it could still be requested during compilation, but for pyarrow the library would be compiled without it). In the longer term, I'm working on a fix that makes use of jemalloc prefixes so that the two allocators running in the single Python interpreter process will play nicely. Apparently one is supposed to do this when using jemalloc alongside another allocator, though a lot of other projects have run into the same problem because the documentation is a bit lacking. While investigating, I discovered a huge Redis GitHub issue thread that described behavior identical to what we were seeing.

What’s the simplest thing that could possibly take two months of work?

And so, when pyarrow 0.6.0 was officially released a week ago, Enigma finally had the simple, straightforward System-of-Record composed entirely of Parquet files stored on S3. In a final ironic twist, version 0.6.0 is also the first to support writing dates in the deprecated int96 format, so that issue is solved as well (and Spark changed to use 64-bit integers as of their latest release). Anyway, we'll be decommissioning our Postgres instances soon and are well positioned to support the direction the business is headed. If only every major architectural change were so simple…

Data 101: Metadata

Alexandra Northington — Wed, 18 Jan 2017 00:00:00 GMT

This is the first post in a series covering the fundamentals of operational data management. We’ll be walking through context, linking, liquidity and how these core concepts come together to enable enterprises to put data to work to drive more efficient workflows and generate repeatable insights at scale.

For data to be useful in informing processes or answering questions it must be contextualized. By providing the context to understand what the data is — information about how it’s generated, collected, and composed — you can identify relationships across multiple heterogeneous datasets and better understand how to apply the data to answer questions or solve for challenges in a scalable, repeatable way. Metadata plays a key role in this equation.

What is metadata?

Definition: A set of data [that gives information] about other data; A conceptual representation of knowledge. This information is added to key data fields — though not overtly visible to users — to enable machines (and humans) to understand the meaning of information.

Metadata describes the source of data (the technologies and methods) and helps establish a common understanding of the meaning of the data to ensure correct and consistent interpretation and usage of information. It can also encompass details about the usage and transformation of data.

I’m not sure we collect metadata. Is it really needed?

Yes! Metadata is fundamental to understanding data. Moreover, it is essential for leveraging data as an enterprise asset. Beyond being largely useful for finding relevant information, discovering related resources, verifying insights, and auditing analyses, metadata enables you (or a machine) to do a number of important things:

Reuse domain knowledge: See several different perspectives from the same data
Understand relationships between entities
Leverage algorithms to de-duplicate, link, or match records
Assimilate new data faster
Mine data systematically
Understand how data has changed over time
Draw wider conclusions from data

What exactly does metadata tell you about data?

Context comes in a number of forms, each providing different color around a dataset. For now, we’ll focus on three key layers of metadata: user-contributed, derived, and provenance and lineage.

User-contributed: As the name suggests, this is metadata added by users who are familiar with the data. You can think of this as expert knowledge about the content in the form of annotations that help other users know what the data is and how to use it. This kind of information may include definitions of terms or columns, or other details that may help to identify similar datasets.

Derived: With this type of metadata (mined from each dataset), we are essentially asking, “What can the data tell us about itself?”. Derived metadata can encompass information such as size, quality, and types of data, number of records, how often the data is updated, last modified date, anomalies, date ranges, and frequent keywords or tokens.
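As a small sketch of what "asking the data about itself" might look like, the function below derives a few of these properties from a list of records. The function name `derive_metadata`, the output keys, and the sample rows are all made up for illustration.

```python
from datetime import date


def derive_metadata(records):
    """Compute simple derived metadata from a list of dict records."""
    columns = {}
    for row in records:
        for name, value in row.items():
            info = columns.setdefault(name, {"types": set(), "missing": 0})
            if value is None:
                info["missing"] += 1               # a basic quality signal
            else:
                info["types"].add(type(value).__name__)
    # Collect every date value to report the range the dataset covers.
    dates = [v for row in records for v in row.values() if isinstance(v, date)]
    return {
        "record_count": len(records),
        "columns": columns,
        "date_range": (min(dates), max(dates)) if dates else None,
    }


sample = [
    {"name": "Acme Corp", "filed": date(2016, 3, 1), "employees": 12},
    {"name": "Widget LLC", "filed": date(2016, 11, 20), "employees": None},
]
summary = derive_metadata(sample)
print(summary["record_count"], summary["date_range"])
```

A real profiler would also track things like update frequency and frequent tokens, but the idea is the same: none of this requires anyone to annotate the data by hand.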

Provenance and Lineage: This third layer of metadata is tied closely to the creation, transformation, and usage of the data. It encompasses change history, social details around how the data is being used and applied across an organization, and dependencies: Is it a parent dataset? Is it used to create other datasets? Is it part of other datasets? This type of information is particularly important when trying to understand how a change in one dataset could have a much wider impact.
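One way to picture the dependency question is as a graph walk. The sketch below assumes lineage is recorded as a mapping from each dataset to the datasets built directly from it; the dataset names and the function `downstream_datasets` are hypothetical.

```python
from collections import deque


def downstream_datasets(children, changed):
    """Return every dataset that directly or indirectly depends on `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:       # guard against revisiting shared children
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: parent dataset -> datasets derived from it.
lineage = {
    "raw_filings": ["cleaned_filings"],
    "cleaned_filings": ["company_profiles", "filing_counts"],
    "company_profiles": ["risk_report"],
}
affected = downstream_datasets(lineage, "raw_filings")
print(sorted(affected))
```

With lineage captured this way, "what breaks if this dataset changes?" becomes a lookup rather than an investigation.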

In subsequent posts we will explore additional layers of information such as ontologies and the mapping of the real world objects within datasets through specific relationships.

So, metadata helps you understand what you’re looking at.

While the meaning of data may seem obvious when presented within the system that collected it, if you were to look at that same data in a different setting, it might be challenging to understand.

Metadata describes what the data is so a human or machine can a) make sense of the information and b) identify and understand relationships that exist between concepts. This is particularly important when you’re integrating data from distinct sources, which may present equivalent concepts or data points in different ways.

Let’s look at phone numbers, for example. It’s likely you’ll encounter many different formats of telephone numbers out in the world. Take: 1 (212) 222 2222, +1 212 222 2222 or 1 212 222-2222. These may be written in slightly different terms, but any person who has used a phone would intuitively recognize these things as the same and would similarly recognize that 22-22 is not a valid phone number. Computers, however, don’t have the same intuition. Here, we could rely on semantic data management (the process of mapping a dataset or a type of data to a real world object or outcome) to map all of the formats above to the idea of “phone number.”
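A minimal sketch of that mapping for US numbers might look like the following. It assumes a single canonical form ("+1" followed by ten digits) and deliberately ignores extensions and international numbers; the function name `normalize_us_phone` is made up for illustration.

```python
import re


def normalize_us_phone(raw):
    """Map a US phone number in any common format to '+1XXXXXXXXXX', else None."""
    digits = re.sub(r"\D", "", raw)           # drop spaces, dashes, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # strip the country code
    if len(digits) != 10:
        return None                           # not recognizable as a phone number
    return "+1" + digits


for candidate in ["1 (212) 222 2222", "1 212 222-2222", "22-22"]:
    print(candidate, "->", normalize_us_phone(candidate))
```

The first two inputs collapse to the same canonical value, while "22-22" is rejected, which is the intuition a person applies without thinking.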

Or, imagine trying to transfer data from a form into a spreadsheet. The spreadsheet asks for the same information as the form, but the header names of each column are completely different from the field names on the form. Without semantic types, would you know where to put the information? Would someone else looking at the data know that the form and the spreadsheet contained the exact same information?

Sounds necessary for analyzing datasets from different sources.

Absolutely. Metadata enables people and software to share a common understanding of the structure of information, making it much simpler to extract, aggregate, and link information from different sources and systems. Metadata provides color to a number of things around the data for a person or machine that may not be familiar with it. This is particularly valuable because it allows someone (or something) to use a data set without having to refer back to the source system. It eliminates the need to integrate numerous systems or applications when analyzing information across multiple datasets.

In other words, metadata helps make data reusable for any number of purposes: a user can rely on metadata to re-contextualize data to answer a specific question without going back to the source system to get only a small subset of the data.

It offers a consistent view for information held in data siloed across teams or organizations, thus providing a greater body of knowledge from which to uncover answers or identify trends.

Big picture, what does metadata mean for operationalizing data?

Metadata enables you to apply knowledge and insights in a scalable, repeatable way. Think of it as a layer of information serving as the connective tissue for enterprise data analysis — a layer that makes it possible to create a more flexible and transparent framework.

This layer of embedded intelligence can then help power and optimize your data infrastructure, forming a key part of a feedback loop in which exposure to additional data will further enhance the metadata. In other words, your knowledge base will continuously grow (and grow more intelligent) as you assimilate new data from different sources.

The Challenge in Analyzing Adverse Event Data

Kelvin Chan — Wed, 30 Nov 2016 00:00:00 GMT

This is the first post in a series about the challenges and unlocked potential in analyzing pharmaceutical data at scale. Feel free to email me or post in the comments section about topics you’d like to see covered in this series.

Today, nearly 60% of Americans over the age of 20 take at least one prescription drug. According to a recent study, in 2015, 4.4 billion prescriptions were dispensed in the United States. Spending on medicines in the U.S. increased by 12.2%, reaching $424.8 billion (based on invoice prices). A key driver in this continued growth is the establishment of new brands, which contributed to over half of the total spending growth in 2015. As the pharma industry continues to grow, so too does the amount of data — data that can and will inform the development of drugs that shape the future of medicine.

Adverse Events

Among the numerous disparate streams of data flowing through the healthcare industry, adverse events (AE) data has the potential to make a profound impact on patient health. An adverse event (AE) is an untoward medical event (think side effect) that occurs when taking a drug (e.g. you took ibuprofen and got a skin rash). However, an AE does not indicate that a given drug is responsible for the side effect. Rather, it signals that someone has reported an outcome after a patient took a drug. In other words, it notes a co-occurrence, but not necessarily a causation. Patients, relatives, doctors, and nurses (or anyone, really) can report these events to pharmaceutical manufacturers or regulatory agencies like the FDA. Such events range from non-serious instances like drowsiness to more serious events resulting in hospitalization or even death. Tracking this information is critical to patient health and safety.

Beyond uncovering unknown (and potentially life-threatening) side effects of a drug, AE data can reveal troubling issues about a step in a manufacturer’s supply chain (e.g. a drug being tainted by a harmful chemical), issues with the drug’s mechanism of action, or even new adverse drug-to-drug interactions. Surfacing this information in a timely manner would enable manufacturers to change medication warnings and labelling, restrict usage, discard specific affected batches, or completely recall hazardous medications.

While AE data has long been considered a compliance resource to monitor drug-related side effects, this data also has the power to predict, identify, and prevent dangers that pharmacological interventions can have on patients. Analyzing this data in aggregate could produce real-world findings not revealed in clinical trials that reflect how specific patient profiles or populations respond to certain interventions. Unfortunately, AE data is largely fragmented and unstructured, making timely reporting and holistic discovery very difficult.

What makes adverse event data and reporting so challenging?

Volume: The massive volume of reports, ranging in the millions, can stifle the ability to effectively sort through each report. This volume also results in delays in reporting, as some processes take additional time for case managers to adjudicate and manage. In all this noise, it’s often difficult to sift through true signals of side effects.

Fragmentation: Today, AEs are fragmented across different country locations of a global pharma company, and public AE data from the FDA or similar country agencies live in their own silos. There is no standardized schema to merge this data.

Standardization: Much like real-world health data from claims and EMRs, AE data is messy and largely unstructured. Most AEs are reported in a manual process, as doctors or patients commonly file reports over the phone or via mail. The data is often comprised of non-standardized reports about interventions used or side-effects produced. While there are coding mechanisms to describe the type of AE (e.g. rash, stomach pains, headache etc.), coding standards vary over time and by country. The sheer volume of data makes it difficult to normalize the data at a scale that “old world” data infrastructure could manage.

Duplication: AE reports are often submitted multiple times by patients, nurses, or doctors to different regulatory bodies or manufacturers. The FDA may get the same report that the pharmaceutical company gets. Perhaps the doctor calls it in to the FDA and a patient reports it to the pharma company. This leads to a massive decentralization of data both publicly and within the enterprise, making it difficult to analyze from a single source of truth.
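To illustrate the duplication problem, here is a toy sketch that merges reports describing the same event. Real AE matching is much fuzzier than this; the field names (`drug`, `event`, `patient_age`, `onset_date`, `source`) and the exact-match key are assumptions made for the example, not an actual reporting schema.

```python
def match_key(report):
    """Build a crude identity for a report from a few normalized fields."""
    return (
        report["drug"].strip().lower(),
        report["event"].strip().lower(),
        report["patient_age"],
        report["onset_date"],
    )


def deduplicate_reports(reports):
    """Collapse reports that share a match key, remembering where each came from."""
    merged = {}
    for report in reports:
        key = match_key(report)
        if key not in merged:
            merged[key] = {
                "drug": key[0],
                "event": key[1],
                "patient_age": key[2],
                "onset_date": key[3],
                "sources": [],
            }
        if report["source"] not in merged[key]["sources"]:
            merged[key]["sources"].append(report["source"])
    return list(merged.values())


reports = [
    {"drug": "Ibuprofen", "event": "Skin rash", "patient_age": 42,
     "onset_date": "2016-05-01", "source": "fda"},
    {"drug": "ibuprofen ", "event": "skin rash", "patient_age": 42,
     "onset_date": "2016-05-01", "source": "manufacturer"},
    {"drug": "Ibuprofen", "event": "Headache", "patient_age": 30,
     "onset_date": "2016-06-12", "source": "fda"},
]
unique = deduplicate_reports(reports)
print(len(unique), unique[0]["sources"])
```

Keeping the list of sources on each merged record preserves the fact that the same event reached both a regulator and a manufacturer.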

So, where do we go from here? In the next post we’ll walk through how manufacturers and regulators can normalize and manage data to gain a more holistic view of the drug development and patient health landscapes to drive proactive decision making.

2016 Enigma Momentum Report

Enigma — Thu, 03 Nov 2016 00:00:00 GMT

Our passion and enthusiasm for data is evident in every interaction we have - across teams, with our clients, and with the broader community. In that spirit, what better way to represent our massive growth over the past few years than with a data report? Enigma has come a long way since our inception. And this past year in particular has brought tremendous change and growth for our team.

Today, with over 30 new curious, driven employees, the seasoned experience of new engineering leadership, and thousands of valuable new datasets, we look ahead to a future that’s brighter than ever. But before we get there, let's look back on what we've been up to thus far.

Without further ado, we're excited to share the inaugural Enigma Momentum Report. [link to pdf]

Supporting Crunchbase as It Goes Pro

Enigma — Mon, 12 Sep 2016 00:00:00 GMT

Today at TechCrunch Disrupt, an event where Enigma has a long history, Crunchbase announced Crunchbase Pro. Why, you might ask, are we talking about this news here on our blog? The short answer is that we’re proud to be a data partner in this new endeavor alongside other great companies like Apptopia, Glassdoor, Product Hunt, and SimilarWeb. Together we’re seeking to provide the best intelligence to Crunchbase’s 25M+ users as they make smarter business decisions based on data.

The long answer is that Crunchbase Pro aligns closely with Enigma’s own beliefs and goals. As many of you know, Crunchbase is a trusted business information platform. It’s a resource many people rely on for accurate information about businesses, executives, investors, news, and more. At Enigma, our products break down barriers between people, technology, and infrastructure, enabling our customers to acquire, link, and apply data at scale for intelligent operations. With Crunchbase Pro, professionals from organizations like Bain & Company, Citibank, Deloitte, and Microsoft are able to benefit from the same capabilities, surfacing and applying data through a search infrastructure that the Crunchbase team’s rebuilt from scratch, custom lists, advanced analytics and more. Users can now discover new companies, people, and deals, as well as more complex information, such as the median series A investment in financial services companies over the last 90 days.

At Enigma, we believe that people should always ask why and be empowered to find the answer. That’s why providing data to Crunchbase Pro makes perfect sense for us. We believe in building new pathways to the data—and leveraging that intelligence to continuously adapt and move forward. It’s about having the data to not only answer every question, but also to spark new questions.

From companies searching for partners to collaborate with, to investors seeking new opportunities, to entrepreneurs looking for relevant investors, as of today, Crunchbase Pro enables users to ask harder questions, get better answers, and make more confident business decisions. That’s a mission that we can get behind. At Enigma, we understand that data has the power to spark meaningful change: when data flows seamlessly throughout an organization, playing a role in answering every question, informing every decision, it unlocks new value and drives measurable impact.

We’re excited to be a launch partner and a source of key data to Crunchbase Pro as they seek to help more professionals answer harder questions every day to improve the business they’re getting done in the technology sector and beyond.

To learn more visit: pro.crunchbase.com

Smoke Signals

Enigma — Thu, 24 Sep 2015 00:00:00 GMT

The problem

We can expect 25,000 people to be injured or killed by fires in the United States this year. Of the more than 130 million housing units across the country, 4.5 million do not have smoke detectors, placing their inhabitants at substantial risk. Driving this number down is the single most important factor for saving lives put at risk by fire.

A broad range of people are trying to address this issue, from local fire departments to the Red Cross. However, they all face the same problem: Which door do we knock on first?

What we're doing about it

Prioritizing outreach for fire prevention is difficult without access to data and analytics targeting at-risk houses. By drawing together different public data sets, we developed a predictive model that identifies hot spots of homes that are unlikely to have smoke alarms.

In collaboration with the Red Cross, DataKind, and local fire departments, we are working to help get smoke alarms where they need to be. The goal is to provide a tool that helps fire departments and other groups work more efficiently. Local fire department outreach coordinators can combine their on-the-ground knowledge about their areas with trained models scored at a newly enhanced geographic granularity. This is only a first step. We are also releasing all of the data, components and algorithms that make this tool work in hopes that others can improve upon what we have begun.

The background

In November of 2014 there was a house fire in the Broadmoor neighborhood of New Orleans that killed five people, including three children. The house did not have a smoke alarm. Enigma began working with New Orleans' Fire Department and Office of Performance and Accountability to develop a model that identified New Orleans blocks least likely to have smoke alarms, and most likely to experience a fire fatality. This enabled the New Orleans Fire Department to conduct a door-to-door outreach campaign that places smoke alarms in as many at-risk homes as possible. Drawing on our learnings from New Orleans, we extended the model to apply to cities within the 30 largest 'Metropolitan Statistical Areas', in hopes that more people can use and improve on our insights.

How we did it

This tool is an example of applied data analytics aimed at improving the lives of people in cities. Below is an overview of the components and data involved in making it work.

US Census Data

The Census' American Housing Survey provides extremely detailed information about residential housing that is normalized at the level of entire cities. The American Community Survey provides extensive demographic data about households within very granular census blocks. We linked the two together to provide the basis for our model. This enabled us to understand which census blocks are least likely to have smoke alarms installed and most likely to have at-risk populations. Learn more and access the data here.

TIGER Geocoder

Knowing where in the world an address is located can be a difficult problem. The US Census has released a database of all the streets in the country and how to geocode them. This was important for helping generate a list of addresses towards which to focus outreach efforts. We also released a version of TIGER that makes it easy for anyone to spin up a geocoder. Read more about it here.

Analytics

The basis of this tool is US federal government data. However, it becomes much more effective when local data on fire incidents is available. Analyzing where fires have historically occurred enables risk models to be tuned so they can be more sensitive to where the risk actually lies.

Mike Flowers, Former NYC Chief Analytics Officer, Joins Enigma

Enigma — Fri, 12 Sep 2014 00:00:00 GMT

We’re very pleased to announce that Mike Flowers, former Chief Analytics Officer of NYC under Mayor Bloomberg, will be joining Enigma as our first CAO. Mike brings an extraordinary depth of experience regarding how data and analytics can be leveraged to improve operational efficiency and to drive smarter decision making across a variety of contexts. During his tenure as the founding member of NYC’s Mayor’s Office of Data Analytics, Mike transformed a wide range of city operations, from public safety, disaster response, and sustainability to public health, finance, and economic development. Doing so meant breaking down barriers across over 40 agencies in order to surface actionable insight that allowed the Bloomberg Administration to intelligently coordinate the work of New York City’s 300,000 employees and $70 billion annual budget. Well beyond New York, Mike is a respected leader in combining open data and analytics, having been twice recognized by the White House for his work, in addition to advising governments and companies around the world.

At Enigma, Mike will be leading the development of an enterprise-focused analytics platform. As the information revolution transforms critical sectors such as finance, manufacturing, and supply line logistics, enterprises are accruing data about their operations at a breakneck pace. Time and time again, however, we hear our clients’ frustrations regarding the difficulty of sharing and querying that data globally across their organizations. From our perspective, this represents both a cultural and technical challenge that must be addressed simultaneously through technological innovation and a deep understanding of the workflows and processes that drive particular organizations.

Enigma provides unique value to the enterprise by delivering the tools to break down internal data silos and a platform for analyzing enterprise data in concert with public data, itself a reflection of economic activity as it is being planned, executed and measured. As we continue growing in this space, and providing actionable and measurable insight for our customers, we are thrilled to welcome Mike to the team!

Enigma Selected for FinTech Innovation Lab

Enigma — Fri, 14 Mar 2014 00:00:00 GMT

For the past 12 weeks, Enigma has been participating in the Fintech Innovation Lab, a mentorship program created by Accenture and the Partnership Fund for New York City and supported by more than a dozen of the world’s leading financial institutions. As one of only six companies chosen by chief technology officers from 15 participating financial institutions, we’ve been fortunate to receive high-level mentoring, product development advice, and dozens of enlightening discussions with executives in the finance and venture capital world.

The program has been a powerful opportunity for us to work closely with experts in the financial industry to collaborate on products which tap into the immense power of public data, and we’d like to thank all of the mentor companies involved, as well as our colleague startups Kasisto, LMRKTS, pymetrics, RevolutionCredit and Standard Treasury.

At the end of the program, Fintech hosts a “Demo Day” attended by executives from all sponsor companies, as an opportunity for participants to explain their business models to top influencers in the industry and gauge interest in future product directions.

For more information about this year’s Fintech Lab, here’s some of the press coverage from Demo Day:

Six Entrepreneurs Showcase Cutting-Edge Financial Services Solutions at 2014 FinTech Innovation Lab Demo Day in New York – Partnership for New York City
Six Fintech Startups That Wowed Bankers – American Banker
Funding the FinTech Pioneers – Financial Technologies Forum
6 FinTech Startups to Watch – FOX Business

TechCrunch Disrupt: Startup Battlefield 2013

Enigma — Tue, 30 Apr 2013 00:00:00 GMT

Enigma co-founders Hicham Oudghiri and Marc DaCosta showcase the first iteration of Enigma on the way to winning TechCrunch Disrupt NY 2013. There's a sea of interesting public data out there just waiting to be tapped into, but there's a problem — most people have no earthly idea how to access it. And even if they're able to make some headway, there's still an untold number of connections between that data and even more data tucked away in another silo.

Sanctions Tracker

Enigma — Mon, 01 Apr 2013 00:00:00 GMT

Update: In April 2017, Enigma launched a tracker to monitor U.S. sanctions, which we maintained through September 2019. You can find up-to-date information about U.S. sanctions here.

United States Sanctions Tracker 1994 - 2019

What are sanctions?

Sanctions restrict the business engagements of US companies and persons. The Department of the Treasury’s Office of Foreign Assets Control (OFAC) imposes economic sanctions to achieve specific national security or foreign policy aims. Sanctions can be imposed on entire industries or, in the case of the Specially Designated Nationals (SDN) list, on targeted people, organizations, non-state actors, and even boats. A program colloquially known as “North Korean Sanctions,” for example, has listed 82 individuals, 7 organizations, and 117 companies over the years.

What determines the list?

Executive action drives who is on the sanctions list. This is why the Sanctions Tracker is launching now: it will continue to monitor how the new Trump administration’s actions on sanctions compare with those of the previous three administrations. Taking a data-oriented approach to examining the past 20+ years of US sanctions makes it easier to contextualize and visualize changes to the program.

What are the penalties?

Since 1994, the SDN list has changed an average of 40 times per year. Businesses handling large volumes of transactions each day, such as banks, must keep up to date with each and every one of these changes, ensuring that no entity on either side of a transaction is subject to sanctions. Businesses caught engaging with a sanctioned entity are subject to fines, as well as potential civil and criminal penalties. Banks receive the largest financial penalties, such as the $8.9 billion fine paid by French bank BNP Paribas in 2014, but all businesses operating in the U.S. are subject to the same rules. In 2016, for instance, Chicago-based PanAmerican Seed Company agreed to pay more than $4 million for violating the Iranian sanctions program.

Sanctions Program Landscape

Sanctions programs can be country-based or thematic. When appropriate, we have bundled several programs into larger categories. For instance, we combined 4 programs with titles such as “Specially Designated Terrorist [SDT]” and “Global Terrorism Sanctions Regulations [SDGT]” into a single “Terrorism” category.
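The bundling described above amounts to a lookup from program tag to broader category. Here is a minimal Python sketch of that idea, not the Sanctions Tracker's actual code: only the SDT and SDGT tags and the “Terrorism” category come from the text, while the function names, the fallback for unmapped tags, and the sample tags are assumptions made for illustration.

```python
from collections import Counter

# Program tag -> broader category. Only SDT and SDGT -> "Terrorism" come from
# the article; any further entries would be assumptions.
PROGRAM_TO_CATEGORY = {
    "SDT": "Terrorism",
    "SDGT": "Terrorism",
}


def categorize(program_tag: str) -> str:
    """Return the broader category for a program tag.

    Unmapped tags fall back to the cleaned tag itself (an assumed convention).
    """
    tag = program_tag.strip().upper()
    return PROGRAM_TO_CATEGORY.get(tag, tag)


def count_by_category(program_tags):
    """Count list entries per category from an iterable of program tags."""
    return Counter(categorize(tag) for tag in program_tags)


# Hypothetical sample: the program tag attached to each of five list entries.
sample = ["SDT", "SDGT", "SDGT", "IRAN", "CUBA"]
counts = count_by_category(sample)
```

With this sample, the two terrorism programs collapse into one “Terrorism” bucket while the unmapped tags keep their own names.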

General trends emerge when we group sanctions by category. Terrorism and narcotics sanctions eclipse country-level sanctions. That said, it should be noted that volume is not the only measure of the impact or effectiveness of a sanctions program. Effective programs are those that achieve their foreign policy aims, and targeted sanctions are only one tool. The Russia program, for instance, includes relatively few entries on the SDN list but entire industry sectors are potentially subject to sanctions, including financial services, defense, and energy.

Global Reach of Sanctions

Locating the Sanctioned

The US sanctions list encompasses businesses and people based all over the world. Even programs associated with a specific country will include associated businesses or known aides based in other countries, such as a lone Canadian company that was long on the Cuban sanctions list. The map below highlights the geographic spread of four groups of sanctions programs: those on Iran, North Korea, and Russia, and the “Kingpin” program targeting drug dealers.

Sanctions By Administration

Since 1994, the composition of the sanctions list has changed in reaction to global events as well as presidential predilection. Bush added far more entities than he removed (2549 added vs 274 removed). Obama removed far more entities from the list than Bush did (1542), but he also added slightly more (2595). These two presidents also differed in the types of entities sanctioned, with Bush adding more vessels and companies to the SDN list than Obama.

Additions to the list do not always follow global events immediately. For example, the United States continues to designate Al Qaeda individuals 16 years after the September 11th attacks.
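To make the comparison concrete, the arithmetic behind those figures can be worked through in a few lines of Python. The four counts are the ones quoted above; the variable names and the removal-to-addition ratio are our own framing, not a metric from the tracker.

```python
# Additions and removals per administration, as quoted in the text.
changes = {
    "Bush": {"added": 2549, "removed": 274},
    "Obama": {"added": 2595, "removed": 1542},
}

# Net growth of the list under each administration.
net_growth = {name: c["added"] - c["removed"] for name, c in changes.items()}

# Removals per addition: a rough measure of how much churn offset growth.
removal_ratio = {name: c["removed"] / c["added"] for name, c in changes.items()}
```

Under these figures the list grew by a net 2,275 entities under Bush and 1,053 under Obama, even though the two administrations added nearly the same number.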

Uneven Growth

Over the past twenty years, the overall size of the SDN list has increased, but the rate of growth (or in some cases the rate of reduction) has varied by sanctions program. Over that period, the number of sanctions against known terrorists has grown steadily and substantially, to well over a thousand entities at the start of 2017.

In comparison, the number of entities on the SDN list pursuant to the Iran sanctions program remained constant for years before dropping as a result of the 2015 Iran nuclear deal.

Methodology

OFAC publishes a wealth of historical sanctions records dating back to 1994, but not in a format ready for analysis. The data must be parsed for unique identifiers like place of birth, passport and national ID number, aliases, and addresses, which in turn must be cleanly formatted, geocoded, and deduplicated.
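As a rough illustration of the formatting and deduplication step, here is a minimal Python sketch. It assumes records have already been parsed into dictionaries; the field names (name, place_of_birth, passport), the choice of matching key, and the sample records are all hypothetical, and real entity resolution on OFAC data would need considerably more care (aliases, transliteration, partial dates).

```python
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def dedup_key(record: dict) -> tuple:
    """Build a comparison key, preferring the passport number when present."""
    passport = normalize(record.get("passport", ""))
    if passport:
        return ("passport", passport)
    return (
        "name_pob",
        normalize(record["name"]),
        normalize(record.get("place_of_birth", "")),
    )


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = {}
    for record in records:
        seen.setdefault(dedup_key(record), record)
    return list(seen.values())


# Hypothetical parsed records with inconsistent formatting.
records = [
    {"name": "José  Pérez", "place_of_birth": "Bogotá", "passport": ""},
    {"name": "jose perez", "place_of_birth": "Bogota", "passport": ""},
    {"name": "J. Perez", "place_of_birth": "Bogota", "passport": "AB123"},
    {"name": "Jose Perez", "place_of_birth": "Bogota", "passport": "ab123"},
]
unique = deduplicate(records)
```

Note the limitation: this sketch never links a passport-keyed record to a name-keyed one, so the same person can still survive as two entries.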

For the Sanctions Tracker, we aggregated the 69 sanctions programs into 33 more general categories to more cleanly visualize the larger historical trends at work. Cleaning the data in this way makes it possible to answer questions such as: in what year since 1994 were the most entities added to the list (691 in 1998), and in what year were the most removed (1,077 in 1994)?
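Once the records are clean, a question like “which year saw the most additions?” reduces to a tally. The Python sketch below assumes a change log of (date, action, entity id) tuples; that structure, the function name, and the sample rows are hypothetical and are not the tracker's real schema or data.

```python
from collections import Counter
from datetime import date

# Hypothetical change log: (date of change, action, entity id).
change_log = [
    (date(1998, 3, 1), "added", "E1"),
    (date(1998, 7, 15), "added", "E2"),
    (date(2001, 10, 12), "added", "E3"),
    (date(2016, 1, 16), "removed", "E1"),
]


def peak_year(log, action):
    """Return (year, count) for the year with the most changes of one kind.

    Returns None when the log has no changes of that kind.
    """
    per_year = Counter(when.year for when, kind, _ in log if kind == action)
    if not per_year:
        return None
    return per_year.most_common(1)[0]


most_added = peak_year(change_log, "added")
most_removed = peak_year(change_log, "removed")
```

Run against the full cleaned history rather than this toy log, the same tally is the kind of calculation that would surface the peak years cited above.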