Sep 16, 2016 7:00 AM

Algorithms Could Save Book Publishing—But Ruin Novels

From analyzing a book's prospects to figuring out what subjects people are clamoring for, data is bigger in publishing than ever. But how much is too much?

Jodie Archer had always been puzzled by the success of The Da Vinci Code. She’d worked for Penguin UK in the mid-2000s, when Dan Brown’s thriller had become a massive hit, and knew there was no way marketing alone would have led to 80 million copies sold. So what was it, then? Something magical about the words that Brown had strung together? Dumb luck? The questions stuck with her even after she left Penguin in 2007 to get a PhD in English at Stanford. There she met Matthew L. Jockers, a cofounder of the Stanford Literary Lab, whose work in text analysis had convinced him that computers could peer into books in a way that people never could.

Soon the two of them went to work on the “bestseller” problem: How could you know which books would be blockbusters and which would flop, and why? Over four years, Archer and Jockers fed 5,000 fiction titles published over the last 30 years into computers and trained them to "read"—to determine where sentences begin and end, to identify parts of speech, to map out plots. They then used so-called machine classification algorithms to isolate the features most common in bestsellers.

The result of their work (detailed in The Bestseller Code, out this month) is an algorithm built to predict, with 80 percent accuracy, which novels will become mega-bestsellers. What does it like? Young, strong heroines who are also misfits (the type found in The Girl on the Train, Gone Girl, and The Girl with the Dragon Tattoo). No sex, just “human closeness.” Frequent use of the verb “need.” Lots of contractions. Not a lot of exclamation marks. Dogs, yes; cats, meh. In all, the “bestseller-ometer” has identified 2,799 features strongly associated with bestsellers.

What Archer and Jockers have done is just one part of a larger movement in the publishing industry to replace gut instinct and wishful thinking with data. A handful of startups in the US and abroad claim to have created their own algorithms or other data-driven approaches that can help them pick novels and nonfiction topics that readers will love, as well as understand which books work for which audiences. Meanwhile, traditional publishers are doing their own experiments: Simon & Schuster hired its first data scientist last year; in May, Macmillan Publishers acquired the digital book publishing platform Pronoun, in part for its data and analytics capabilities.

While these efforts could bring more profit to an oft-struggling industry, the effect for readers is unclear.

“Part of the beautiful thing about books, unlike refrigerators or something, is that sometimes you pick up a book that you don’t know,” says Katherine Flynn, a partner at Boston-based literary agency Kneerim & Williams. “You get exposed to things you wouldn’t have necessarily thought you liked. You thought you liked tennis, but you can read a book about basketball. It’s sad to think that data could narrow our tastes and possibilities.”

They Know What You Did Last Night

Once, publishers had to rely on unit sales to figure out what readers wanted. Digital reading changed that. Publishers can know that you raced through a novel to the end, or that you abandoned it after 20 pages. They can know where and when you’re reading. On some reading sites and apps, users sign in with their Facebook accounts, opening up more personal data. There’s a wrinkle, though: Companies such as Amazon and Apple have the data for books read on their devices, and they aren’t sharing it with publishers.

London-based startup Jellybooks offers a workaround. Publishers can hire Jellybooks to conduct virtual focus groups, giving readers free ebooks, often in advance of publication, in exchange for their sharing data on how much, when, and where they read. JavaScript is embedded in the books, and at the end of each chapter, readers are asked to click a link that sends the data to Jellybooks. In almost two years, the company has run tests for publishers in the US, England, and Germany, and uncovered one sobering fact: Most novels are abandoned before readers are halfway through them. Jellybooks’s findings can guide publishers on their marketing, and even on whether it’s worth signing an author again. “Hollywood moguls might do test screenings for movies to decide on how much [marketing] budget a movie should get,” says Andrew Rhomberg, the founder of Jellybooks. “That was never done for books.”

The ability to know who reads what and how fast is also driving Berlin-based startup Inkitt. Founded by Ali Albazaz, who started coding at age 10, the English-language website invites writers to post their novels for all to see. Inkitt’s algorithms examine reading patterns and engagement levels. For the best performers, Inkitt offers to act as literary agent, pitching the works to traditional publishers and keeping the standard 15 percent commission if a deal results. The site went public in January 2015 and now has 80,000 stories and more than half a million readers around the world.

Albazaz, now 26, sees himself as democratizing the publishing world. “We never, ever, ever judge the books. That’s not our job. We check that the formatting is correct, the grammar is in place, we make sure that the cover is not pixelated,” he says. “Who are we to judge if the plot is good? That’s the job of the market. That’s the job of the readers.”

We’re about to find out if the approach works. Inkitt recently announced it’s partnering with Tor Books, part of Macmillan Publishers, to publish the young adult fantasy novel Bright Star next summer. Author Erin Swan, a 27-year-old marketing writer who lives in Spanish Fork, Utah, couldn’t get an agent or publisher’s attention when she tried the traditional route, but Inkitt dubbed Bright Star a winner, and now it’s heading to stores.

From Google Search to Amazon

And then there’s the idea of not even waiting for the book to be written. Five-year-old Callisto Media, based in Berkeley, California, uses big-data analysis to find out where there’s an audience clamoring for a nonfiction book that doesn’t yet exist—then hires someone to write it.

Benjamin Wayne, Callisto’s founder and CEO, says his company collects about 60 million pieces of consumer data a month. For example, Callisto studies the search terms Amazon suggests when users start typing in the first few letters, and has found that people frequently search for something that leads to no results. “Consumers are searching for a piece of information, but no product exists to satisfy that consumer demand,” Wayne says. The approach has yielded titles that range from the obvious (The Medical Marijuana Dispensary: Understanding, Medicating, and Cooking with Cannabis) to the less so (Everyday Games for Sensory Processing Disorder).

Wayne says that, based on his own analysis, acquisitions editors pick a winner about 3 percent of the time. “In the world of almost infinite consumer data, the idea that you cannot say with specificity what a consumer will want to buy seems frankly ludicrous,” he says.

Callisto eagerly pursues niche topics, hence titles like The Hashimoto’s 4-Week Plan, which is geared at readers suffering from the autoimmune disease. “We can be profitable on a book that sells about 1,500 copies,” says Wayne. “The traditional industry has to sell a multiple of that before they’ll begin to break even.” Callisto authors follow an outline dictated by data analysis and write quickly—the company aims to bring books to market in as little as nine weeks. After all, readers are Googling that information right now.

Publishers Weekly named Callisto Media one of the fastest-growing independent publishers for 2015 and 2016. So the company seems to have proven its point. But it’s worth asking: Do we only want to read about the things we’re already searching for? Don’t we risk losing the distinction between what’s important and what’s popular? As NPR noted last year, books nominated for prestigious prizes like the Man Booker Prize or the National Book Award typically don’t sell many copies.

“There could be books that don’t hit the bestseller list, which are most books, and they still have an enormous impact,” says Flynn, the literary agent. “They change the field of history, or science, or they have a policy impact, or they’re experimental in some way and inspire other writers.”

The Data Scare

As Archer and Jockers shopped the Bestseller Code manuscript to acquisitions editors, word of their powerful algorithm spread, as did worry and suspicion among those in the publishing profession. “The fear is we can homogenize the market or try and somehow take their jobs away from them, and the answer is no and no,” says Archer. “What the bestseller-ometer is trying to do is say, ‘Hey, pick this new author that you might not dare take a risk on with your acquisitions budget. Their chance is really good.’” Archer, now a writer in Boulder, Colorado, insists that she and Jockers, now an English professor at the University of Nebraska-Lincoln, are “literature-friendly” and want good books to succeed.

Andrew Weber, the global chief operating officer for Macmillan Publishers (whose St. Martin’s Press is publishing The Bestseller Code), thinks algorithms should be viewed as an additional piece of information, rather than as an excuse to fire the editors. “Whether it’s in acquisition, whether it’s in pricing, whether it’s in marketing, whether it’s in distribution, there just seem to be many, many, many opportunities to improve the quality of our decision-making, and therefore hopefully our results, by bringing data into the equation,” says Weber. “I would say we are still in the early days of that journey, but that’s the direction we’re headed.”

Archer and Jockers watched eagerly to see which novel would be their algorithm’s favorite. It turned out to be The Circle, a 2013 technothriller by Dave Eggers about working for a massively powerful Internet company. The Circle spent multiple weeks on both The New York Times hardcover fiction and paperback trade fiction bestseller lists. A movie version starring Emma Watson and Tom Hanks is expected in theaters this year.

The computer found much to love: a strong, young female protagonist whose most-used verbs are “need” and “want.” A three-act plotline that mimics the satisfying one found in Fifty Shades of Grey. A focus on three themes (modern technology, jobs and the workplace, and human closeness).

There was one thing, though, that the algorithm didn't pick up on. “The irony, of course, is his book is about suspicion of big data,” says Archer. “And here is a big data cache smiling at him.”

Susanne Althoff is an assistant professor at Emerson College in Boston.