On text-mining using Google search tools

Content warning: includes examples, motivated by the difficulty of changing species’ common names, that mention ethnic slurs.

Other Warning: longer than usual, and somewhat technical. You’ll be most interested in this post if you’ve ever thought about using web searches to explore changes through time in linguistic usage, interest in fields or topics, and so on. 

Over the last decade or so, my research interests have been sliding a little from science (evolutionary ecology and entomology) towards science studies. (Science studies, for those who don’t know the term, is more or less the study of how science is done and communicated.) This began, I’d say, when I was working on The Scientist’s Guide to Writing and thinking about the cultural norms we’ve developed around scientific writing; and it really took off when I was working on Charles Darwin’s Barnacle and David Bowie’s Spider and thinking about the cultural norms we’ve developed around scientific naming. Beyond those two books, you’ve seen my dalliance with science studies in two preprints (this one about humour in titles of scientific papers, and this one about how the etymology of scientific names may influence scientific attention paid to species). Hey, I did warn you in my very first post here on Scientist Sees Squirrel that I reinvent myself often – as a consequence of having a sadly limited academic attention span.

In a post a couple of weeks ago, I built further on my interest in science studies and naming, asking whether and how we can change the common names of species. My analysis leant heavily on some web search utilities, which I used to track the usage of different English names for species through time. Because I know I’m not the only person to consider using web searches as a research tool, I thought it would be useful to lay out some of the things I’ve learned about these. The bottom line: they offer powerful research tools, but they have some real limitations. I’ll discuss the use of Google Scholar, Google Ngram Viewer, and regular Google web searches. Caveat: I don’t pretend to be an expert, and I haven’t conducted an exhaustive search of the available literature; but I can share some of what I’ve learned.

(1) Google Scholar

Most academics will recognize Google Scholar as a research tool, and know how to use it to find scientific papers. If you’re simply looking for papers to read on a topic, or looking for papers to cite to support a particular point, then the use of Google Scholar is pretty straightforward. But you can also use Scholar in other ways. You might use numbers of hits to track the growth (or decline) of a particular field or research question; or you might use Scholar as a text-mining tool, to track usage of particular words or phrases in the academic literature.

If you’re just looking for some papers to read or cite, you probably won’t mind that given a set of search terms Scholar generates a list of papers, not the list of papers. If you’re using Scholar for other purposes, that might be a crucial distinction (one that’s important to consider for any search tool), so let me expand on it.

We used Google Scholar in our naming-etymology-and-attention preprint, to compile a list of plant-feeding insect species for which a particular kind of scientific question had been asked.* One reviewer of the manuscript wanted us to specify very clearly how we did our search, because they were concerned we hadn’t provided enough information for a reader to replicate our search. The reviewer was quite right that we hadn’t said enough about how we did the search – but quite wrong to expect that we could give a reader enough information so that they could replicate our search. Google Scholar searches are, by design, not (fully) replicable.** That’s what I mean by Scholar delivering a list of papers, not the list of papers.

Any search applies an algorithm to a corpus. The corpus is the set of things that are searched. The algorithm is the process that selects items from the corpus and presents them to the user in some order. Both algorithm and corpus influence the nature of the search (including whether or not it’s replicable by different searchers).

For Google Scholar, the corpus is very large – but it’s not clear how large, or exactly what’s in it. A 2014 estimate suggested it contained (then) about 160 million documents, but what those documents are (what they include or exclude), or how fast the corpus is growing, is not information that’s publicly available. The algorithm for searching within the corpus is probably straightforward – it checks for the occurrence of search terms in the full text of documents making up the corpus – but beyond that, little is clear. It appears that the algorithm is “synchronously replicable”, by which I mean that two users searching simultaneously will get the same search results (we’ll see later that this is not true for regular Google web searches). However, it’s not “asynchronously replicable”: the same user searching at different times may get different results, as the corpus is continually expanded by Google’s web crawlers. That’s the set of results returned. What about the order in which they’re presented? Google Scholar can order search results by date or by “relevance”; the algorithm for relevance ranking is complex and secret, although it seems to rely heavily on citation counts among other things. Relevance rankings appear to be synchronously replicable, but are likely not asynchronously replicable since the ranking algorithm is presumably tweaked at least occasionally.

In summary: Google Scholar searches a very large but poorly defined corpus, and is replicable only for synchronous searches. This doesn’t mean it isn’t valuable, of course – but anyone intending to use it for text-mining style analyses should understand what the tool will and won’t do. Incidentally: most of this is also true for subscription-based academic search engines such as Web of Science and Scopus (although, as they say, a proof of this conjecture is beyond the scope of today’s post).

(2) Google Ngram Viewer

If you haven’t used Google’s Ngram Viewer before, don’t click this link unless you have some time to spare, because it’s really quite addictive. The Ngram Viewer searches Google Books for the frequency of occurrence of “Ngrams” – an Ngram is a single word (a 1-gram) or a short phrase (2-gram, 3-gram, etc.). Actually, you can choose among a number of different corpora based on Google Books (for example, English books, just British English books, or Russian books). The results, plotted by year, are frequencies of the searched Ngram vs. the total of all Ngrams occurring in the corpus. The Ngram plot below, for example, shows the occurrence frequencies for “natural selection”, “double helix”, and “genetic code” in English books from 1800 to 2019, and it has some of the features you’d expect (although unless you’re a historian of biology, you might be surprised just how strongly usage of “natural selection” dropped off between 1900 and 1940).
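The normalization behind those plots is simple: an Ngram’s plotted value for a year is its match count divided by the total count of all Ngrams in the corpus for that year. A minimal sketch (the counts below are invented for illustration, not real Ngram data):

```python
# Sketch of the normalization Ngram Viewer applies: an Ngram's plotted
# frequency for a year is its match count divided by the total count of
# all Ngrams in the corpus for that year.
# The numbers below are invented for illustration, not real Ngram data.

def ngram_frequency(match_count, total_ngrams_that_year):
    """Relative frequency of an Ngram in a given year."""
    return match_count / total_ngrams_that_year

# Suppose "natural selection" matched 1,200 times in a year whose corpus
# contained 40,000,000 2-grams in total:
freq = ngram_frequency(1_200, 40_000_000)
print(f"{freq:.10f}")  # a very small relative frequency, as on the plots
```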

Two weeks ago I showed you another example, in which I used Ngram Viewer to ask whether a recommendation from the American Ornithologists’ Union to change the common name for a sea duck (from oldsquaw to long-tailed duck, to remove the ethnic slur in the older name) actually influenced usage in published books. Doing that analysis led me to explore some finer points of Ngram Viewer. Here’s some of what I found.

Like Google Scholar, Ngram Viewer searches an enormous corpus. As for Google Scholar, you don’t get to know quite what that corpus is. The full list of books included isn’t public, and in fact I haven’t even been able to find the number of books included – in 2011 it was about 8 million, including 4.5 million in English, but later corpora are larger. This means that any biases in the corpus (for example, an apparent overrepresentation of scientific literature) aren’t well understood. In other ways Ngram Viewer is more transparent, though. You can download the raw Ngram data, and searches are both synchronously and asynchronously replicable because you can specify a set corpus that won’t change for repeated searches.

Ngram Viewer has advanced search controls that let you do things like search for words by part of speech (e.g., find “search” the verb but not “search” the noun), search for multiple inflections of a word (e.g., find “search”, “searched”, and “searching” all at once), compare occurrence frequencies across corpora or among search terms, and so on. However, advanced Ngram searches can be clunky and can fail somewhat unpredictably. Notably, some characters such as +, -, and * are used as advanced-search operators, and using them as operators when they also occur in the Ngram you’re looking for seems to be a recipe for chaos. (To be fair, Ngram Viewer is free and doesn’t even show you ads.)

So, if you want to use Ngram Viewer beyond the simplest kind of search, a bit of caution is probably in order. Actually, there are other reasons for caution: beyond the issue of the unknown biases in the corpora, there are effects of scanning errors, noisy metadata that give year-of-occurrence errors, and more; this paper suggests some strategies for mitigating some of the issues.

(3) Google web searches

If we can use Google Ngram Viewer to find trends in usage of words or phrases in books, what about using regular Google web searches to find similar trends for web pages? That’s obviously possible: an advanced Google search lets you bound your search in time (by day/month/year) and returns numbers of “hits” for each search. For example, the search string “double helix” after:2020-12-31 before:2021-04-01 will search for the exact phrase “double helix” in web pages dated between Jan 1 and March 31, 2021 – and it will tell you the approximate number of “hits” (32,700 as I write this). But interpreting this kind of result is a bit complicated.
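Search strings like that one can be generated mechanically for any quarter. A minimal sketch, mirroring the example above (the `after:` and `before:` operators exclude the dates given, so each quarter is bounded by the day before it starts and the day after it ends; the helper name is mine):

```python
from datetime import date, timedelta

def quarter_query(phrase: str, year: int, quarter: int) -> str:
    """Exact-phrase Google query bounded to one calendar quarter.

    Mirrors the example in the text: after: and before: exclude the
    dates given, so the bounds are the day before the quarter starts
    and the first day of the next quarter.
    """
    start = date(year, 3 * (quarter - 1) + 1, 1)
    if quarter == 4:
        end = date(year + 1, 1, 1)  # first day of the next year's Q1
    else:
        end = date(year, 3 * quarter + 1, 1)
    after = start - timedelta(days=1)
    return f'"{phrase}" after:{after.isoformat()} before:{end.isoformat()}'

print(quarter_query("double helix", 2021, 1))
# "double helix" after:2020-12-31 before:2021-04-01
```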

“Complicated” has a few dimensions. But the single biggest issue with using Google web searches for serious analysis of trends in interest or usage is probably this: if Google Scholar is a little bit non-replicable, regular Google web searches are a lot non-replicable. Two users searching at the same time may find different numbers of hits, as may the same user searching at different times, or even at the same time using different browsers. There are many reasons for this. First, the searched corpus is always changing as Google’s web crawlers index more pages, or re-index existing ones. Second, the search algorithm is always changing; or at least, one assumes that it is (the algorithm isn’t public). Third, the algorithm is reputed to “tune” searches to deliver different results for different users and different browsers or devices. Fourth, the web is too large to be a single searchable corpus; instead, different search instances appear to run on indices stored on different Google servers; therefore, in effect they search somewhat (and unknowably) different corpora and may (or may not) extrapolate to estimate results for the “full” corpus. This is probably not an exhaustive list of reasons for non-replicability.

Google web search results are best thought of, then, as estimates of numbers of search results, based on searching samples from a larger corpus. Replicate searches can estimate sampling uncertainty, and comparing searches done by the same user can reduce that uncertainty (by removing the among-user component). The analysis I posted two weeks ago examined usage trends for two sets of older and proposed newer common names for species (following attempts by scientific societies to change each common name). Here’s how I proceeded, using a recommended change from gypsy moth to spongy moth (again, to remove the ethnic slur in the older name) as an illustration. For each quarter (Jan-Mar, Apr-Jun, Jul-Sept, and Oct-Dec) from January 2002 through September 2022, I had web searches conducted*** for each of three search terms: “gypsy moth”, “spongy moth”, and “Lymantria dispar”. Searches were case-insensitive and conducted in triplicate by the same user on the same device (to reduce sampling uncertainty). The sequence in which quarters were searched was randomized before each replicate search set, so that any changes through time in the underlying corpus (for example, as web crawlers index the existing body of pages more fully) wouldn’t be confounded with changes through time resulting from usage shifts.
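The search plan just described can be sketched in a few lines: enumerate the quarters, reshuffle their order before every replicate, and pair each quarter with each search term. This is only an illustrative reconstruction of the procedure, not the actual script used:

```python
import random

# All quarters from 2002 Q1 through 2022 Q3, as (year, quarter) pairs.
quarters = [(y, q) for y in range(2002, 2023) for q in range(1, 5)
            if not (y == 2022 and q == 4)]

terms = ["gypsy moth", "spongy moth", "Lymantria dispar"]

def replicate_search_plan(n_replicates=3, seed=None):
    """Yield (replicate, year, quarter, term), with the quarter order
    freshly shuffled for each replicate, so drift in the underlying
    corpus during the searches isn't confounded with calendar time."""
    rng = random.Random(seed)
    for rep in range(1, n_replicates + 1):
        order = quarters[:]  # reshuffle before every replicate set
        rng.shuffle(order)
        for year, quarter in order:
            for term in terms:
                yield rep, year, quarter, term

plan = list(replicate_search_plan(seed=1))
print(len(plan))  # 3 replicates x 83 quarters x 3 terms = 747 searches
```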

Google web search results are numbers of “hits” in the searched corpus, rather than frequencies relative to anything else. As a result, they will be sensitive to both the number of web pages in the corpus (which will increase through time with the growth of the Web) and to general societal interest in the topic in question. Depending on the research question, one might want to correct for only the first effect, or for both. For my analysis, I was interested in asking whether the newer common name is replacing the older (not the gross use of either). Therefore, I divided the counts for the older and new common names by counts for Lymantria dispar. Since L. dispar (the moth’s scientific name) did not change, this should correct for both web-growth and interest-in-the-species effects. (The alternative, dividing counts for the older name by those for the newer name, would have done the same, but wouldn’t be able to detect any asynchrony in the waning of one name and the growth of the other.)
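The normalization described above amounts to dividing each common name’s hit counts by the counts for the scientific name, replicate by replicate, then averaging. A minimal sketch with invented hit counts (not my actual data):

```python
# Sketch of the normalization described in the text: divide hit counts
# for each common name by hits for the unchanged scientific name
# (Lymantria dispar), then average across the three replicates.
# The counts below are invented for illustration.

def normalized_usage(name_hits, scientific_hits):
    """Per-replicate ratios of common-name hits to scientific-name hits."""
    return [n / s for n, s in zip(name_hits, scientific_hits)]

def mean(xs):
    return sum(xs) / len(xs)

# Three replicate hit counts for one quarter (invented numbers):
gypsy  = [30_400, 29_800, 31_100]
spongy = [1_900, 2_100, 2_000]
dispar = [12_100, 11_800, 12_300]

gypsy_ratio = mean(normalized_usage(gypsy, dispar))
spongy_ratio = mean(normalized_usage(spongy, dispar))
print(round(gypsy_ratio, 2), round(spongy_ratio, 2))
```

Because the scientific name appears in the denominator of both ratios, growth of the Web and changing interest in the species cancel out of each series, while the two names’ trajectories can still be tracked separately.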

Here (above) are the results. You can see, as expected, moderate variation among replicate searches – the line connects averages of the three replicates, and the replicates are the small vertical ticks. It looks like normalization by the scientific name was successful in removing effects of the growth of the Web through time. Turning to the reason for my searches: you can also see some evidence for growth in the use of the new name spongy moth, but no evidence (yet) for abandonment of the older name following its “official” abandonment (left dotted line) or the announcement of its replacement (right dotted line). Formal statistical methods (for analysis of breakpoints in time series) would be useful here, but must be left for future work. For more detail about this particular case, see the full post here.
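For the curious: one simple flavour of breakpoint analysis fits separate least-squares lines before and after each candidate breakpoint and keeps the split minimizing the total residual sum of squares. This is only an illustrative sketch on an invented series, standing in for the formal methods deferred above:

```python
# Illustrative breakpoint search: fit a least-squares line to each side
# of every candidate split and keep the split with the lowest combined
# residual sum of squares. The series below is invented.

def fit_rss(xs, ys):
    """Residual sum of squares for a least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
             if sxx else 0.0)
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))

def best_breakpoint(xs, ys, min_seg=3):
    """Index k minimizing total RSS of lines fit to xs[:k] and xs[k:]."""
    best_k, best_rss = None, float("inf")
    for k in range(min_seg, len(xs) - min_seg + 1):
        rss = fit_rss(xs[:k], ys[:k]) + fit_rss(xs[k:], ys[k:])
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k

# Invented quarterly series: flat at 1.0, then a jump and steady rise.
xs = list(range(20))
ys = [1.0] * 10 + [2.0 + 0.2 * i for i in range(10)]
print(best_breakpoint(xs, ys))  # the invented series breaks at index 10
```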

In summary: yes, you can use Google web searches to track linguistic usage, and/or interest in a topic, through time – but it’s not nearly as simple as you might think. (All the same issues presumably apply to other web search engines.)

(4) A broader issue: when are imperfect data better than none?

It’s easy to find problems with any of these three text-mining tools – I’ve mentioned quite a few, and there’s plenty of literature demonstrating others. So why use them? This comes, I think, to a fundamental question in science: when are imperfect data better than no data? Google Ngram data, for instance, are based on an unknown corpus that’s likely biased in its representation of all published books. But it’s also something that doesn’t otherwise exist: an absolutely massive sampling of published books in which we can look for patterns.

We have two choices here: proceed with caution, or refuse and wait for perfect data. Presumably, that perfect data would be a complete corpus consisting of the text of all books ever published – so those choosing that option should get comfortable, as I expect it to be a long wait.

We almost always have those same choices. I work (some of the time) in forests. We know a lot about trees short enough to be sampled with a pole pruner, those that grow in thinned stands that are easy to walk through, and those that grow close to roads – just for starters. A very tall tree in an unthinned stand far from any road could be ecologically unique, and we’d probably never detect that! Not every system has those particular issues, but all systems have some issues. We shouldn’t ignore them; but if we insist on waiting for perfect data science simply won’t progress.

A big part of our job as scientists, then, is to find ways to proceed while mitigating the known issues with our data. Another big part is remaining skeptical and willing to update our understanding of the world when we get better data. You’ve seen here some of my attempts to deal with the problems in Google text-mining. Stick around a few years, and maybe you’ll see me change my mind about the patterns I’ve managed to reveal.

© Stephen Heard  December 6, 2022

Images: Google Ngram for “natural selection”, “genetic code”, and “double helix” produced using Google Ngram viewer; usage plots for the moth names are my own work.

* In particular, we searched the literature to find insect or mite species that had been surveyed for “host-associated differentiation” – that is, for the evolution of genetically distinct forms associated with different host plants. Our particular reason for doing a Scholar search doesn’t matter, for today’s purposes, but if you’d like to know more you can read about the project (and access the preprint) here.

** This didn’t matter to our use of the tool. We didn’t care if our search was replicable – that is, if we got the same list of papers as someone else might get, or that we’d get with the same search string a year later. We only cared that our search returned papers without bias in the etymology of the scientific names of the studied species. Our own recollection might well have this bias (read about our work to see why we think so!), but it seemed highly unlikely that Google Scholar’s algorithms would.

*** Thanks to Ben Dow for patiently conducting about a zillion searches.

