Robot Humor

Text processing algorithms are notoriously bad at handling humor. The subtle, contradictory humor of irony and sarcasm can be particularly hard to detect automatically.

If, for example, I wrote, “Sharknado 2 is my favorite movie,” an algorithm would most likely take that statement at face value. It would find the word “favorite” to be highly correlated with positive sentiment. Along with some simple parsing, it might then reasonably infer that I was making a positive statement about an entity of type “movie” named “Sharknado 2.”
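To make that failure mode concrete, here is a minimal sketch of the kind of lexicon-based scoring an algorithm might apply; the tiny word lists are invented for illustration rather than drawn from any real sentiment lexicon.

```python
# A minimal sketch of lexicon-based sentiment scoring.
# The tiny word lists below are invented for illustration only.
POSITIVE = {"favorite", "great", "love", "enjoyable"}
NEGATIVE = {"terrible", "awful", "hate", "boring"}

def naive_sentiment(text):
    """Count lexicon hits; irony is invisible to this kind of scoring."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(naive_sentiment("Sharknado 2 is my favorite movie"))  # -> positive
```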

Yet, if I were indeed to write “Sharknado 2 is my favorite movie,” you, a human reader, might think I meant the opposite. Perhaps I mean “Sharknado 2 is a terrible movie,” or, more generously, “Sharknado 2 is my favorite movie only insofar as it is so terrible that it’s enjoyably bad.”

This broader meaning is not indicated anywhere in the text, yet a human might infer it from the mere fact that…why would Sharknado 2 be my favorite movie?

There was nothing deeply humorous in that toy example, but perhaps you can see the root of the problem.

Definitionally, irony means expressing meaning “using language that normally signifies the opposite,” making it a linguistic maneuver which is fundamentally difficult to operationalize. A priori, how can you tell when I’m being serious and when I’m being ironic?

Humans are reasonably good at this task – though, suffering from resting snark voice myself, I do often feel the need to clarify when I’m not being ironic.

Algorithms, on the other hand, perform poorly on this task. They just can’t tell the difference.

This is an active area of natural language processing research, and progress is being made. Yet it seems a shame for computers to be missing out on so much humor.

I feel strongly that, should the robot uprising come, I’d like our new overlords to appreciate humor.

Something would be lost in a world without sarcasm.


Normalizing the Non-Standard

I recently read Eisenstein’s excellent “What to do about bad language on the internet,” which explores the challenge of using Natural Language Processing on “bad” – that is, non-standard – text.

I take Eisenstein’s use of the normative word “bad” here somewhat ironically. He argues that researchers dislike non-standard text because it complicates NLP analysis, but it is only “bad” in this narrow sense. Furthermore, while the effort required to analyze such text may be frustrating, efforts to normalize these texts are potentially worse.

It has been well documented that NLP approaches trained on formal texts, such as the Wall Street Journal, perform poorly when applied to less formal texts, such as Twitter data. Intuitively this makes sense: most people don’t write like the Wall Street Journal on Twitter.

Importantly, Eisenstein quickly does away with common explanations for the prevalence of poor language on Twitter. Citing Drouin and Davis (2009), he notes that there are no significant differences in the literacy rates of users who do or do not use non-standard language. Further studies also dispel notions that users are too lazy to type correctly, that Twitter’s character limit forces unnatural contractions, or that phone autocorrect has simply run amok.

In short, most users employ non-standard language because they want to. Their grammar and word choice intentionally convey meaning.

In normalizing this text, then – in moving it towards the unified standards on which NLP classifiers are trained – researchers explicitly discard important linguistic information. Importantly, this approach has implications not only for research, but for language itself. As Eisenstein argues:

By developing software that works best for standard linguistic forms, we throw the weight of language technology behind those forms, and against variants that are preferred by disempowered groups. …It strips individuals of any agency in using language as a resource to create and shape their identity.

This concern is reminiscent of James C. Scott’s Seeing Like a State, which raises deep concerns about the power of a centralized, administrative state. In order to function effectively and efficiently, an administrative state needs to be able to standardize certain things – weights and measures, property norms, names, and language all have implications for taxation and distribution of resources. As Scott argues, this tendency towards standardization isn’t inherently bad, but it is deeply dangerous – especially when combined with things like a weak civil society and a powerful authoritarian state.

Scott argues that state imposition of a single, official language is “one of the most powerful state simplifications,” which lays the groundwork for additional normalization. The state process of normalizing language, Scott writes, “should probably be viewed, as Eugen Weber suggests in the case of France, as one of domestic colonization in which various foreign provinces (such as Brittany and Occitanie) are linguistically subdued and culturally incorporated. …The implicit logic of the move was to define a hierarchy of cultures, relegating local languages and their regional cultures to, at best, a quaint provincialism.”

This is a bold claim, yet not entirely unfounded.

While there is further work to be done in this area, there is good reason to think that the “normalization” of language disproportionately affects people who are outside the norm along other social dimensions. These marginalized communities – marginalized, incidentally, because they fall outside whatever is defined as the norm – develop their own linguistic styles. Those linguistic styles are then in turn disparaged and even erased for falling outside the norm.

Perhaps one of the most well-documented examples of this is Su Lin Blodgett and Brendan O’Connor’s study on Racial Disparity in Natural Language Processing. As Eisenstein points out, it is trivially impossible for Twitter to represent a coherent linguistic domain – users around the globe use Twitter in numerous languages.

The implicit pre-processing step, then, before even normalizing “bad” text to be in line with dominant norms, is to restrict analysis to English-language text. Blodgett and O’Connor find that tweets from African-American users are over-represented among the tweets that are thrown out for being non-English.
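For concreteness, this is roughly what that filtering step looks like in practice – a minimal sketch using the off-the-shelf langdetect package (not the classifiers evaluated in the paper), with made-up tweets standing in for real data:

```python
# A sketch of the "keep only English" pre-processing step, using the
# off-the-shelf langdetect package (pip install langdetect) as one example.
# The tweets are made-up placeholders.
from langdetect import detect

tweets = [
    "just got home, so tired",
    "on my way to the game right now",
    "c'est magnifique",
]

english_only = []
for tweet in tweets:
    try:
        if detect(tweet) == "en":
            english_only.append(tweet)
    except Exception:
        pass  # detection can fail on very short or unusual text

# Anything the detector mislabels as non-English is silently dropped here,
# which is exactly how tweets from dialect speakers can vanish from a sample.
```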

Dealing with non-standard text is not easy. Dealing with a living language that can morph in a matter of days or even hours (#covfefe) is not easy. There’s no getting around the fact that researchers will have to make difficult calls in how to process this information and how to appropriately manage dimensionality reduction.

But the worst thing we can do is to pretend that it is not a matter of concern; to begin our work by thoughtlessly filtering and normalizing without giving significant thought to what we’re discarding and what that discarded data represents.


Social and Algorithmic Bias

A commonly lamented problem in machine learning is that algorithms are biased. This bias can come from different sources and be expressed in different ways, sometimes benignly and sometimes dramatically.

I don’t disagree that there is bias in these algorithms, but I’m inclined to argue that in some senses, this is a feature rather than a bug. That is: all methodological choices are biased, all data are biased, and all models are wrong, strictly speaking. The problem of bias in research is not new, and the current wave of despair is simply a reframing of this problem with automated approaches as the culprit.

To be clear, there are serious cases in which algorithmic biases have led to deeply problematic outcomes. For example, when a proprietary, black box algorithm regularly suggests stricter sentencing for black defendants and those suggestions are taken to be unbiased, informed wisdom – that is not something to be taken lightly.

But what I appreciate about the bias of algorithmic methods is the visibility of their bias; that is – it gives us a starting point for questioning, and hopefully addressing, the inherent social biases. Biases that we might otherwise be blind to, given our own personal embedding in the social context.

After all, strictly speaking, an algorithm isn’t biased; its human users are. Humans choose what information becomes recorded data and they choose which data to feed into an algorithm. Fundamentally, humans – both as specific researchers and through the broader social context – choose what counts as information.

As urban planner Bent Flyvbjerg writes: Power is knowledge. Those with power not only hold the potential for censorship, but they play a critical role in determining what counts as knowledge. In his ethnographic work in rural Appalachia, John Gaventa similarly argues that a society’s power dynamics become so deeply entrenched that the people embedded in that society no longer recognize these power dynamics at all. They take for granted a shared version of fact and reality which is far from the unbiased Truth we might hope for – rather, it is a reality shaped by the role of power itself.

In some ways, algorithmic methods may exacerbate this problem – as algorithmic bias is applied to documents resulting from social bias – but a skepticism of automated approaches opens the door to deeper conversations about biases of all forms.

Ted Underwood argues that computational algorithms need to be fundamentally understood as tools of philosophical discourse, as “a way of reasoning.” These algorithms – even something as seemingly benign as rank-ordered search results – deeply shape what information is available and how it is perceived.

I’m inclined to agree with Underwood’s sentiment, but to expand his argument broadly to a diverse set of research methods. Good scientists question their own biases and they question the biases in their methods – whether those methods are computational or not. All methods have bias. All data are biased.

Automated methods, with their black-box aesthetic and hopefully well-documented Git pages, may make it easier to do bad science, but for good scientists, they convincingly raise the specter of bias, implicit and explicit, in methods and data.

And those are concerns all researchers should be thinking about.

 


Bag of Words

A common technique in natural language processing involves treating a text as a bag of words. That is, rather than preserving the order in which words appear, these automated approaches begin by simply examining words and word frequencies. In this sense, the document is reduced from a well-ordered, structured object to a metaphorical bag of words from which order has been discarded.
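As a quick illustration, here is a minimal bag-of-words representation using scikit-learn’s CountVectorizer; the two example sentences differ only in word order and end up with identical vectors.

```python
# A minimal bag-of-words representation with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the mat sat on the cat",  # same words, different order
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # identical rows: word order has been discarded
```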

Numerous studies have found the bag of words approach to be sufficient for most tasks, yet this finding is somewhat surprising – even shocking, as Grimmer and Stewart note – given the reduction of information represented by this act.

Other pre-processing steps for dimensionality reduction seem intuitively less dramatic. Removing stop words like “the” and “a” seems a reasonable way of focusing on core content words without getting bogged down in the details of grammar. Lemmatization, which maps each word to its base form, also makes sense – assuming it’s done correctly. Most of the time, it doesn’t matter much whether I say “community” or “communities.”
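A minimal sketch of those two steps, using NLTK (one common choice among several); it assumes the relevant NLTK data has already been downloaded.

```python
# Stop word removal and lemmatization with NLTK.
# Assumes: pip install nltk, then nltk.download("stopwords") and nltk.download("wordnet").
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "the communities are building a shared identity"

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

content_words = [w for w in text.split() if w not in stop_words]  # drops "the", "are", "a"
lemmas = [lemmatizer.lemmatize(w) for w in content_words]         # "communities" -> "community"
print(lemmas)
```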

But reducing a text – which presumably has been well written and carefully follows the rules of its language’s grammar – seems surprisingly profane. Do you lose so little when taking Shakespeare or Homer as a bag of words? Even the suggestion implies a disservice to the poetry of language. Word order is important.

Why, then, is a bag of words approach sufficient for so many tasks?

One possible explanation is that computers and humans process information differently. For a human reading or hearing a sentence, word order helps them predict what is to come next. It helps them process and make sense of what they are hearing as they are hearing it. To make sense of this complex input, human brains need this structure.

Computers may have other shortcomings, but they don’t feel the anxious need to understand input and context as it is received. Perhaps bag of words works because – while word order is crucial for the human brain – it provides unnecessary detail for the processing style of a machine.

I suspect there is truth in that explanation, but I find it unsatisfactory. It implies that poetry and beauty are relevant to the human mind alone – that these are artifacts of processing rather than inherent features of a text.

I prefer to take a different approach: the fact that bag of words models work actually emphasizes the flexibility and beauty of language. It highlights the deep meaning embedded in the words themselves and illustrates just how much we communicate when we communicate.

Linguistic philosophers often marvel that we can manage to communicate at all – the words we exchange may not mean the same thing to me as they do to you. In fact, they almost certainly do not.

In this sense, language is an architectural wonder; a true feat of achievement. We convey so much with subtle meanings of word choice, order, and grammatical flourishes. And somehow through the cacophony of this great symphony – which we all experience uniquely – we manage to schedule meetings, build relationships, and think critically together.

Much is lost in translating the linguistic signal between me and you. We miss each other’s context and reinterpret the subtle flavors of each word. We can hear a whole lecture without truly understanding, even if we try.

And that, I think, is why the bag of words approach works. Linguistic signals are rich, they are fiercely high-dimensional and full of more information than any person can process.

Do we lose something when we reduce dimensionality? When we discard word order and treat a text as a bag of words?

Of course.

But that isn’t an indication of the gaudiness of language; rather, it is a tribute to its profound persistence.


Computational Models of Belief Systems & Cultural Systems

While work on belief systems is similar to the research on cultural systems – both use agent-based models to explore how complex systems evolve given a simple set of actor rules and interactions – there are important conceptual differences between the two lines of work.

Research on cultural systems takes a macro-level approach, seeking to explain if, when, and how distinctive communities of similar traits emerge, while research on belief systems uses comparable methods to understand if, when, and how distinctive individuals come to agree on a given point.

The difference between these approaches is subtle but notable. The cultural systems approach begins with the observation that distinctive cultures do exist, despite local tendencies for convergence, while research on belief systems begins from the observation that groups of people are capable of working together, despite heterogeneous opinions and interests.

In his foundational work on cultural systems, Axelrod begins, “despite tendencies towards convergence, differences between individuals and groups continue to exist in beliefs, attitudes, and behavior” (Axelrod, 1997).

Compare this to how DeGroot begins his exploration of belief systems: “consider a group of individuals who must act together as a team or committee, and suppose that each individual in the group has his own subjective probability distribution for the unknown value of some parameter. A model is presented which describes how the group might reach agreement on a common subjective probability distribution for the parameter by pooling their individual opinions” (DeGroot, 1974).

In other words, while cultural models seek to explain the presence of homophily and other system-level traits, belief system models more properly seek to capture deliberative exchange. The important methodological difference here is that cultural systems model agent change as a function of similarity, while belief systems model agent change as a process of reasoning.
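The contrast is easy to see in code. Below is a minimal sketch of DeGroot-style pooling, in which each agent repeatedly revises its estimate toward a weighted average of the group’s estimates; the trust weights and initial beliefs are arbitrary examples.

```python
# A minimal sketch of DeGroot-style opinion pooling (DeGroot, 1974).
# The trust matrix T and initial beliefs are arbitrary examples; each row
# of T sums to 1 and gives how much that agent weights everyone's opinion.
import numpy as np

T = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
])
beliefs = np.array([0.9, 0.5, 0.1])  # each agent's initial subjective estimate

for _ in range(50):
    beliefs = T @ beliefs  # revise toward a weighted average of others' beliefs

print(beliefs)  # the group converges toward a common estimate
```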

 


Computational Models of Cultural Systems

Computational approaches to studying the broader social context can be found in work on the emergence and diffusion of communities in cultural systems. Spicer makes an anthropological appeal for the study of such systems, arguing that cultural change can only be properly considered in relation to more stable elements of culture. These persistent cultural elements, he argues, can best be understood as ‘identity systems,’ in which individuals bestow meaning on symbols. Spicer notes that there are collective identity systems (i.e., culture) as well as individual systems, and chooses to focus his attention on the former. Spicer talks about these systems in implicitly network terms: identity systems capture “relationships between human beings and their cultural products” (Spicer, 1971). To the extent that individuals share the same relationships with the same cultural products, they are united under a common culture; they are, as Spicer says, “a people.”

Axelrod presents a more robust mathematical model for studying these cultural systems. Similar to Schelling’s dynamic models of segregation, Axelrod imagines individuals interacting through processes of social influence and social selection (Axelrod, 1997). Agents are described with n-length vectors, with each element initialized to a value between 0 and m. The elements of the vector represent cultural dimensions (features), and the value of each element represents an individual’s state along that dimension (traits). Two individuals with the exact same vector are said to share a culture, while, in general, agents are considered culturally similar to the extent to which they hold the same trait for the same feature. Agents on a grid are then allowed to interact: two neighboring agents are selected at random. With a probability equal to their cultural similarity, the agents interact. An interaction consists of selecting a random feature on which the agents differ (if there is one), and updating one agent’s trait on this feature to its neighbor’s trait on that feature. This simple model captures both the process of choice homophily, as agents are more likely to interact with similar agents, and the process of social influence, as interacting agents become more similar over time. Perhaps the most surprising finding of Axelrod’s approach is just how complex this cultural system turns out to be. Despite the model’s simple rules, he finds that it is difficult to predict the ultimate number of stable cultural regions based on the system’s n and m parameters.
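Because the rules are so simple, the whole dynamic fits in a few lines of code. The sketch below follows the description above rather than Axelrod’s original implementation: grid size, the number of features n, the number of traits m, the step count, and the use of wraparound neighborhoods are all arbitrary choices here.

```python
# A compact sketch of an Axelrod-style culture model, following the
# description above. Grid size, n, m, step count, and the wraparound
# neighborhood are arbitrary simplifications.
import random

SIZE, N_FEATURES, M_TRAITS, STEPS = 10, 5, 10, 200_000

grid = {(x, y): [random.randrange(M_TRAITS) for _ in range(N_FEATURES)]
        for x in range(SIZE) for y in range(SIZE)}

def neighbors(x, y):
    return [((x + dx) % SIZE, (y + dy) % SIZE)
            for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]

for _ in range(STEPS):
    a = random.choice(list(grid))            # random agent
    b = random.choice(neighbors(*a))         # random neighbor
    va, vb = grid[a], grid[b]
    similarity = sum(p == q for p, q in zip(va, vb)) / N_FEATURES
    if random.random() < similarity:         # interact with prob = similarity
        differing = [i for i in range(N_FEATURES) if va[i] != vb[i]]
        if differing:
            i = random.choice(differing)
            va[i] = vb[i]                    # adopt the neighbor's trait

print(len({tuple(v) for v in grid.values()}), "distinct cultural profiles remain")
```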

This concept of modeling cultural convergence through simple social processes has maintained a foothold in the literature and has been slowly gaining more widespread attention. Bednar and Page take a game-theoretic approach, imagining agents who must play multiple cognitively taxing games simultaneously. Their finding that in these scenarios “culturally distinct behavior is likely and in many cases unavoidable” (Bednar & Page, 2007) is notable because classic game-theoretic models fail to explain the emergence of culture at all: rather, rational agents simply maximize their utility and move on. In their simultaneous game scenarios, however, cognitively limited agents adopt the strategies that can best be applied across the tasks they face. Cultures, then, emerge as “agents evolve behaviors in strategic environments.” This finding underscores Granovetter’s argument about embeddedness (M. Granovetter, 1985): distinctive cultures emerge because regional contexts influence adaptive choices, which in turn influence an agent’s environment.

Moving beyond Axelrod’s grid implementation, Flache and Macy (Flache & Macy, 2011) consider agent interaction on the small-world network proposed by Watts and Strogatz (Watts & Strogatz, 1998). This model randomly rewires a grid with select long-distance ties. Following Granovetter’s strength of weak ties theory (M. S. Granovetter, 1973), the rewired edges in the Watts-Strogatz model should bridge clusters and promote cultural diffusion. Flache and Macy also introduce the notion of the valence of interaction, considering social influence along dimensions of assimilation and differentiation, and taking social selection to consist of either attraction or xenophobia. In systems with only positively valenced interaction (assimilation and attraction), they find that the ‘weak’ ties have the expected result: cultural signals diffuse and the system tends towards cultural integration. However, introducing negatively valenced interactions (differentiation and xenophobia) leads to cultural polarization, resulting in deep disagreement between communities which themselves have high internal consensus.
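Generating that kind of interaction topology is straightforward with networkx; the parameters below are arbitrary, and note that networkx’s standard Watts-Strogatz generator rewires a ring lattice rather than a grid, so this is only an approximation of the setup described above.

```python
# A small-world interaction topology via networkx's Watts-Strogatz generator.
# Parameters are arbitrary; the standard generator rewires a ring lattice
# rather than a grid, so this only approximates the setup described above.
import networkx as nx

G = nx.connected_watts_strogatz_graph(n=100, k=4, p=0.1)  # ~10% of edges rewired

print("average shortest path:", nx.average_shortest_path_length(G))
print("average clustering:", nx.average_clustering(G))
```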


The Joint Effects of Content and Style on Debate Outcomes

I am heading out later today to the Midwest Political Science Association (MPSA) conference. My advisor, Nick Beauchamp, will be presenting our joint work on “The Joint Effects of Content and Style on Debate Outcomes.”

Here is the abstract for that work:

Debate and deliberation play essential roles in politics and government, but most models presume that debates are won mainly via superior style or agenda control. Ideally, however, debates would be won on the merits, as a function of which side has the stronger arguments. We propose a predictive model of debate that estimates the effects of linguistic features and the latent persuasive strengths of different topics, as well as the interactions between the two. Using a dataset of 118 Oxford-style debates, our model’s combination of content (as latent topics) and style (as linguistic features) allows us to predict audience-adjudicated winners with 74% accuracy, significantly outperforming linguistic features alone (66%). Our model finds that winning sides employ stronger arguments, and allows us to identify the linguistic features associated with strong or weak arguments.
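This is not the model from the paper, but the general recipe – content features from latent topics combined with stylistic features, feeding a classifier – can be sketched with off-the-shelf tools. Everything below, from the placeholder transcripts to the choice of logistic regression, is illustrative only.

```python
# An illustrative sketch of combining content (latent topics) and style
# (surface linguistic features) to predict debate winners. This is NOT the
# model from the paper; data and feature choices are placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

transcripts = [
    "we should expand the program because the evidence shows it works",
    "the costs are enormous and the evidence is weak at best",
    "history shows that markets solve this better than mandates",
    "regulation protects consumers when markets fail them",
]
winners = np.array([1, 0, 0, 1])  # placeholder labels: did this side win?

counts = CountVectorizer(stop_words="english").fit_transform(transcripts)
topics = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(counts)
style = np.array([[len(t.split()), t.count("we ")] for t in transcripts])  # crude style features

X = np.hstack([topics, style])
model = LogisticRegression(max_iter=1000).fit(X, winners)
print(model.predict(X))
```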


Demographic bias in social media language analysis

Before the break, I had the opportunity to hear Brendan O’Connor talk about his recent paper with Su Lin Blodgett and Lisa Green: Demographic Dialectal Variation in Social Media: A Case Study of African-American English.

Imagine an algorithm designed to classify sentences. Perhaps it identifies the topic of the sentence or perhaps it classifies the sentiment of the sentence. These algorithms can be really accurate – but they are only as good as the corpus they are trained on.

If you train an algorithm on the New York Times and then try to classify tweets, for example, you may not have the kind of success you might like – the language and writing style of the Times and of a typical tweet are just that different.

There’s a lot of interesting stuff in the Blodgett et al. paper, but perhaps most notable to me is their comparison of the quality of existing language identification tools on tweets by race. They find that these tools perform poorly on text associated with African Americans while performing better on text associated with white speakers.

In other words, if you got a big set of Twitter data and filtered out the non-English tweets, that algorithm would disproportionately identify tweets from black authors as not being in English, and those tweets would then be removed from the dataset.

Such an algorithm, trained on white language, has the unintentional effect of literally removing voices of color.

Their paper presents a classifier to eliminate that disparity, but the study is an eye-opening finding – a cautionary tale for anyone undertaking language analysis. If you’re not thoughtful and careful in your approach, even the most validated classifier may bias your data sample.


Visualizing Pareto Fronts

As the name implies, multi-objective optimization problems are a class of problems in which one seeks to optimize over multiple, conflicting objectives.

Optimizing over one objective is relatively easy: given information on traffic, a navigation app can suggest which route it expects to be the fastest. But if you have multiple objectives the problem becomes complicated: if, for example, you want a reasonably fast route that won’t use too much gas and gives you time to take in the view outside your window.

Or, perhaps, you have multiple deadlines pending and you want to do perfectly on all of them, but you also have limited time and would like to eat and maybe sleep sometime, too. How do you prioritize your time? How do you optimize over all the possible things you could be doing?

This is not easy.

Rather than having a single, optimal solution, these problems have a set of solutions, known as the Pareto front. Each solution on the front is mathematically optimal in the sense that no objective can be improved without worsening another, but each represents a different trade-off among the objectives.
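For a concrete picture, here is a minimal sketch of pulling the Pareto front out of a set of candidate solutions, assuming two objectives that we want to minimize (say, travel time and fuel used); the candidate values are made up.

```python
# A minimal sketch of extracting the Pareto front from candidate solutions.
# Both objectives (say, minutes of travel and liters of fuel) are minimized;
# the candidate values are made up for illustration.
candidates = [(25, 3.1), (30, 2.2), (28, 2.9), (40, 1.8), (26, 3.5)]

def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

pareto_front = [c for c in candidates
                if not any(dominates(other, c) for other in candidates)]
print(pareto_front)  # the non-dominated trade-offs; (26, 3.5) drops out
```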

Using 3D Rad-Viz, Ibrahim et al. have visualized the complexity of the Pareto front, showing the bumpy landscape these solution spaces have.
[Figure: 3D Rad-Viz visualization of a Pareto front, from Ibrahim et al.]

Chen et al. take a somewhat different approach – designing a tool to allow a user to interact with the Pareto front, visually seeing the trade-offs each solution implicitly makes and allowing a user to select the solutions they see as best meeting their needs:

[Figure: Chen et al.’s interactive Pareto front visualization tool]


The Use of Faces to Represent Points in k-Dimensional Space Graphically

This is my new favorite thing.

Herman Chernoff’s 1972 paper, “The Use of Faces to Represent Points in k-Dimensional Space Graphically.” The name is pretty self-explanatory: it’s an attempt to represent high dimensional data…through the use, as Chernoff explains, of “a cartoon of a face whose features, such as length of nose and curvature of mouth, correspond to components of the point.”

Here’s an example:

[Figure: example Chernoff faces from Chernoff (1972)]

I just find this hilarious.

But, as crazy as this approach may seem – there’s something really interesting about it. Most standard efforts to represent high-dimensional data revolve around projecting that data into lower-dimensional (e.g., 2-dimensional) space. This allows the data to be shown on standard plots, but risks losing something valuable in the data compression.

Showing k-dimensional data as cartoon faces is probably not the best solution, but I appreciate the motivation behind it – the question, ‘how can we present high-dimensional data high-dimensionally?’
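To get a feel for the idea, here is a toy sketch in matplotlib that maps three components of a data point onto face width, eye size, and mouth curvature; it is a rough homage, not a faithful re-implementation of Chernoff’s 1972 scheme.

```python
# A toy Chernoff-style face: three components of a data point drive face
# width, eye size, and mouth curvature. A rough homage, not a faithful
# re-implementation of Chernoff (1972).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def toy_face(ax, point):
    width, eye, smile = point  # each component assumed scaled to roughly [0, 1]
    ax.add_patch(Ellipse((0.5, 0.5), 0.3 + 0.4 * width, 0.8, fill=False))  # head
    for x in (0.38, 0.62):                                                 # eyes
        ax.add_patch(Ellipse((x, 0.62), 0.05 + 0.1 * eye, 0.05 + 0.1 * eye, fill=False))
    xs = np.linspace(0.38, 0.62, 50)                                       # mouth
    ax.plot(xs, 0.35 + (smile - 0.5) * 10 * ((xs - 0.5) ** 2 - 0.0144), color="k")
    ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_aspect("equal"); ax.axis("off")

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, point in zip(axes, [(0.1, 0.2, 0.9), (0.5, 0.5, 0.5), (0.9, 0.8, 0.1)]):
    toy_face(ax, point)
plt.show()
```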
