Facts, Power, and the Bias of AI

I spent last Friday and Saturday at the 7th Annual Text as Data conference, which draws together scholars from many different universities and disciplines to discuss developments in text as data research. This year’s conference, hosted by Northeastern, featured a number of great papers and discussions.

I was particularly struck by a comment from Joanna J. Bryson as she presented her work with Aylin Caliskan-Islam and Arvind Narayanan on A Story of Discrimination and Unfairness: Using the Implicit Bias Task to Assess Cultural Bias Embedded in Language Models:

There is no neutral knowledge.

This argument becomes especially salient in the context of artificial intelligence: we tend to think of algorithms as neutral, fact-based processes which are free from the biases we experience as humans. But that simplification is deeply flawed. As Bryson argued, AI won’t be neutral if it’s based on human culture; there is no neutral knowledge.

This argument resonates deeply with me, and I find it particularly interesting through the lens of an increasingly relativistic world, in which facts are increasingly seen as matters of opinion.

To complicate matters, there is no clear normative judgment that can be applied to such relativism: on the one hand, this allows for embracing diverse perspectives, which is necessary for a flourishing, pluralistic world. On the other hand, nearly a quarter of high school government teachers in the U.S. report that parents or others would object if they discussed politics in a government classroom.

Discussing “current events” in a neutral manner is becoming increasingly challenging if not impossible.

This comment also reminds me of the work of urban planner Bent Flyvbjerg who turns an old axiom on its head to argue that “power is knowledge.” Flyvbjerg’s concern doesn’t require a complete collapse into relativism, but rather argues that “power procures the knowledge which supports its purposes, while it ignores or suppresses that knowledge which does not serve it.” Power, thus, selects what defines knowledge and ultimately shapes our understanding of reality.

In his work with rural coal miners, John Gaventa further showed how such power dynamics can become deeply entrenched, so the “powerless” don’t even realize the extent to which their reality is dictated by those with power.

It is these elements which make Bryson’s comments so critical; it is not just that there is no neutral knowledge, but that “knowledge” is fundamentally controlled and defined by those in power. Thus it is imperative that any algorithm take these biases into account – because they are not just the biases of culture, but rather the biases of power.

Reflections from the Trenches and the Stacks

In my Network Visualization class, we’ve been talking a lot about methodologies for design research studies. On that topic, I recently read an interesting article by Michael Sedlmair, Miriah Meyer, and Tamara Munzner: Design Study Methodology: Reflections from the Trenches and the Stacks. After conducting a literature review to determine best practices, the authors realized that there were no best practices – at least none organized in a coherent, practical-to-follow way.

Thus, the authors aim to develop “holistic methodological approaches for conducting design studies,” drawn from their combined experiences as researchers as well as from their review of the literature in this field. They define the scope of their work very clearly: they aim to develop a practical guide to determine methodological approaches in “problem-driven research,” that is, research where “the goal is to work with real users to solve their real-world problems.”

Their first step in doing so is to define a two-dimensional space in which any proposed research task can be placed. One axis looks at task clarity (from fuzzy to crisp) and the other looks at information location (from head to computer). These strike me as helpful axes for positioning a study and for thinking about what kinds of methodologies are appropriate. If your task is very fuzzy, for example, you may want to start with a study that clarifies the specific tasks which need to be examined. If your task is very crisp and can be articulated computationally, perhaps you don’t need a visualization study at all but can instead do everything algorithmically.

From my own experience of user studies in a marketing context, I found these axes a very helpful framework for thinking about specific needs and outcomes – and therefore appropriate methodologies – of a research study.

The authors then go into their nine-stage framework for practical guidance in conducting design studies and their 32 identified pitfalls which can occur throughout the framework.

The report can be distilled into five steps a researcher should go through in designing, implementing, and sharing a study. These steps should feed into each other and are not necessarily neatly chronological:

  1. Before designing a study, think carefully about what you hope to accomplish and what approach you need. (The clarity/information-location axes are a tool for doing this.)
  2. Think about what data you have and who needs to be part of the conversation.
  3. Design and implement the study.
  4. Reflect on and share your results.
  5. Throughout the process, be sure to think carefully about goals, timelines, and roles.

Their paper, of course, goes into much greater detail about each of these five steps. But overall, I find this a helpful heuristic in thinking about the steps one should go through.

Node Overlap Removal by Growing a Tree

I recently read Lev Nachmanson, Arlind Nocaj, Sergey Bereg, Leishi Zhang, and Alexander Holroyd’s article on “Node Overlap Removal by Growing a Tree,” which presents a really interesting method.

Using a minimum spanning tree to deal with overlapping nodes seems like a really innovative technique. It made me wonder how the authors came up with this approach!

As outlined in the paper, the algorithm begins with a Delaunay triangulation on the node centers – more information on Delaunay triangulations here – but it’s essentially a maximal planar subdivision of the graph: i.e., you draw triangles connecting the centers of all the nodes.

From here, the algorithm finds the minimal spanning tree, where the cost of an edge is defined so that the greater the node overlap, the lower the cost. The minimal spanning tree, then, finds the maximal overlaps in the graph. The algorithm then “grows” the tree: increasing the cost of the tree by lengthening edges. Starting at the root, the lengthening propagates outwards. The algorithm repeats until no overlaps remain on the edges of the triangulation.

Impressively, this algorithm runs in O(|V|) time per iteration, making it fast as well as effective.
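
To make the idea concrete, here is a minimal sketch of one iteration – not the authors’ implementation – assuming circular nodes given as NumPy arrays of centers and radii. The cost function and the growth step are my own simplifications (in particular, each child node is moved individually rather than carrying its whole subtree along):

```python
# A minimal sketch of the overlap-removal idea; a simplification, not the paper's algorithm.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import breadth_first_order, minimum_spanning_tree
from scipy.spatial import Delaunay

def grow_tree_once(centers, radii):
    """One iteration: triangulate the node centers, find an MST that favors
    overlapping edges, then lengthen tree edges outward from the root."""
    n = len(centers)
    tri = Delaunay(centers)

    # Collect the unique edges of the Delaunay triangulation.
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))

    # Edge cost: the greater the overlap, the lower the cost, so the MST
    # gravitates toward the worst overlaps.
    rows, cols, costs = [], [], []
    for a, b in edges:
        dist = np.linalg.norm(centers[a] - centers[b])
        overlap = max(radii[a] + radii[b] - dist, 0.0)
        rows.append(a)
        cols.append(b)
        costs.append(1.0 / (1.0 + overlap))

    mst = minimum_spanning_tree(csr_matrix((costs, (rows, cols)), shape=(n, n)))

    # "Grow" the tree: walk outward from an arbitrary root and lengthen any
    # tree edge whose endpoints still overlap (moving only the child here).
    new_centers = np.array(centers, dtype=float)
    order, parents = breadth_first_order(mst, i_start=0, directed=False)
    for child in order[1:]:
        parent = parents[child]
        vec = new_centers[child] - new_centers[parent]
        dist = float(np.linalg.norm(vec)) or 1e-9
        desired = radii[parent] + radii[child]
        if dist < desired:
            new_centers[child] += vec / dist * (desired - dist)
    return new_centers
```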

Representing the Structure of Data

To be perfectly honest, I had never thought much about graph layout algorithms. You hit a button in Gephi or call a networkx function, some magic happens, and you get a layout. If you don’t like the layout generated, you hit the button again or call a different function.

In one of my classes last year, we generated our own layouts using eigenvectors of the Laplacian. This gave me a better sense of what happens when you use a layout algorithm, but I still tended to think of it as a step which takes place at the end of an assignment; a presentation element which can make your research accessible and look splashy.
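
That Laplacian exercise is easy to reproduce. Here is a minimal sketch (my own helper name, assuming networkx and NumPy and a connected graph) that places each node using the eigenvectors associated with the two smallest nonzero eigenvalues of the graph Laplacian:

```python
# A minimal spectral-layout sketch, not any particular library's implementation.
import networkx as nx
import numpy as np

def spectral_positions(G):
    """Place each node at coordinates taken from the Laplacian eigenvectors
    associated with the two smallest nonzero eigenvalues."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    coords = eigvecs[:, 1:3]                  # skip the trivial constant eigenvector
    return {node: coords[i] for i, node in enumerate(G.nodes())}

pos = spectral_positions(nx.karate_club_graph())
# nx.draw(nx.karate_club_graph(), pos=pos)    # render with the computed layout
```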

In my visualization class yesterday, we had a guest lecture by Daniel Weidele, PhD student at the University of Konstanz and researcher at IBM Watson. He covered the fundamentals of select network layout algorithms but also spoke more broadly about the importance of layout. A network layout is more than a visualization of a collection of data; it is the final stage of a pipeline which attempts to represent some phenomenon. The whole modeling process abstracts a phenomenon into a concept, and then represents that concept as a network layout.

When you’re developing a network model for a phenomenon, you ask questions like “Who is your audience? What are the questions we hope to answer?” Daniel pointed out that you should ask similar questions when evaluating a graph layout; the question isn’t just “Does this look good?” You should ask: “Is this helpful? What does it tell me?”

If there are specific questions you are asking your model, you can use a graph layout to get at the answers. You may, for example, ask: “Can I predict partitioning?”

This is what makes modern algorithms such as stress optimization so powerful – it’s not just that they produce pretty pictures, or even that the layouts appropriately disambiguate nodes, but that they actually represent the structure of the data in a meaningful way.

In his work with IBM Watson, Weidele indicated that a fundamental piece of their algorithm design process is building algorithms based on human perception. For a test layout, try to understand what a human likes about it, try to understand what a human can infer from it – and then try to understand the properties and metrics which made that human’s interpretation possible.

Large Graph Layout Algorithms

Having previously tried to use force-directed layout algorithms on large networks, I was very intrigued by Stefan Hachul and Michael Junger’s article Large Graph-Layout Algorithms at Work: An Experimental Study. In my experience, trying to generate a layout for a large graph results in little more than a hairball and the sense that one really ought to focus on just a small subgraph.

With the recent development of increasingly sophisticated layout algorithms, Hachul and Junger compare the performance of several classical and more recent algorithms. Using a collection of graphs – some relatively easy to lay out and some more challenging – the authors compare runtime and aesthetic output.

All the algorithms strive for the same aesthetic properties: uniformity of edge length, few edge crossings, non-overlapping nodes and edges, and the display of symmetries – which makes aesthetic comparison measurable.

Most of the algorithms performed well on the easier layouts. The only one which didn’t was their benchmark Grid-Variant Algorithm (GVA), a spring-layout which divides the drawing area into a grid and only calculates the repulsive forces acting between nodes that are placed relatively near to each other.
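
As a rough illustration of that grid trick – my own simplification, with arbitrary cell size and force constants, assuming node positions in an (n, 2) NumPy array – the repulsion step might look like this:

```python
# A sketch of grid-bucketed repulsion in the spirit of GVA; not the benchmark's code.
from collections import defaultdict
import numpy as np

def grid_repulsion(positions, cell_size=1.0, strength=0.01):
    """Compute a repulsive displacement per node, considering only nodes that
    fall into the same grid cell or one of its eight neighbors."""
    grid = defaultdict(list)
    for i, p in enumerate(positions):
        grid[(int(p[0] // cell_size), int(p[1] // cell_size))].append(i)

    disp = np.zeros_like(positions, dtype=float)
    for (cx, cy), members in grid.items():
        # Gather nodes in this cell and the eight surrounding cells.
        nearby = [j for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  for j in grid.get((cx + dx, cy + dy), [])]
        for i in members:
            for j in nearby:
                if i == j:
                    continue
                delta = positions[i] - positions[j]
                dist2 = float(delta @ delta) + 1e-9
                disp[i] += strength * delta / dist2   # repulsion that falls off with distance
    return disp
```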

For the harder graphs, they found that the Fast Multipole Multilevel Method (FM3) often produced the best layout, though it is slower than High-Dimensional Embedding (HDE) and the Algebraic Multigrid Method (ACE), which can both produce satisfactory results. Ultimately, Hachul and Junger recommend as practical advice: “first use HDE followed by ACE, since they are the fastest methods…if the drawings are not satisfactory or one supposes that important details of the graph’s structure are hidden, use FM3.”

What’s interesting about this finding is that HDE and ACE both rely solely on linear algebra rather than the physical analogies of force-directed layouts. FM3, on the other hand – notably developed by Hachul and Junger – is force-directed.

In ACE, the algorithm minimizes the quadratic form of the Laplacian (xᵀLx), finding the eigenvectors of L associated with the two smallest nonzero eigenvalues. Using an algebraic multigrid algorithm to calculate the eigenvectors makes the algorithm among the fastest tested for smaller graphs.

By far the fastest algorithm was HDE, which takes a really interesting two-step approach: it first approximates a high-dimensional k-clustering solution and then projects those clusters into 2D space by calculating the eigenvectors of the covariance matrix derived from the clusters. The original paper describing the algorithm is here.
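
A rough sketch of that two-step idea – pivots chosen greedily, graph distances to the pivots as high-dimensional coordinates, then a PCA-style projection via the covariance matrix – might look like the following (the function name and the default of 50 pivots are mine; networkx and NumPy are assumed):

```python
# A rough sketch of the HDE idea (pivot distances + PCA), not the paper's exact algorithm.
import networkx as nx
import numpy as np

def hde_layout(G, k=50):
    """Embed each node by its graph distances to k pivots, then project to 2D
    using the top eigenvectors of the covariance matrix (a PCA-style step)."""
    nodes = list(G.nodes())
    k = min(k, len(nodes))
    pivots = [nodes[0]]                      # arbitrary first pivot
    dists = []
    for _ in range(k):
        d = nx.single_source_shortest_path_length(G, pivots[-1])
        dists.append(np.array([d.get(v, len(nodes)) for v in nodes], dtype=float))
        # Next pivot: the node farthest from all pivots chosen so far (greedy k-centers).
        pivots.append(nodes[int(np.argmax(np.min(dists, axis=0)))])

    X = np.stack(dists, axis=1)              # n x k matrix of pivot distances
    X -= X.mean(axis=0)                      # center each pivot-distance column
    eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(nodes))
    coords = X @ eigvecs[:, -2:]             # project onto the top two directions
    return {v: coords[i] for i, v in enumerate(nodes)}
```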

Finally, the slower but more aesthetically reliable FM3 algorithm improves upon classic force-directed approaches by relying on an important assumption: in large graphs, you don’t necessarily have to see everything. In this algorithm, “subgraphs with a small diameter (called solar systems) are collapsed,” resulting in a final visualization which captures the structure of the large network with the visual ease of a smaller network.

The Effects of Interactive Latency on Exploratory Visual Analysis

In their paper, Zhicheng Liu and Jeffrey Heer explore “The Effects of Interactive Latency on Exploratory Visual Analysis” – that is, how user behavior changes with system response time. As the authors point out, while it seems intuitively ideal to minimize latency, effects vary by domain.

In strategy games, “latency as high as several seconds does not significantly affect user performance,” most likely because tasks which “take place at a larger time scale,” such as “understanding game situation and conceiving strategy” play a more important role in affecting the outcome of a game. In a puzzle game, imposed latency caused players to solve the puzzle in fewer moves – spending more time mentally planning their moves.

These examples illustrate perhaps the most interesting aspect of latency: while it’s often true that time delays will make users bored or frustrated, that is not the only dimension of effect. Latency can alter the way a user thinks about a problem; consciously or unconsciously shifting strategies to whatever seems more time effective.

Liu and Heer focus on latency affecting “knowledge discovery with visualizations,” a largely unexplored area. One thing which makes this domain unique is that “unlike problem-solving tasks or most computer games, exploratory visual analysis is open-ended and does not have a clear goal state.”

The authors design an experimental setup in which participants are asked to explore two different datasets and “report anything they found interesting, including salient patterns in the visualizations, their interpretations, and any hypotheses based on those patterns.” Each participant experienced an additional 500ms latency in one of the datasets. They recorded participant mouse clicks, as well as 9 additional “application events,” such as zoom and color slider, which capture user interaction with the visualization.

The authors also used a “think aloud protocol” to capture participant findings. As the name implies, a think aloud methodology asks users to continually describe what they are thinking as they work. A helpful summary of the benefits and downsides of this methodology can be found here.

Liu and Heer find that latency does have significant effects: latency decreased user activity and coverage of the dataset, while also “reducing rates of observation, generalization and hypothesis.” Additionally, users who experienced the latency earlier in the study had “reduced rates of observation and generalization during subsequent analysis sessions in which full system performance was restored.”

This second finding lines up with earlier research which found that a delay of 300ms in web searches reduced the number of searches a user would perform – a reduction which persisted for days after response times returned to previous levels.

Ultimately, the authors recommend “taking a user-centric approach to system optimization” rather than “uniformly focusing on reducing latency” for each individual visual operation.

Analytic Visualization and the Pictures in Our Head

In 1922, American journalist and political philosopher Walter Lippmann wrote about the “pictures in our head,” arguing that we conceptualize distant lands and experiences beyond our own through a mental image we create. He coined the word “stereotypes” to describe these mental pictures, launching a field of political science focused on how people form, maintain, and change judgements.

While visual analytics is far from the study of stereotypes, in some ways it relies on the same phenomenon. As described in Illuminating the Path, edited by James J. Thomas and Kristin A. Cook, there is an “innate connection among vision, visualization, and our reasoning processes.” Therefore, they argue, the full exercise of reason requires “visual metaphors” which “create visual representations that instantly convey the important content of information.”

F. J. Anscombe’s 1973 article Graphs in Statistical Analysis makes a similar argument. While we are often taught that “performing intricate calculations is virtuous, whereas actually looking at the data is cheating,” Anscombe elegantly illustrates the importance of visual representation through his now-famous Anscombe’s Quartet. These four data sets all have the same statistical measures when considered as a linear regression, but the visual plots quickly illustrate their differences. In some ways, Anscombe’s argument perfectly reinforces Lippmann’s argument from five decades before: it’s not precisely problematic to have a mental image of something; but problems arise when the “picture in your head” does not match the picture in reality.

As Anscombe argues, “in practice, we do not know that the theoretical description [linear regression] is correct, we should generally suspect that it is not, and we cannot therefore heave a sigh of relief when the regression calculation has been made, knowing that statistical justice has been done.”

Running a linear regression is not enough. The results of a linear regression are only meaningful if the data actually fit a linear model. The best and fastest way to check this is to actually observe the data; to visualize it to see if it fits the “picture in your head” of linear regression.
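
To see Anscombe’s point concretely, here is a small demonstration using what I believe are the published values of the first two of his four data sets (the code itself is my own illustration, not Anscombe’s):

```python
# A small demonstration in the spirit of Anscombe's Quartet, using (to the best
# of my knowledge) the published values of his first two data sets.
import numpy as np

x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("set I", y1), ("set II", y2)]:
    slope, intercept = np.polyfit(x, y, 1)        # least-squares linear fit
    r = np.corrcoef(x, y)[0, 1]                   # Pearson correlation
    print(f"{name}: mean(y)={y.mean():.2f}, slope={slope:.2f}, "
          f"intercept={intercept:.2f}, r={r:.2f}")

# Both sets give (nearly) the same regression line and correlation,
# yet a scatterplot shows a linear cloud for set I and a smooth curve for set II.
```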

While Anscombe had to argue for the value of visualizing data in 1973, the practice has now become a robust and growing field. With the rise of data journalism, numerous academic conferences, and a growing focus on visualization as storytelling, even a quiet year for visualization – such as 2014 – was not a “bad year for information visualization” according to Robert Kosara, Senior Research Scientist at Tableau Software.

And Kosara finds even more hope for the future. With emerging technologies and a renewed academic focus on developing theory, Kosara writes, “I think 2015 and beyond will be even better.”

Semantic and Epistemic Networks

I am very interested in modeling a person’s network of ideas. What key concepts or values particularly motivate their thinking and how are those ideas connected?

I see this task as being particularly valuable in understanding and improving civil and political discourse. In this model, dialogue can be seen as an informal and iterative process through which people think about how their own ideas are connected, reason with each other about what ideas should be connected, and ultimately revise (or don’t) their way of thinking by adding or removing idea nodes or connections between them.

This concept of knowledge networks – epistemic networks – has been used by David Williamson Shaffer to measure the development of students’ professional knowledge; e.g., their ability to “think like an engineer” or “think like an urban planner.” More recently, Peter Levine has advanced the use of epistemic networks in “moral mapping” – modeling a person’s values and ways of thinking.

This work has made valuable progress, but a critical question remains: just what is the best way to model a person’s epistemic network? Is there an unbiased way to determine the most critical nodes? Must we rely on a given person’s active reasoning to determine the links? In the case of multi-person exchanges, what determines whether two concepts are the “same”? Is semantic similarity sufficient, or must individuals actively discuss and determine that they do each indeed mean the same thing? And if adjustments are made to a visualized epistemic network following a discussion, can we distinguish genuine changes in view from corrections of accidental omissions?

Questions and challenges abound.

But these problems aren’t necessarily insurmountable.

As a starting place, it is helpful to think about semantic networks. In the 1950s, Richard H. Richens first proposed semantic networks as a tool to aid in machine translation.

“I refer now to the construction of an interlingua in which all the structural peculiarities of the base language are removed and we are left with what I shall call a ‘semantic net’ of ‘naked ideas,'” he wrote. “The elements represent things, qualities or relations…A bond points from a thing to its qualities or relations, or from a quality or relation to a further qualification.”

Thus, from their earliest days, semantic networks were seen as somewhat synonymous with epistemic networks: words presumably represent ideas, so it logically follows that a network of words is a network of ideas.

This may well be true, but I find it helpful to separate the two ideas. A semantic network is observed; an epistemic network is inferred.

That is, through any number of advanced Natural Language Processing algorithms, it is essentially possible to feed text into a computer and have it return a network of the words which are connected in that text.

You can imagine some simple algorithms for accomplishing this: perhaps two words are connected if they co-occur in the same sentence or paragraph. Removing stop words prevents your retrieved network from being over-connected by instances of “the” or “a.” Part-of-speech tagging – a relatively simple task thanks to huge databanks of tagged corpora – can bring an additional level of sophistication: perhaps we want to know which subjects are connected to which objects. And there are even cooler techniques relying on probabilistic models or on projections of the corpus into k-space, where k is the number of unique words.
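
A toy version of that simplest approach – sentence-level co-occurrence with a small stop-word list – might look like the sketch below (the stop-word list, tokenizer, and function name are all placeholders of mine):

```python
# A toy sketch of the co-occurrence approach: words are linked when they appear
# in the same sentence, with common stop words removed. Deliberately simplistic.
import itertools
import re
import networkx as nx

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def semantic_network(text):
    G = nx.Graph()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = [w for w in re.findall(r"[a-z']+", sentence) if w not in STOP_WORDS]
        # Connect every pair of remaining words that co-occur in this sentence.
        for w1, w2 in itertools.combinations(set(words), 2):
            weight = G[w1][w2]["weight"] + 1 if G.has_edge(w1, w2) else 1
            G.add_edge(w1, w2, weight=weight)
    return G

G = semantic_network("Power shapes knowledge. Knowledge is never neutral. Power selects the facts.")
print(sorted(G.edges(data=True)))
```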

These models typically assume some type of unobserved data – e.g., we observe a list of words and use it to discover the unobserved connections – but colloquially speaking, semantic networks are observed in the sense that they can be drawn directly from a text. They exist in some indirect but concrete way.

And while it seems fair to assume that words do indeed have meaning, it still takes a bit of a leap to take a semantic network as synonymous with an epistemic network.

Consider an example: if we were to take some great novel and cleverly reduce it to a semantic network, would the resulting network illustrate exactly what the author was intending?

The fact that it’s even worth asking that question to me indicates that the two are not intrinsically one and the same.

Arguably, this is fundamentally a matter of degree. It seems reasonable to say that, unless our algorithm was terribly off, the semantic network can tell us something interesting and worthwhile about the studied text. Yet it seems like a stretch to claim that such a simplistic representation could accurately and fully capture the depth of concepts and connections an author was seeking to convey.

If that were the case, we could study networks instead of reading books and – notably – everyone would agree on their meaning.

A semantic network, then, can be better considered as a representation of an epistemic network. It takes reason and judgement to interpret a semantic network epistemically.

Perhaps it is sufficient to be aware of the gap between these two – to know that interpreting a semantic network epistemically necessarily means introducing bias and methodological subjectivity.

But I wonder if there’s something better we can do to model this distinction – some better way to capture the complex, dynamic, and possibly conflicting essence of a more accurately epistemic network.

Predictive Accuracy and Good Dialogue

While I’m relatively new to the computer science domain, one thing that’s notable is the field’s obsession with predictive accuracy. Particularly within natural language processing, the primary objective of most scholars – or, perhaps, more exactly, the requirement for being published – seems to be producing methods which edge past the accuracy of existing approaches.

I’m not really in a position to comment on the benefit of such a driver, but as an outsider, this focus is striking. I have to imagine there are great, historical reasons why the field evolved this way; that the mentality of constantly pushing towards incremental improvement has been an important factor in the great breakthroughs of computer science.

Yet, I can’t help but feel that in this quest for computational improvement, something important is being left behind.

There are compelling arguments that the social sciences have done poorly to abandon their humanistic roots in favor of emulating the fashionable fields of science; that in grasping for predictive measures, social science has failed its duty towards the most critical concerns of what is right and good. Perhaps, after all, questions of such import should not be solely the domain of philosophy departments.

It seems a similar objection could be raised towards computer science; and no doubt someone I’m not aware of has raised these concerns. Such an approach would go beyond the philosophical literature on moral issues in computer science, probing more deeply into questions of meaning, interpretation, and structure.

Wittgenstein questioned fundamentally what it means for two people to communicate. Austin argued that words themselves can be actions. And there is, of course, a long tradition in many cultures of words having power.

None of these topics, while intrinsic to natural language, seem to be deeply embraced by current approaches to natural language processing. Much better to show a two point increase in predictive accuracy.

And to a certain extent, this dismissal is fair. While I myself have a fondness for Wittgenstein, I imagine computer science wouldn’t advance far if, instead of developing algorithms, practitioners spent all their time wondering – if you tell me you are in pain, do I understand you because I, too, have had my own experiences of pain? How can I know what ‘pain’ means to you? 

Yet, while Wittgenstein’s Philosophical Investigations may be too far afield, it does highlight some practical issues. Perhaps metaphysical concerns about what it means to communicate can be safely disregarded, but this still leaves questions about what it looks like to communicate. That is, it seems reasonable to assume that miscommunication does happen, but what happens to dialogue plagued by such problems? What does it look like when people talk past each other or when they recognize a miscommunication and take steps to resolve it? Can an algorithm distinguish and properly parse these differences? Remembering, of course, that a human, perhaps, cannot.

In a recent review of literature around the natural language processing task of argument mining, I was struck by the value of a 1987 paper focused on understanding the structure of a single speech-act. It evoked no Wittgenstein-level of abstraction, and yet brought an important element of theory to the computational task of parsing a single argument.

I couldn’t find – and perhaps I missed it – a similar paper exploring the complex interactions of dialogue. Of course, there is much work done in this area among deliberation scholars – but this effort is not easily translated into the mechanized logic of algorithms.

In short, there seems to be a divide – a common one, I’m afraid, in the academy. In one field, theorists ask, what does it mean to deliberate? What makes good deliberation? And in another they ask, what algorithms can recognize arguments? What algorithms accurately predict stance? 

And, while both pursuing important work, the fields fail to learn from each other.

Comparing Texts with Log-Likelihood Word Frequencies

One way to compare the similarity of documents is to examine the comparative log-likelihood of word frequencies.

This can be done with any two documents, but it is a particularly interesting way to compare the similarity of a smaller document with the larger body of text it is drawn from. For example, with access to the appropriate data, you may want to know how similar Shakespeare was to his contemporaries. The Bard is commonly credited with coining a large number of words, but it’s unclear exactly how true this is – after all, the work of many of his contemporaries has been lost.

But, imagine you ran across a treasure trove of miscellaneous documents from 1600 and you wanted to compare them to Shakespeare’s plays. You could do this by calculating the expected frequency of a given word and comparing this to the observed frequency. First, you can calculate the expected frequency as:

Eᵢ = Nᵢ × (O₁ + O₂) / (N₁ + N₂)

Where Nᵢ is the total number of words in document i and Oᵢ is the observed frequency of a given word in document i. That is, the expected frequency of a word is: (number of words in your sub-corpus) × (the word’s combined observed frequency in both corpora) / (total number of words in both corpora).

Then, you can use this expectation to determine a word’s log-likelihood given the larger corpus as:

LL = 2 × (O₁ ln(O₁ / E₁) + O₂ ln(O₂ / E₂))

Sorting words by their log-likelihood, you can then see the most unlikely – i.e., the most distinctive – words in your smaller corpus.
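
A compact sketch of the whole calculation (the corpora here are toy placeholders, and the tokenization is deliberately naive):

```python
# A small illustration of the log-likelihood comparison described above,
# using made-up corpora; the example texts are placeholders only.
from collections import Counter
import math

def log_likelihood(corpus_a, corpus_b):
    """Return words of corpus_a sorted by log-likelihood against corpus_b."""
    freq_a, freq_b = Counter(corpus_a), Counter(corpus_b)
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    scores = {}
    for word in freq_a:
        o_a, o_b = freq_a[word], freq_b.get(word, 0)
        e_a = n_a * (o_a + o_b) / (n_a + n_b)   # expected frequency in corpus_a
        e_b = n_b * (o_a + o_b) / (n_a + n_b)   # expected frequency in corpus_b
        ll = 2 * o_a * math.log(o_a / e_a)
        if o_b > 0:                              # a zero count contributes nothing
            ll += 2 * o_b * math.log(o_b / e_b)
        scores[word] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

shakespeare = "to be or not to be that is the question".split()
contemporaries = "to live or to die that was never the question at hand".split()
print(log_likelihood(shakespeare, contemporaries)[:5])
```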
