Multivariate Network Exploration and Presentation

In “Multivariate Network Exploration and Presentation,” authors Stef van den Elzen and Jarke J. van Wijk introduce an approach they call “Detail to Overview via Selections and Aggregations,” or DOSA. I was going to make fun of them for naming their approach after a delicious south Indian dish, but since they comment that their name “resonates with our aim to combine existing ingredients into a tasteful result,” I’ll have to just leave it there.

The DOSA approach – and now I am hungry – aims to allow a user to explore the complex interplay between network topology and node attributes. For example, in company email data, you may wish to simultaneously examine assortativity by gender and department over time. That is, you may need to consider both structure and multivariate data.

This is a non-trivial problem, and I particularly appreciated van den Elzen and van Wijk’s practical framing of why this is a problem:

“Multivariate networks are commonly visualized using node-link diagrams for structural analysis. However, node-link diagrams do not scale to large numbers of nodes and links and users regularly end up with hairball-like visualizations. The multivariate data associated with the nodes and links are encoded using visual variables like color, size, shape or small visualization glyphs. From the hairball-like visualizations no network exploration or analysis is possible and no insights are gained or even worse, false conclusions are drawn due to clutter and overdraw.”

YES. From my own experience, I can attest that this is a problem.

So what do we do about it?

The authors suggest a multi-pronged approach which allows non-expert users to select nodes and edges of interest, to simultaneously see a detail view and an infographic-like overview, and to examine the aggregated attributes of a selection.

Overall, this approach looks really cool and very helpful. (The paper did win the “best paper” award at the IEEE Information Visualization 2014 Conference, so perhaps that shouldn’t be that surprising.) I was a little disappointed that I couldn’t find the GUI implementation of this approach online, though, which makes it a little hard to judge how useful the tool really is.

From their screenshots and online video, however, I find that while this is a really valiant effort to tackle a difficult problem, there is still more work to do in this area. The challenge with visualizing complex networks is indeed that they are complex, and while DOSA gives a user some control over how to filter and interact with this complexity, there is still a whole lot going on.

While I appreciate the inclusion of examples and use cases, I would have also liked to see a user design study evaluating how well their tool met their goal of providing a navigation and exploration tool for non-experts. I also think that the issues of scalability with respect to attributes and selection that they raise in the limitations section are important topics which, while reasonably beyond the scope of this paper, ought to be tackled in future work.


Representing the Structure of Data

To be perfectly honest, I had never thought much about graph layout algorithms. You hit a button in Gephi or call a networkx function, some magic happens, and you get a layout. If you don’t like the layout generated, you hit the button again or call a different function.

In one of my classes last year, we generated our own layouts using eigenvectors of the Laplacian. This gave me a better sense of what happens when you use a layout algorithm, but I still tended to think of it as a step which takes place at the end of an assignment; a presentation element which can make your research accessible and look splashy.
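That exercise can be sketched in a few lines. The graph below is an arbitrary toy example, and numpy is assumed; the idea is to use the eigenvectors of the Laplacian corresponding to the second- and third-smallest eigenvalues as x and y coordinates:

```python
import numpy as np

# Adjacency matrix of a small toy graph (4 nodes)
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
], dtype=float)

L = np.diag(A.sum(axis=1)) - A   # Laplacian: degree matrix minus adjacency

# eigh returns eigenvalues in ascending order; for a connected graph the
# smallest is 0 with a constant eigenvector, so we skip it
eigenvalues, eigenvectors = np.linalg.eigh(L)
coords = eigenvectors[:, 1:3]    # x, y positions from the next two eigenvectors

for node, (x, y) in enumerate(coords):
    print(f"node {node}: ({x:.3f}, {y:.3f})")
```

This spectral layout tends to place weakly connected nodes far apart, which is one reason it can expose cluster structure.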

In my visualization class yesterday, we had a guest lecture by Daniel Weidele, PhD student at the University of Konstanz and researcher at IBM Watson. He covered the fundamentals of select network layout algorithms but also spoke more broadly about the importance of layout. A network layout is more than a visualization of a collection of data; it is the final stage of a pipeline which attempts to represent some phenomenon. The whole modeling process abstracts a phenomenon into a concept, and then represents that concept as a network layout.

When you’re developing a network model for a phenomenon, you ask questions like “who is your audience? What are the questions we hope to answer?” Daniel pointed out that you should ask similar questions when evaluating a graph layout; the question isn’t just “does this look good?” You should ask: “Is this helpful? What does it tell me?”

If there are specific questions you are asking your model, you can use a graph layout to get at the answers. You may, for example, ask: “Can I predict partitioning?”

This is what makes modern algorithms such as stress optimization so powerful – it’s not just that they produce pretty pictures, or even that their layouts appropriately disambiguate nodes, but that they actually represent the structure of the data in a meaningful way.

In his work with IBM Watson, Weidele indicated that a fundamental piece of their algorithm design process is building algorithms based on human perception. For a test layout, try to understand what a human likes about it, try to understand what a human can infer from it – and then try to understand the properties and metrics which made that human’s interpretation possible.


Adventures in Network Science

Every time someone asks me how school is going, I have the tendency to reply with an enthusiastic but nondescript, “AWESOME!” Or, as one of my classmates has taken to saying, “WHAT A TIME TO BE ALIVE!”

Truly, it is a privilege to be able to experience such awe.

As it turns out, however, these superlatives aren’t particularly informative. And while I’ve struggled to express the reasons for my raw enthusiasm in more coherent terms, I will attempt to do so here.

First, my selected field of study, network science, is uniquely interdisciplinary. I can practically feel you rolling your eyes at that tiredly clichéd turn of phrase – yes, yes, every program in higher education is uniquely interdisciplinary these days – but, please, bear with me.

I work on a floor with physicists, social scientists, and computer scientists; with people who study group dynamics, disease spreading, communication, machine learning, social structures, neuroscience, and numerous other things I haven’t even discovered yet. Every single person is doing something interesting and cool.

I like to joke that the only thing on my to-do list is to rapidly acquire all of human knowledge.

In the past year, I have taken classes in physics, mathematics, computer science, and social science. I have read books on philosophy, linguistics, social theory, and computational complexity – as well as, of course, some good fiction.

I can now trade nerdy jokes with people from any discipline.

And I’ve been glad to develop this broad and deep knowledge base. In my own work, I am interested in the role of people in their communities. More specifically, I’m looking at deliberation, opinion change, and collective action. That is – we each are a part of many communities, and our interactions with other people in those communities fundamentally shape the policies, institutions, and personalities of those communities.

These topics have been tackled in numerous disciplines, but in disparate efforts which have not sufficiently learned from each other’s progress. Deliberative theory has thought deeply about what good political dialogue looks like; behavioral economics has studied how individual choices result in larger implications and institutions; and computer science has learned how to identify startling patterns in complex datasets. But only network science brings all these elements together; only network science draws on the full richness of this knowledge base to look more deeply at interaction, connection, dynamics, and complexity.

But perhaps the most exciting thing about this program is that it truly allows me to find my own path. I’m not training to replicate some remarkable scholar who already exists – I am learning from many brilliant scholars what valuable contributions I will uniquely be able to make.

Because as much as I have to learn from everyone I meet – we all have something to learn from each other.

There are other programs in data science or network analysis, but this is the only place in the world where I can truly explore the breadth of network science and discover what kind of scholar I want to be.


I joke about trying to acquire all of human knowledge because, of course, I cannot learn everything – no one person can. But we can each cultivate our own rich understanding of the puzzle. And through the shared language of network science, we can share our knowledge, work together, and continue to chip away at the great mysteries of the universe.


Semantic and Epistemic Networks

I am very interested in modeling a person’s network of ideas. What key concepts or values particularly motivate their thinking and how are those ideas connected?

I see this task as being particularly valuable in understanding and improving civil and political discourse. In this model, dialogue can be seen as an informal and iterative process through which people think about how their own ideas are connected, reason with each other about what ideas should be connected, and ultimately revise (or don’t) their way of thinking by adding or removing idea nodes or connections between them.

This concept of knowledge networks – epistemic networks – has been used by David Williamson Shaffer to measure the development of students’ professional knowledge; e.g., their ability to “think like an engineer” or “think like an urban planner.” More recently, Peter Levine has advanced the use of epistemic networks in “moral mapping” – modeling a person’s values and ways of thinking.

This work has made valuable progress, but a critical question remains: just what is the best way to model a person’s epistemic network? Is there an unbiased way to determine the most critical nodes? Must we rely on a given person’s active reasoning to determine the links? In the case of multi-person exchanges, what determines if two concepts are the “same”? Is semantic similarity sufficient, or must individuals actively discuss and determine that they do each indeed mean the same thing? If you make adjustments to a visualized epistemic network following a discussion, can we distinguish genuine changes in view from corrections due to accidental omission?

Questions and challenges abound.

But these problems aren’t necessarily insurmountable.

As a starting place, it is helpful to think about semantic networks. In the 1950s, Richard H. Richens originally proposed semantic networks as a tool to aid in machine translation.

“I refer now to the construction of an interlingua in which all the structural peculiarities of the base language are removed and we are left with what I shall call a ‘semantic net’ of ‘naked ideas,’” he wrote. “The elements represent things, qualities or relations… A bond points from a thing to its qualities or relations, or from a quality or relation to a further qualification.”

Thus, from their earliest days, semantic networks were seen as somewhat synonymous with epistemic networks: words presumably represent ideas, so it logically follows that a network of words is a network of ideas.

This may well be true, but I find it helpful to separate the two ideas. A semantic network is observed; an epistemic network is inferred.

That is, through any number of advanced Natural Language Processing algorithms, it is essentially possible to feed text into a computer and have it return a network of words which are connected in that text.

You can imagine some simple algorithms for accomplishing this: perhaps two words are connected if they co-occur in the same sentence or paragraph. Removing stop words prevents your retrieved network from being over-connected by instances of “the” or “a.” Part-of-speech tagging – a relatively simple task thanks to huge databanks of tagged corpora – can bring an additional level of sophistication. Perhaps we want to know which subjects are connected to which objects. And there are even cooler techniques relying on probabilistic models or projections of the corpus into k-space, where k is the number of unique words.
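A minimal sketch of the simplest of these ideas – sentence-level co-occurrence with stop words removed – in plain Python. The tokenization and stop-word list here are deliberately crude:

```python
from collections import defaultdict
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def semantic_network(text):
    """Link two words whenever they co-occur in the same sentence."""
    edges = defaultdict(int)  # (word, word) -> co-occurrence count
    for sentence in text.lower().split("."):
        words = sorted({w.strip(",;") for w in sentence.split()} - STOP_WORDS)
        for u, v in combinations(words, 2):
            edges[(u, v)] += 1
    return edges

net = semantic_network(
    "The network is a model. The model represents the phenomenon."
)
print(dict(net))
```

Even this toy version shows the key property: “model” ends up as a hub connecting “network,” “represents,” and “phenomenon,” which is structure you would not see in a raw word count.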

These models typically assume some type of unobserved data – e.g., we observe a list of words and use that to discover the unobserved connections – but colloquially speaking, semantic networks are observed in the sense that they can be drawn out directly from a text. They exist in some indirect but concrete way.

And while it seems fair to assume that words do indeed have meaning, it still takes a bit of a leap to take a semantic network as synonymous with an epistemic network.

Consider an example: if we were to take some great novel and cleverly reduce it to a semantic network, would the resulting network illustrate exactly what the author was intending?

The fact that it’s even worth asking that question indicates, to me, that the two are not intrinsically one and the same.

Arguably, this is fundamentally a matter of degrees. It seems reasonable to say that, unless our algorithm was terribly off, the semantic network can tell us something interesting and worthwhile about the studied text. Yet it seems like a stretch to claim that such a simplistic representation could accurately and fully capture the depth of concepts and connections an author was seeking to convey.

If that were the case, we could study networks instead of reading books and – notably – everyone would agree on their meaning.

A semantic network, then, can be better considered as a representation of an epistemic network. It takes reason and judgement to interpret a semantic network epistemically.

Perhaps it is sufficient to be aware of the gap between these two – to know that interpreting a semantic network epistemically necessarily means introducing bias and methodological subjectivity.

But I wonder if there’s something better we can do to model this distinction – some better way to capture the complex, dynamic, and possibly conflicting essence of a more accurately epistemic network.


Networks of Connected Concepts

Yesterday, I ran across a fascinating 1993 paper by sociologist Kathleen Carley, Coding Choices for Textual Analysis: A Comparison of Content Analysis and Map Analysis.

Using the now antiquated term “map analysis” – what I would call semantic network analysis today – Carley explains:

An important class of methods that allows the researcher to address textual meaning is map analysis. Where content analysis typically focuses exclusively on concepts, map analysis focuses on concepts and the relationships between them and hence on the web of meaning contained within the text. While no term has yet to emerge as canonical, within this paper the term map analysis will be used to refer to a broad class of procedures in which the focus is on networks consisting of connected concepts rather than counts of concepts.

This idea is reminiscent of the work of Peter Levine and others (including myself) on moral mapping – representing an individual’s moral world view through a thoughtfully constructed network of ideas and values.

Of course, a range of methodological challenges are immediately raised in graphing a moral network – what do you include? What constitutes a link? Do links have strength or directionality? Trying to compare two or more people’s networks raises even more challenges.

While Carley is looking more broadly than moral networks, her work similarly aims to extract meaning, concepts, and connections from a text – and faces similar methodological challenges:

By taking a map-analytic approach, the researcher has chosen to focus on situated concepts. This choice increases the complexity of the coding and analysis process, and places the researcher in the position where a number of additional choices must be made regarding how to code the relationship between concepts.

On its face, these challenges seem like they may be insurmountable – could complex concepts such as morality ever be coded and analyzed in such a way as to be broadly interpretable while maintaining the depth of their meaning?

This conundrum is at the heart of the philosophical work of Ludwig Wittgenstein, and is far from being resolved philosophically or empirically.

Carley is hardly alone in not having a perfect resolution to this dilemma, but she does offer an interesting insight in contemplating it:

…by focusing on the structure of relationships between concepts, the attention of the researcher is directed towards thinking about “what am I really assuming in choosing this coding scheme?” Consequently, researchers may be more aware of the role that their assumptions are playing in the analysis and the extent to which they want to, and do, rely on social knowledge.

A network approach to these abstract concepts may indeed be inextricably biased – but, then again, all tools of measurement are. The benefit, then, of undertaking the complex work of coding relationships as well as concepts is that the researcher is more acutely aware of the bias.


The Benefits of Inefficiency

Political scientist Markus Prior has long argued that inefficiency benefits democracy. In much of his work studying the effects of media on political knowledge and participation, Prior has found that an inefficient media environment – in which people have little choice over their entertainment options – is actually conducive to improving political knowledge.

In Efficient Choice, Inefficient Democracy?, Prior explains: “Yet while a sizable segment of the population watches television primarily to be entertained, and not to obtain political information, this does not necessarily imply that this segment is not also exposed to news. When only broadcast television is available, the audience is captive and, to a certain extent, watches whatever is offered on the few television channels. Audience research has confirmed a two-stage model according to which people first decide to watch television and then pick the available program they like best.”

That is, when few media choices are available, people tend to tune in for entertainment purposes. If news is the only thing that’s on, they’ll watch that over turning the TV off.

In a highly efficient media environment, however, people can navigate directly to their program of choice. Some people may choose informational sources for entertainment, but the majority of people will be able to avoid exposure to any news, seeing only the specific programming they are interested in. (I should mention here that much of Prior’s data is drawn from the U.S. context.)

As Prior further outlines in Post-Broadcast Democracy, an inefficient media environment therefore promotes what Prior calls “by-product learning”: people learn about current events whether they want to or not. Like the pop song you learn at the grocery store, inefficient environments lead to exposure to topics you wouldn’t explore yourself.

Interestingly, it seems that a similar effect may take place in the context of group problem solving.

In a problem-solving setting, efficiency can be considered as a measure of communication quality. In the most efficient setting, all members of a group would share the exact same knowledge; in an inefficient setting, group members wouldn’t communicate at all.

Now imagine this group is confronted with a problem and works together to find the best solution they can.

As outlined by David Lazer and Allan Friedman, this context can be described as a trade-off between exploration and exploitation: if someone in your group has a solution that seems pretty good, your group may want to exploit that solution in order to reap the benefits it provides. If everyone’s solution seems pretty mediocre, you may want to explore and look for additional options.

Since you have neither infinite time nor infinite resources, you can’t do both. You have to choose which option will ultimately result in the best solution.

The challenge here is that the globally optimal solution is hard to identify. In a bumpy solution landscape, a good solution may simply point to a local optimum, not to the best solution you can find.

This raises the question: is it better to have an efficient network where members of a group can easily share and disperse information, or is it better to have an inefficient network where information sharing is hard and information dispersal is slow?

Interestingly, this is an open research question which has seen mixed results.

Intuition seems to indicate that efficient information sharing would be good – allowing a group to seamlessly coordinate. But, there’s also some indication that inefficiency is better – encouraging more exploration and therefore a more diverse set of possible solutions. The risk is that a group with an efficient communications network will actually converge on a local optimum – taking the first good option available, rather than taking the time to fully explore for the global optimum.
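To make the trade-off concrete, here is a toy simulation – a loose sketch, not Lazer and Friedman’s actual model – in which agents search a random landscape while connected in either a fully connected (efficient) or ring (inefficient) network. The landscape, exploration rate, and network size are all illustrative assumptions:

```python
import random

def simulate(neighbors, steps=30, explore=0.2, seed=1):
    """Agents either copy the best neighboring solution (exploitation),
    occasionally try a random one (exploration), or keep their own."""
    rng = random.Random(seed)
    landscape = [rng.random() for _ in range(100)]   # score of each solution
    n = len(neighbors)
    solutions = [rng.randrange(100) for _ in range(n)]
    for _ in range(steps):
        new = []
        for i in range(n):
            best = max(neighbors[i], key=lambda j: landscape[solutions[j]])
            if landscape[solutions[best]] > landscape[solutions[i]]:
                new.append(solutions[best])          # exploit a neighbor's find
            elif rng.random() < explore:
                new.append(rng.randrange(100))       # explore a random solution
            else:
                new.append(solutions[i])             # keep the current solution
        solutions = new
    return sum(landscape[s] for s in solutions) / n  # mean solution quality

n = 10
full = [[j for j in range(n) if j != i] for i in range(n)]  # efficient network
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]       # inefficient network
print(f"fully connected: {simulate(full):.3f}")
print(f"ring:            {simulate(ring):.3f}")
```

With a single seed either network can come out ahead; studies in this area average over many landscapes and runs, which is exactly where the mixed results come from.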


The Nature of Technology

I recently finished reading W. Brian Arthur’s The Nature of Technology, which explores what technology is and how it evolves.

Evolves is an intentional word here; the concept is at the core of Arthur’s argument. Technology is not a passive thing which only grows in spurts of genius inspiration – it is a complex system which is continuously growing, changing, and – indeed – evolving.

Arthur writes that he means the term evolution literally – technology builds itself from itself, growing and improving through the novel combination of existing tools – but he is clear that the process of evolution does not imply that technology is alive.

“…To say that technology creates itself does not imply it has any consciousness, or that it uses humans somehow in some sinister way for its own purposes,” he writes. “The collective of technology builds itself from itself with the agency of human inventors and developers much as a coral reef builds itself from the activities of small organisms.”

Borrowing from Humberto Maturana and Francisco Varela, Arthur describes this process as autopoiesis – self-creation.

This is a bold claim.

To consider technology as self-creating changes our relationship with the phenomenon. It is not some disparate set of tools which occasionally benefits from the contributions of our best thinkers; it is a growing body of interconnected skills and knowledge which can be infinitely combined and recombined into increasingly complex approaches.

The idea may also be surprising. An iPhone 6 may clearly have evolved from an earlier model, which in turn may owe its heritage to previous computer technology – but what relationship does a modern cell phone have with our earliest tools of rocks and fire?

In Arthur’s reckoning, with a complete inventory of technological innovations one could fully reconstruct a technological evolutionary tree – showing just how each innovation emerged by connecting its predecessors.

This concept may seem odd, but Arthur makes a compelling case for it – outlining several examples of engineering problem solving which essentially boil down to applying existing solutions to novel problems.

Furthermore, Arthur explains that this technological innovation doesn’t occur in a vacuum – not only does it require the constant input of human agency, it grows from humanity’s continual “capturing” of physical phenomena.

“At the very start of technological time, we directly picked up and used phenomena: the heat of fire, the sharpness of flaked obsidian, the momentum of a stone in motion. All that we have achieved since comes from harnessing these and other phenomena, and combining the pieces that result,” Arthur argues.

Through this process of exploring our environment and iteratively using the tools we discover to further explore our environment, technology evolves and builds on itself.

Arthur concludes that “this account of the self-creation of technology should give us a different feeling about technology.” He explains:

“We begin to get a feeling of ancestry, of a vast body of things that give rise to things, of things that add to the collection and disappear from it. The process by which this happens is neither uniform nor smooth; it shows bursts of accretion and avalanches of replacement. It continually explores into the unknown, continually uncovers novel phenomena, continually creates novelty. And it is organic: the new layers form on top of the old, and creations and replacements overlap in time. In its collective sense, technology is not merely a catalog of individual parts. It is a metabolic chemistry, an almost limitless collective of entities that interact to produce new entities – and further needs. And we should not forget that needs drive the evolution of technology every bit as much as the possibilities for fresh combination and the unearthing of phenomena. Without the presence of unmet needs, nothing novel would appear in technology.”

In the end, I suppose we should not be surprised by the idea of technology’s evolution. It is a human-generated system; as complex and dynamic as any social system. It is vast, ever-changing, and at times unpredictable – but ultimately, at its core, technology is very human.


Initial Questions about Online Deliberation

While last semester I looked at gender representation in comic books by analyzing a network of superheroes, this semester I’m taking my research down a different path.

Through my Ph.D. I ultimately hope to develop quantitative methods for describing and measuring the quality of political and civic deliberation.

To that end, this semester, I’ll be looking at data from a popular political blog aimed at providing a space for political conversation. I have scraped this website’s entire corpus of nearly 30,000 posts from 2004 through the present, including posts and comments from 4,435 unique users.

From this, I plan to build a network of interactions – who comments on whose posts? Who recommends whose posts? Are there sub-communities within this larger online community?

Additionally, as I build my skill set in Natural Language Processing, I hope to do some basic text analysis on the content of posts and comments, looking for variation in word choice between communities as well as comparing the content of different types of posts – for example, are there keywords that would predict how many comments a post will get?

No doubt more questions will come up along the way, but as I dive into this data, these are some of the questions I’m thinking about.


Gender Representation in Comic Books

For one of my classes, I have spent this semester cleaning and analyzing data from the Grand Comics Database (GCD) with an eye towards assessing gender representation in English-language superhero comics.

Starting with GCD’s records of over 1.5 million comics from around the world, I identified the 66,000 individual comic book titles that fit my criteria. For each character appearing in those comics, I hand-coded the gender of those with a self-identified male or female gender.

From this, I built a bipartite network – comic books on one side and comic book characters on the other. A comic and a character are linked if a character appeared in a comic. The resulting network has around 66,000 comic titles, 10,000 characters, and a total of nearly 300,000 links between the two sides.

From the bipartite network, I examined the projections onto each type of node. For example, the visualization below contains only characters, linking two characters if they appeared in the same issue. Nodes here are colored by publisher:

[Figure: character co-appearance network, with nodes colored by publisher]
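The projection step behind that view is mechanically simple. A sketch with made-up appearance records (the titles and character names are placeholders):

```python
from collections import defaultdict
from itertools import combinations

# Illustrative appearance records: (comic title, character)
appearances = [
    ("Comic A", "Hero 1"), ("Comic A", "Hero 2"),
    ("Comic B", "Hero 2"), ("Comic B", "Hero 3"),
    ("Comic C", "Hero 1"), ("Comic C", "Hero 3"),
]

# Group characters by the comic they appear in
casts = defaultdict(set)
for comic, character in appearances:
    casts[comic].add(character)

# Project onto characters: link two characters who share a comic,
# weighting each edge by the number of shared issues
projection = defaultdict(int)
for cast in casts.values():
    for u, v in combinations(sorted(cast), 2):
        projection[(u, v)] += 1

print(dict(projection))
```

The same loop run over the comic-side grouping gives the other projection, linking two comics that share a character.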

The character network is heavily biased towards men; nearly 75% of the characters are male. Since the dataset includes comics from the 1930s to the present, this imbalance can be better assessed over time. Using the publication year of each comic, we can look at what percentage of all characters in a given year were male or female:

[Figure: percentage of male and female characters by publication year]

While comics were very gender-skewed through the 1970s, in recent years the balance has gotten a little better, though male characters still dominate. If anyone knows what spiked the number of female characters in the early 2000s, please let me know. I looked at a couple of things, but couldn’t identify the driving force behind that shift. It’s possible it just represents some inaccuracies in the original data set.
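The per-year percentages are a straightforward aggregation over (year, gender) records. A sketch with hypothetical data:

```python
from collections import defaultdict

# Hypothetical rows: (publication year, character gender)
records = [
    (1940, "M"), (1940, "M"), (1940, "F"),
    (2000, "M"), (2000, "F"), (2000, "F"), (2000, "F"),
]

counts = defaultdict(lambda: {"M": 0, "F": 0})
for year, gender in records:
    counts[year][gender] += 1

for year in sorted(counts):
    total = counts[year]["M"] + counts[year]["F"]
    print(f"{year}: {100 * counts[year]['F'] / total:.0f}% female")
```

In the real dataset each character appearance carries the comic’s publication year, so the same tally runs over a few hundred thousand rows instead of seven.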

If you prefer, we can also look at the various eras of comics books to see how gender representation changed over time:

[Figure: gender representation across comic book eras]

I was particularly interested in applying a rudimentary version of the Bechdel test to this dataset. Unfortunately, I didn’t have the data to apply the full test, which asks whether two women (i) appear in the same scene, and (ii) talk to each other about (iii) something other than a man. But I could look at raw character counts for the titles in my dataset:

[Figure: raw character counts for titles in the dataset]

I then looked at additional attributes of those titles which pass the Bechdel test. For example, when were they published? Below are two different ways of bucketing the publication years: first by accepted comic book eras and second by uniform time blocks. Both approaches show that having two female characters in a comic book started out rare but has become more common, coinciding roughly with the overall growth of female representation in comic books.

[Figure: Bechdel-passing titles bucketed by comic book era and by uniform time blocks]

Finally, I could also look at the publishers of these comic books. My own biases gave me a suspicion of what I might find, but rationally I wasn’t at all sure what to expect. As you can see, Marvel published an overwhelming number of the “Bechdel-passed” comics in my dataset.

[Figure: Bechdel-passing titles by publisher]

To be fair, this graphic doesn’t account for anything more general about Marvel’s publishing habits. Marvel is known for its ensemble casts, for example, so perhaps they have more comics with two women simply because they have more characters in their comics.

This turns out to be partly true, but not quite enough to account for Marvel’s dominance in this area. About half of all comics with more than two characters of any gender are published by Marvel, while DC contributes about a third.


Proprietary Platform Challenges in Big Data Analysis

Today I had the opportunity to attend a great talk by Jürgen Pfeffer, Assistant Research Professor at Carnegie Mellon’s School of Computer Science. Pfeffer talked broadly about the methodological challenges of big-data social science research.

Increasingly, he argued, social scientists are reliant on data collected and curated by third-party – often private – sources. As researchers, we are less intimately connected with our data, less aware of the biases that went into its collection and cleaning. Rather, in the era of social media and big data, we turn on some magical data source and watch the data flow in.

Take, for instance, Twitter – a platform whose prevalence and open API make it a popular source for scraping big datasets.

In a 2013 paper with Fred Morstatter, Huan Liu, and Kathleen M. Carley, Pfeffer assessed the representativeness of Twitter’s streaming API. As the authors explain:

The “Twitter Streaming API” is a capability provided by Twitter that allows anyone to retrieve at most a 1% sample of all the data by providing some parameters…The methods that Twitter employs to sample this data is currently unknown.

Using Twitter’s “Firehose” – an expensive service that allows for 100% access – the researchers compared the data provided by Twitter’s API to representative samples collected from the Firehose.

In news disturbing for computational social scientists everywhere, they found that “the Streaming API performs worse than randomly sampled data… in the case of top hashtag analysis, the Streaming API sometimes reveals negative correlation in the top hashtags, while the randomly sampled data exhibits very high positive correlation with the Firehose data.”
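The kind of comparison the authors ran can be illustrated with a rank correlation. Below is a minimal Spearman correlation in plain Python on hypothetical daily counts (ties are not handled); the point is only that a sample can rank-order activity very differently from the full stream:

```python
def spearman(xs, ys):
    """Spearman rank correlation for equal-length samples (ties not handled)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Hypothetical daily counts of one hashtag: sampled stream vs. full Firehose
streaming = [120, 100, 80, 60, 40]
firehose = [118, 105, 140, 190, 260]
print(f"rank correlation: {spearman(streaming, firehose):.2f}")
```

Here the sampled counts trend down while the full-stream counts trend up, yielding a strongly negative rank correlation – exactly the failure mode the paper reports for top-hashtag analysis.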

In one particularly telling example, the team compared the raw counts from both the API and the Firehose of tweets about “Syria”. The API data shows high initial interest, tapering off around Christmas and seemingly starting to pick up again in mid-January. You may be prepared to draw conclusions from this data: people are busy over the holidays; they are not on Twitter or not attentive to international issues at this time. It seems reasonable that there might be a lull.

But the Firehose data tell a different story: the API initially provides a good sample of the full dataset, but then, as the API shows declining mentions, the Firehose shows a dramatic rise in mentions.


Rather than indicating a change in user activity, the decline in the streaming data is most likely due to a change in Twitter’s sampling methods. But since neither the methods nor announcements of changes to the methods are publicly available, it’s impossible for a researcher to properly know.

While these results are disconcerting, Pfeffer was quick to point out that all is not lost. Bias in research methods is an old problem; indeed, bias is inherent to the social science process. The real goal isn’t to eradicate all bias, but rather to be aware of its existence and influence. To, as his talk was titled, know your data and know your methods.