Calculating Words: April 2010

Monday, April 19, 2010

Some Cool Processing Creations

I've begun my introduction to Processing, and have been searching the interwebs for cool little pieces of art produced in the environment. I came across a couple that I think are worth sharing. They're not exactly dataviz related, but they are a pretty stunning use of Processing. This one appears to show a movie (5th Element?) frame by frame across a wall:

This one's a little weird, unless you're into (very) modern interpretive dance... interpretive augmented reality?

Here's an audio visualization of a song. I can't tell if the entire thing is programmed, or if it's more... custom to the song, with each lyric crafted to appear exactly when it does. I think some iTunes (and WinMedia Player) visualizations have tried to incorporate song lyrics into their graphics, but I haven't seen one that's really awed me much. This one seems ok tho...

I can't wait to work this into my analyst's toolkit. Simply from the one tutorial I've completed, I can tell there's a LOT still to learn. I don't need to know it all, which helps, but the things I do need to learn are going to take some time.

Pick o' the Post: Cold Shot by Stevie Ray Vaughan on Couldn't Stand the Weather

Saturday, April 17, 2010

Processing and... processing.

Pick o' the Post: The Odyssey by Symphony X on The Odyssey

This is about as epic as a song can get. Yes, more epic than Rush - gimme a break.. this is Symphony X. This song is a 24-minute musical rendition of Odysseus' adventures in Homer's Odyssey. I don't really know the story that well, to be honest, but I did get the privilege to experience this (entire) song performed live in Atlanta. That is something I will never forget. I was there to hear them perform their latest album Paradise Lost, which is a rendition of the classic epic poem. The entire album is an amazing feat, but the title track is pretty cool for metal/non-metal audiences alike.

So, I got my hands dirty with Processing last night, and I'm glad I did. I was following along with a post from a blog I follow, and I'm really excited to get some more experience here. Processing feels like a bridge from working strictly within R or Stata or Open Office for plotting solutions. Processing feels much more open to the imagination. The difference is that there's quite a bit more programming involved in developing graphics in Processing than there is in R or Stata, and certainly Open Office.

I haven't figured out how to export images yet - actually, I ran across it once, but I'm too lazy to figure that out right now. So, these are screenshots of the graphics I created with the tutorial. First, I'll briefly explain the data and the analysis that's going on. Jer sent a request out on twitter to have any interested followers tweet a random number (from their head):

He put the 225 human-generated random numbers into this (publicly available) Google spreadsheet. The tutorial works with data stored, more generally, in a remote location, like a Google spreadsheet. He cites not having to change filepath names when data moves on your own system as a good reason to try to keep data in a centralized location ("centralized" is relative). I can relate to that sentiment... with much (if not all) of our data stored on our servers in the office, I know exactly where the data I want at any particular moment is, and don't have to fumble around trying to find the data I want to load into memory for analysis.

Anyway, from here Jer leads readers through a number of methods to analyze the data. Below is row after row of machine-generated random numbers, and one of those rows is the human-generated data. Each column is a number 1-99, and each machine-generated row represents 255 random numbers. The brightness of the ellipse indicates how many times that number is present in that row's set of numbers.
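The counting behind a row like that can be sketched in a few lines. This is Python rather than Processing, and it is not Jer's tutorial code: the function names, the seed argument, and the choice to scale the most frequent number to a brightness of 255 are all assumptions made for illustration.

```python
import random
from collections import Counter


def brightness_row(numbers, low=1, high=99):
    """Return one brightness value (0-255) per column low..high.

    Assumption: the most frequent number in the row gets full
    brightness (255) and everything else is scaled relative to it.
    """
    counts = Counter(numbers)
    peak = max(counts.values(), default=0)
    if peak == 0:
        # Empty row: every column stays dark.
        return [0] * (high - low + 1)
    return [round(255 * counts.get(n, 0) / peak) for n in range(low, high + 1)]


def machine_row(size, seed=None, low=1, high=99):
    """Generate `size` pseudo-random integers in low..high (inclusive)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(size)]


if __name__ == "__main__":
    # One machine-generated row, sized as described in the post.
    row = brightness_row(machine_row(255, seed=1))
    print(len(row), max(row))
```

The row size is a parameter, so the same function works for the human-generated set as well as the machine-generated ones.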

So, this is a crude, first round of analysis. Obviously, it's damn near impossible to visually pick out the human-generated row. It's the 37th row from the bottom of the image (36th row from the top). I adjusted some parameters to highlight our dataset of interest - it's approximately twice as bright as the other rows in this image. I also summarized some of the next plots into one graphic below this one, observing the increased visual definition that we can get from a bar graph, and adding color gradients to emphasize various bits of data.

Before I go any further with the tutorial: Jer emphasized the idea of perspective. This is why we first observed the bright/dull points, then the bar graphs, which we then applied color to. Comparing this bottom-most graph to 6 rows of machine-generated numbers, our human-generated data starts to look a little outlier-ish. Our data is the top row of the next image.

It occurred to me that an extra step in manipulating this graphic might make this "outlier-ish" observation more clear. Ordering the bars based on their height (color), we should be able to get a better idea of what the difference is here. Without this additional adjustment, maybe it's that the observer is left to "calculate," in some sense, their own order to compare each row.
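A minimal sketch of that reordering step, again in Python rather than Processing and with a made-up function name: count how often each number appears, then sort the bar heights from tallest to shortest so every row can be compared on the same footing.

```python
from collections import Counter


def sorted_heights(numbers, low=1, high=99):
    """Bar heights (counts per number in low..high), tallest first.

    Numbers outside low..high are ignored. Sorting throws away which
    number each bar belongs to; only the shape of the row is kept.
    """
    counts = Counter(numbers)
    heights = [counts.get(n, 0) for n in range(low, high + 1)]
    return sorted(heights, reverse=True)


if __name__ == "__main__":
    # Tiny hand-made example over the range 1..10.
    print(sorted_heights([7, 7, 3, 9], low=1, high=10))
```

A row where a few favorite numbers get picked over and over should show a taller, steeper left edge than a row spread evenly across the range.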

Now, the dissimilarity is a bit clearer. There are at least a few numbers that our human subjects tend to pick, seemingly, a bit more often than random. Jer continues on to display two more visual representations of this same data to try to find some pattern. First, using a grid with color gradients. The first row is 1-10; second, 11-20; and so on.

Then displaying the same grid, but displaying the numbers (colored with a gradient) instead of the squares.

Aside from the Douglas Adams effect (#42), as Jer points out, the 225 followers who sent in these random numbers seem to have chosen numbers ending in 7 quite a bit more than we might expect. He wonders whether there is something about the number 7 that seems "more random" to us (or, less generically, to his 225 participants). Interesting, though.
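One way to put a number on "more than we might expect" is to tally last digits and compare against what an even spread over 1-99 would give. The sketch below is Python, not part of the tutorial, and the function names are invented here; it uses a tiny made-up list rather than the real spreadsheet data.

```python
from collections import Counter


def last_digit_shares(numbers):
    """Share of the numbers ending in each digit 0-9."""
    total = len(numbers)
    if total == 0:
        return {d: 0.0 for d in range(10)}
    counts = Counter(n % 10 for n in numbers)
    return {d: counts.get(d, 0) / total for d in range(10)}


def expected_share(digit, low=1, high=99):
    """Share of low..high ending in `digit` if every number were equally likely."""
    values = range(low, high + 1)
    return sum(1 for n in values if n % 10 == digit) / len(values)


if __name__ == "__main__":
    picks = [7, 17, 42, 10]  # made-up stand-in for the real data
    print(last_digit_shares(picks)[7], expected_share(7))
```

For 1-99 the even-spread share for a final 7 works out to 10 out of 99 numbers, so an observed share well above roughly 10% is what the grid is hinting at.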

I'm glad I made it through this (my first) Processing tutorial. I'm looking forward to applying these tools and concepts to the unique data that Booklamp affords my imagination.

Have a wonderfully data-filled day.

Wednesday, April 14, 2010

Validata

I attended a close cousin's wedding this weekend (congratulations John and Callie), and on the trip out there I was going on about how important collecting and understanding data will be to our society's future. My father kept stressing that 'the ones who control whatever data it is will continue to have the ability to manipulate it'. Yes, maybe, but with diminishing effectiveness.

FlowingData posted an article today about TransparencyData, a new project aimed at making data more accessible to the public. TransparencyData is one of a number of projects starting up with the intent of allowing the public to inform themselves. These are mostly government-data-type sites for now, but these projects will inspire development into more specific sectors of our economy and our daily lives.

My argument was that, in time, we will reach a point where data has the ability to validate and invalidate itself. Corruptions and fraudulent uses of data will be more visible to anyone interested in the information that some given dataset provides.

We're only now seeing what might be some sort of start to this notion. Making data and information more observable than ever is the first step. Included in this is the concept of linked data.

I went so far in my argument for what data will do for our society as to say that at some point we will be governed more and more by data. On the surface, that sounds like very USSR/planned-economy/scary type talk. But I don't think that's what I mean. The ultimate social decisions, I would think, should still be made via the democratic process. Data simply provides an avenue for more intelligent decisions to be made, and leaves less room for fraud and other mishandling in government.

Monday, April 12, 2010

Cleeshei

Pick o' the Post: "Chasin' the Trane" by John Coltrane ... gets a little "run-on-ish," but I think PBS puts it best: "it's not about every word being right... it's a novel [, not a poem]." I just heard this for the first time tonight and there are some amazing moments in this 80+-chorus-long solo. There are some spots that require some endurance by the listener as well - like one part, around 11:00 that just sounds like an elephant... and then he makes some other unintelligible noises. Definitely worth the 15 minutes. Listen as we contemplate how we might measure a cliché... ooohhhhhmmmmm,

My flight'd just gotten in to BOI at about 7:45p this evening, and I gave Aaron a call about some questions he had. He told me, "don't work too hard"; and I felt the complete opposite since I'd missed Thursday and Friday of last week. So, I responded with a twitter reply: I guess we would be more susceptible to overworking ourselves, when our work is a passionate one. I'll be careful.

It's true, but... it sounds preachy. Oh well. It got me thinking about clichés. What are clichés? ... computationally, what are they? .. is that possible?

I remember, as a kid, really wanting to be able to say the right thing at the right time... all the time, like a wise, old man. That seemed very valuable to me - poignant little truths. A very efficient form of communication maybe. My work at Booklamp is a little ironic in that I don't really seek out time to read, but always (in retrospect) had a passion for how to use words. I've come to learn that saying the right thing at the right time - when you pull it off - is much more than words, but the words are (most times) the most critical ingredient to the social concoction that elicits that electric feeling and sense of understanding. This is very much a result of how my father raised me - Happy birthday Dad (April 12th)! ... one word: details.

Anyway, it occurred to me that clichés have similar properties: they're efficient, sometimes comedic, poignant little sayings or phrases. What's a cliché? Computationally, what's a cliché?

I'm not (directly) trained in how this question might be answered, but I think my definition is a thought-provoking one. A cliché is:

The redundancy of some given meaning, conditioned on a given context.

There's probably much more to it than that, but that (for me) does a pretty good job of explaining what I think I believe [<=redundancy, but no cliché..] a cliché is. Come to think of it, maybe there should be a measure of irony in a general definition of what a cliché is. Regardless, the definition above sounds very much like a matter of probability. The trick - which, I know, is not an easy trick - is to create some sort of measurement to quantify a cliché's "meaning" and "context". Then you gotta use that info to separate (somehow) ordinary redundancy from clichéd redundancy, and context. That sounds like a lot of work.
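To make the "matter of probability" reading a little more concrete, here is a deliberately naive Python sketch. It is not a real cliché measure and not anything Booklamp does: it assumes someone has already labeled each phrase with a context (the hard part), treats exact phrase matches as "same meaning", and just estimates how often a phrase shows up given that context. The function name and the example labels are invented for illustration.

```python
def conditional_phrase_probability(observations, phrase, context):
    """Estimate P(phrase | context) from (context, phrase) pairs.

    `observations` is any iterable of (context, phrase) tuples.
    Returns 0.0 when the context never appears.
    """
    context_total = 0
    joint = 0
    for seen_context, seen_phrase in observations:
        if seen_context == context:
            context_total += 1
            if seen_phrase == phrase:
                joint += 1
    if context_total == 0:
        return 0.0
    return joint / context_total


if __name__ == "__main__":
    # Made-up, hand-labeled observations.
    observations = [
        ("farewell", "don't work too hard"),
        ("farewell", "don't work too hard"),
        ("farewell", "see you monday"),
        ("advice", "don't work too hard"),
    ]
    print(conditional_phrase_probability(observations, "don't work too hard", "farewell"))
```

A phrase that dominates its context this way is at least redundant; telling ordinary redundancy apart from clichéd redundancy is the part this sketch does not attempt.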

I know for a fact that is very hard. But if you can do it, then maybe you too can measure the clichéiness in your life, or your books. If Booklamp does it, then we'll possibly know something about how original a text might be... compared with millions of other texts. Maybe later we'll license services to measure your own clichéiness... probably not though.

My job has taught me that statistical inference can be a bitch sometimes. I have no idea what I'd do without my textbooks and the interwebs. Thank you Algore (I made it all one name now.. Algore) for our great series of tubes... I don't know what I'd do without you... Well put Senator Ted Stevens, very well put.

Tuesday, April 6, 2010

IBM Many Bills

I stumbled across IBM's newest data app... Many Bills. Haven't had a chance to thoroughly check it out, but it appears to make perusing and consuming the many thousands of pages that bills can consist of a somewhat less daunting task.

I'm really starting to like IBM's perspective on our world, in the midst of a paradigm shift. IBM's previous data creation for all of us to enjoy was Many Eyes, which allows users to analyze and visualize public and user-uploaded data. There's only one other company I'd want to work for when it comes to data... Booklamp.org

Pick o' the Post: "Heir Apparent" by Opeth on Watershed (2008)

PS Here's a neat little real-time news app that's supposed to show where news is coming from and some other info... haven't gotten too far into it, because it doesn't have anything to do with my job (I'm at work).
