Google News as an Analytic Database ~ Suresh Reddy Medapati Blog

I was browsing the L.A. Times digital newspaper recently and came across an interesting article entitled “Mexico, before and after Calderon’s drug war.” The horrific toll of Mexico’s war on drugs – over 50,000 deaths in the last six years alone – is well documented and widely attributed to then-President Felipe Calderón’s initiative, launched in late 2006.

The article cites a report, “Drug Violence in Mexico,” by the Trans-Border Institute at the University of San Diego, which proposes, however, that the violence probably started several years before Calderón assumed office. This is important “because some critics argue that the Calderón administration launched its assault on drug traffickers as a political move to legitimize the administration after the controversial presidential election of 2006.”
What especially caught my eye was that one of the report’s authors, doctoral candidate Viridiana Ríos, concluded that the violence began earlier by “combining available crime figures with ‘a multiple imputation algorithm and Bayesian statistics’ as part of her Ph.D. dissertation” in Government at Harvard University. Further investigation of Ríos led me to her current home at Harvard’s Institute for Quantitative Social Science, and to an even more fascinating study, “Knowing Where and How Criminal Organizations Operate Using Web Content,” showcasing a new framework for political science research.
The “Knowing” paper introduces a methodology “that uses Web content to obtain quantitative information about a phenomenon that would otherwise require the operation of large scale, expensive intelligence exercises. Exploiting indexed reliable sources such as online newspapers and blogs, we use unambiguous query terms to characterize a complex evolving phenomena and solve a security policy problem: identifying the areas of operation and modus operandi of criminal organizations, in particular, Mexican drug trafficking organizations over the last two decades ... our findings provide evidence that criminal organizations are more strategic and operate in more differentiated ways than current academic literature thought.”
The authors propose an approach called MOGO – Making Order using Google as an Oracle – that engages both computer science and social science disciplines. The point of departure for MOGO is to organize the search topics, or “actors,” of interest into categorized collections, or “actor lists.” The lists are in turn combined into query sets. Finally, these sets are submitted to a “crawler” program that queries a knowledge base “oracle” and accumulates statistics for subsequent analyses.
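Here is a rough sketch of how I picture the actor-list, query-set and crawler steps fitting together. It is my own illustration, not the authors’ code: the actor names, the `build_queries` and `crawl` helpers, and the `fake_oracle` stand-in are all hypothetical, and no real search service is called.

```python
from itertools import product

# Hypothetical actor lists; the paper's real lists are not reproduced here.
ACTOR_LISTS = {
    "organizations": ["Organization A", "Organization B"],
    "places": ["Municipality X", "Municipality Y", "Municipality Z"],
}


def build_queries(actor_lists):
    """Combine one actor from each list into a quoted query string."""
    names = sorted(actor_lists)  # fixed order so the output is reproducible
    queries = []
    for combo in product(*(actor_lists[name] for name in names)):
        queries.append(" ".join(f'"{actor}"' for actor in combo))
    return queries


def crawl(queries, oracle):
    """Ask the oracle for a hit count per query and collect the results."""
    return {query: oracle(query) for query in queries}


def fake_oracle(query):
    """Stand-in for a real search backend: returns a made-up count."""
    return len(query)


queries = build_queries(ACTOR_LISTS)
hits = crawl(queries, fake_oracle)
for query, count in hits.items():
    print(query, count)
```

Passing the oracle in as a plain function keeps the crawler independent of any particular search service, so a real backend could be swapped in for `fake_oracle` without touching the rest.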
For this investigation, the crawler is a Python program that uses a Google API to return JSON pages which are then parsed and “hits” tabulated. Google News, a generally reliable indexed source of newspapers and blogs, is the oracle that’s queried. The output from this step is then fed to a knowledge discovery process that validates, cleans and refines the data. “To clean, we normalized the total number of hits we are getting using a hyper-geometric cumulative distribution ... To validate, we compared information from other cases, cases in which information is known with certainty, with the one we extracted using MOGO.”
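The quote doesn’t spell out exactly how the hypergeometric distribution is applied, so the following is only one plausible reading, sketched under my own assumptions. Treat the oracle’s index as `total` documents, of which `with_org` mention an organization and `with_place` mention a place; the cumulative hypergeometric probability then says how unusual the observed number of documents mentioning `both` would be if the two were unrelated. The function names and the counts are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k marked items in n draws from N items, K of them marked."""
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_cdf(k, N, K, n):
    """P(X <= k), the cumulative distribution."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(0, k + 1))


def cooccurrence_score(both, total, with_org, with_place):
    """P(X < both): near 1 when co-occurrence exceeds what chance predicts."""
    return hypergeom_cdf(both - 1, total, with_org, with_place)


# Made-up counts: 30 joint hits, well above the 5 expected by chance
# (with_place * with_org / total = 50 * 100 / 1000).
score = cooccurrence_score(both=30, total=1000, with_org=100, with_place=50)
print(score)
```

A score like this puts hit counts for heavily covered and lightly covered places on a common 0-to-1 scale, which is presumably the point of normalizing. The exact-integer `comb` arithmetic is fine for small counts but would be slow for web-scale totals.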
Based on analytics derived from the MOGO methodology, the “Knowing” research was able to articulate the geographic behavior of 13 drug-trafficking organizations in Mexico, including their migration patterns and “marketing strategies.” One provocative finding: “our information provides the first portrait of the market structure of the illegal drug trafficking within Mexico and of its changes over time. Mexico's organized crime is not the oligopoly the theoretical literature of organized crime and private protection rackets assumes; rather, drug trafficking organizations share territories frequently.”
All the data science pieces – computer science, social science and statistics – converge to identify the trafficking organizations’ “phenotype.” With the different characteristics that emerge from the refinement process, the study deploys the k-means clustering algorithm to identify four classes of trafficking organizations: “Traditional,” which have been in operation the longest; “New,” which emerged on average 10 years after Traditional; “Competitive,” which operate in territories where others are already trafficking; and “Expansionary Competitive,” which are both expansionary and explorative.
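For readers who haven’t met k-means, here is a bare-bones version of the algorithm run on invented data. The two features per organization and all of the numbers are hypothetical, not taken from the paper, and the farthest-point seeding is simply a deterministic choice I made so the example is reproducible.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def farthest_first(points, k):
    """Seed centroids: start at points[0], then repeatedly take the point
    farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, labels


# Invented, pre-scaled features: (relative age, share of contested territory).
POINTS = [
    (0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
    (0.9, 0.1), (0.8, 0.1), (0.9, 0.2),
    (0.1, 0.9), (0.2, 0.9), (0.1, 0.8),
    (0.9, 0.9), (0.8, 0.9), (0.9, 0.8),
]

centroids, labels = kmeans(POINTS, k=4)
print(labels)
```

The cluster numbers themselves are arbitrary. What matters is which organizations end up sharing a label, and the analyst still has to look at each cluster’s centroid to decide what name it deserves.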
The paper concludes that the MOGO methodology is a great starting point for gathering low-cost intelligence information. That MOGO passed several important “face validity” comparisons of its findings with known results of other investigations says a lot about its computer science and statistics chops. And that its findings debunk several generally held notions of the homogeneity of trafficking organizations’ operations will get the attention of social science and policy researchers.
For me, this is more evidence of the tight emerging ties between the quantitative social science of academia and the data science of business. Very cool stuff.

Original Source : http://www.information-management.com/blogs/google-news-as-an-analytic-database-10023887-1.html
