30+ free tools for data visualization and analysis

The chart below originally accompanied our story 22 free tools for data visualization and analysis (April 20, 2011). We're updating it as we cover additional tools, including 8 cool tools for data analysis, visualization and presentation (March 27, 2012), Startup offers 1-click data analysis (Aug. 29, 2012), Infogr.am offers quick Web charts (Oct. 16, 2012) and Create simple, free charts with Datawrapper (Nov. 21, 2012). Click through to those articles for full tool reviews.
Features: You can sort the chart by clicking on any column header once to sort in ascending order and a second time to sort in descending order (browser JavaScript required).
Skill levels are represented as numbers from easiest to most difficult to learn and use:
  1. Users who are comfortable with basic spreadsheet tasks
  2. Users who are technically proficient enough not to be frightened off by spending a couple of hours learning a new application
  3. Power users
  4. Users with coding experience or specialized knowledge in a field like GIS or network analysis.

Data visualization and analysis tools


Tool | Category | Multi-purpose visualization | Mapping | Platform | Skill level | Data stored or processed | Designed for Web publishing?
Data Wrangler | Data cleaning | No | No | Browser | 2 | External server | No
Google Refine | Data cleaning | No | No | Browser | 2 | Local | No
R Project | Statistical analysis | Yes | With plugin | Linux, Mac OS X, Unix, Windows XP or later | 4 | Local | No
Google Fusion Tables | Visualization app/service | Yes | Yes | Browser | 1 | External server | Yes
Impure | Visualization app/service | Yes | No | Browser | 3 | Varies | Yes
Many Eyes | Visualization app/service | Yes | Limited | Browser | 1 | Public external server | Yes
Tableau Public | Visualization app/service | Yes | Yes | Windows | 3 | Public external server | Yes
VIDI | Visualization app/service | Yes | Yes | Browser | 1 | External server | Yes
Zoho Reports | Visualization app/service | Yes | No | Browser | 2 | External server | Yes
Choosel | Framework | Yes | Yes | Chrome, Firefox, Safari | 4 | Local or external server | Not yet
Exhibit | Library | Yes | Yes | Code editor and browser | 4 | Local or external server | Yes
Google Chart Tools | Library and visualization app/service | Yes | Yes | Code editor and browser | 2 | Local or external server | Yes
JavaScript InfoVis Toolkit | Library | Yes | No | Code editor and browser | 4 | Local or external server | Yes
Protovis | Library | Yes | Yes | Code editor and browser | 4 | Local or external server | Yes
Quantum GIS (QGIS) | GIS/mapping: Desktop | No | Yes | Linux, Unix, Mac OS X, Windows | 4 | Local | With plugin
OpenHeatMap | GIS/mapping: Web | No | Yes | Browser | 1 | External server | Yes
OpenLayers | GIS/mapping: Web, library | No | Yes | Code editor and browser | 4 | Local or external server | Yes
OpenStreetMap | GIS/mapping: Web | No | Yes | Browser or desktops running Java | 3 | Local or external server | Yes
TimeFlow | Temporal data analysis | No | No | Desktops running Java | 1 | Local | No
IBM Word-Cloud Generator | Word clouds | No | No | Desktops running Java | 2 | Local | As image
Gephi | Network analysis | No | No | Desktops running Java | 4 | Local | As image
NodeXL | Network analysis | No | No | Excel 2007 and 2010 on Windows | 4 | Local | As image
CSVKit | CSV file analysis | No | No | Linux, Mac OS X or Windows with Python installed | 3 | Local | No
DataTables | Create sortable, searchable tables | No | No | Code editor and browser | 3 | Local or external server | Yes
FreeDive | Create sortable, searchable tables | No | No | Browser | 2 | External server | Yes
Highcharts* | Library | Yes | No | Code editor and browser | 3 | Local or external server | Yes
Mr. Data Converter | Data reformatting | No | No | Browser | 1 | Local or external server | No
Panda Project | Create searchable tables | No | No | Browser with Amazon EC2 or Ubuntu Linux | 2 | Local or external server | No
PowerPivot | Analysis and charting | Yes | No | Excel 2010 on Windows | 3 | Local | No
Weave | Visualization app/service | Yes | Yes | Flash-enabled browsers; Linux server on backend | 4 | Local or external server | Yes
Statwing | Visualization app/service | Yes | No | Browser | 1 | External server | Not yet
Infogr.am | Visualization app/service | Yes | Limited | Browser | 1 | External server | Yes
Datawrapper | Visualization app/service | Yes | No | Browser | 1 | Local or external server | Yes
*Highcharts is free for non-commercial use and $80 for most single-site-wide licenses.

Startup offers 1-click data analysis

Spreadsheets are a good tool for looking at data, but if you want more robust insight into your information, software like SAS and SPSS can be daunting for the non-statistically savvy. "There's a huge gap between Excel and the high-end tools," argues Greg Laughlin, whose fledgling startup Statwing hopes to fill part of that space.
In fact, Excel includes a reasonable number of statistical functions -- the issue is more that even many power users don't know how and when to use them. The idea behind Statwing is to provide some basic, automated statistical analysis on data that users upload to the site -- correlations, frequencies, visualizations and so on -- without requiring you to know when, say, to use a chi-squared distribution versus a z-test.
Once you upload (or copy and paste) data to Statwing, you can select different variables to be used in analysis. The site determines what tests to run on the data depending on the characteristics of the factors you pick, such as your data's sample size and whether variables are binary (i.e. "for" and "against") or continuous (such as a range of numbers).
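To make that logic concrete, here is a minimal sketch in Python using scipy. This is my own illustration of the idea described above, not Statwing's code; the is_binary rule and the pick_and_run_test function are assumptions made for the example.

# Minimal sketch (not Statwing's code): choose a statistical test based on
# whether the two selected variables are binary or continuous.
import numpy as np
from scipy import stats

def is_binary(values):
    """Treat a variable as binary if it takes exactly two distinct values."""
    return len(set(values)) == 2

def pick_and_run_test(x, y):
    x, y = np.asarray(x), np.asarray(y)
    if is_binary(x) and is_binary(y):
        # Two categorical variables: chi-squared test on the contingency table.
        table = np.array([[np.sum((x == a) & (y == b)) for b in np.unique(y)]
                          for a in np.unique(x)])
        chi2, p, dof, _ = stats.chi2_contingency(table)
        return "chi-squared", p
    if is_binary(x) and not is_binary(y):
        # Binary vs. continuous: compare the two groups' means with Welch's t-test.
        groups = [y[x == a] for a in np.unique(x)]
        t, p = stats.ttest_ind(groups[0], groups[1], equal_var=False)
        return "t-test", p
    # Two continuous variables: Pearson correlation.
    r, p = stats.pearsonr(x, y)
    return "correlation", p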
In one demo, data on Congressional SOPA/PIPA positions was matched with campaign donations from both the pro-SOPA/PIPA entertainment industry and anti-SOPA/PIPA tech lobbies. Statwing's analysis showed a "medium clearly significant" correlation between a legislator's support for SOPA/PIPA and the amount of entertainment industry political contributions he or she received (although there was no statistical significance between opposition to SOPA/PIPA and tech industry contributions).
Sample Statwing analysis card

In the Statwing advanced tab, you can see how the site reaches its conclusions. In the SOPA/PIPA example, the correlation was determined via a ranked T-test, a variation on a statistical test that checks for differences between two groups when their variances -- that is, how much the values are spread out from the group's average -- may be unequal.
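One way to approximate that idea, as a sketch of my own rather than Statwing's implementation, is to rank-transform the pooled values and then run Welch's unequal-variance t-test on the ranks; the sample figures below are made up.

# Sketch of a "ranked t-test": rank the pooled values, then apply Welch's
# t-test (which does not assume equal variances) to the ranks.
import numpy as np
from scipy import stats

def ranked_t_test(group_a, group_b):
    pooled = np.concatenate([group_a, group_b])
    ranks = stats.rankdata(pooled)          # ranks across both groups combined
    ranks_a = ranks[:len(group_a)]
    ranks_b = ranks[len(group_a):]
    return stats.ttest_ind(ranks_a, ranks_b, equal_var=False)

# Example with hypothetical contribution amounts for supporters vs. opponents.
supporters = [120000, 45000, 98000, 150000, 30000]
opponents = [20000, 15000, 60000, 10000, 25000]
print(ranked_t_test(supporters, opponents))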
The site's analysis also found a medium significance between age and support for SOPA/PIPA, with the average age of Congressional supporters almost 6 years higher than that of opponents.
Statwing currently keeps all data and analyses private, but plans in the works will allow users to share links to data, download and export results and, eventually, embed analyses and data in a Web page. For now, the company consists solely of its two founders: Laughlin, a former consultant and product manager who sought easier data analysis tools, and John Le, an engineer and data scientist. Both are Stanford grads who previously worked at CrowdFlower.
Statwing was built using the Clojure programming language, Laughlin said, for "actual math" and data handling (not using, as I'd assumed, the R Project for Statistical Computing as the statistics engine); some Ruby on Rails for packaging and Web basics; CoffeeScript, which aims to simplify JavaScript syntax; Backbone for organizing front-end JavaScript; and the D3 JavaScript library for visualization. The company launched from the Y Combinator entrepreneurial incubator program just last week.
Just how useful is Statwing? An automated data analysis service in the cloud is certainly no replacement for an in-house data scientist who can mine your mission-critical data. And, I'd be hard pressed to recommend making a multi-million-dollar business decision based on an automated analysis alone -- especially from a site that's still in beta. No automated tool can ask customized questions about the integrity of your data set or raise a red flag when you're jumping the gun from correlation to causation. Nevertheless, Statwing looks like an appealing resource for professionals who want to try taking their data skills up a notch from means, medians and pivot tables in Excel; it's an interesting way to learn at least one approach to statistically analyzing a data set, or perhaps brush up on statistical skills that have gone a little rusty since college.
If you sign up for the public beta, you can currently use the site for free. There will be a limited free option in the future, Laughlin said, with such accounts restricted to analyzing and storing just one data set at a time. Paid accounts will likely run anywhere from $20 to $30 a month up to a couple of hundred dollars a month.


Source

Oracle buys DataRaker for its big data analysis tools

Oracle is planning to buy DataRaker in a move that will give it a cloud-based platform for analyzing data from smart meters used by energy utilities. Terms of the deal, which was announced Thursday, weren't disclosed.
Smart meters and the massive amounts of information they generate are frequently linked with the industry buzzword "big data," which typically refers to unstructured data formats. Oracle's pending acquisition of DataRaker ties into a broad movement by software vendors to sell products that customers can use to crunch these data sets for valuable insights.
DataRaker's technology will be combined with Oracle's application offerings for utilities, according to a FAQ document on the deal released Thursday.
DataRaker offers a number of "high-performance, pre-packaged applications that can address many complex analytical challenges currently being faced by the utilities industry," according to the FAQ.
For example, DataRaker can help customers shorten the time it takes to handle calls, cut the number of field service appointments and give customers more personalized information, the FAQ adds.
DataRaker's staff will be rolled into Oracle Utilities. Its software is deployed at a number of utilities, covering more than 17 million smart meters, according to an Oracle presentation.

Source

What is business intelligence?

Business intelligence (BI) is the ability of an organization to collect, maintain, and organize data. This produces large amounts of information that can help develop new opportunities. Identifying these opportunities, and implementing an effective strategy, can provide a competitive market advantage and long-term stability.[1]
BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.
The goal of modern business intelligence deployments is to support better business decision-making. Thus a BI system can be called a decision support system (DSS).[2] Though the term business intelligence is sometimes a synonym for competitive intelligence (because they both support decision making), BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. If understood broadly, business intelligence can include the subset of competitive intelligence.[3]

Top 5 Challenges of Data Warehousing

Data warehousing projects are a breed of their own. Not all of them pose the same challenges, and not all of them are complex, but they are always different. This article illustrates the top 5 challenges that often plague modern data warehousing developments. Knowing these challenges up front is your best bet for avoiding them.
Data warehousing is different because, unlike many software projects, data warehouse projects are not developed with a particular front-end application in mind. For the most part, these projects depend heavily on back-end infrastructure in order to support front-end client reporting. Moreover, the number of stakeholders involved in a data warehousing project is usually greater than in a typical IT project.
But these are not the only reasons data warehousing is difficult. The list below covers the top 5 issues that make things genuinely complex in practice.

Ensuring Acceptable Data Quality

Disparate data sources add to data inconsistency
More often than not, a data warehouse consumes data from disparate sources. Most of these data sources are legacy systems maintained by the client, usually managed by different people in different business departments. Because the business lines supported by these systems differ, the users of one system are often oblivious to the features or capabilities of another. As a result, many business processes and data are duplicated across systems, and their semantics differ. For example, the definition and calculation of revenue in the “direct sales” department may differ from that of the “retail sales” department, and the list of customers maintained by the “sales” department may differ in size and metadata quality from the list maintained by the “marketing” department.
When a data warehouse sits in the middle and tries to integrate data from such systems, it runs into inconsistent data, repetitions, omissions and semantic conflicts. All of these lead to data quality challenges, and resolving them is difficult because business users have limited knowledge outside the scope of their own systems.
Another common cause of data quality issues is that data in source systems is often stored in non-structured formats such as flat files and MS Excel spreadsheets. These formats are inherently susceptible to redundancy and duplication.
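To make the kind of cleanup involved concrete, here is a small, hypothetical Python/pandas sketch. The column names, sample records and matching rule are invented for illustration; a real integration effort would be driven by the business's own keys and standards.

# Hypothetical sketch: merge customer lists from "sales" and "marketing",
# standardizing names and e-mail addresses so cross-system duplicates surface.
import pandas as pd

sales = pd.DataFrame({"cust_name": ["ACME Corp.", "Globex "],
                      "email": ["info@acme.com", "hello@globex.com"]})
marketing = pd.DataFrame({"customer": ["Acme Corp", "Initech"],
                          "email": ["INFO@ACME.COM", "sales@initech.com"]})

def standardize(df, name_col):
    out = df.rename(columns={name_col: "customer_name"}).copy()
    out["customer_name"] = (out["customer_name"].str.strip()
                            .str.rstrip(".")
                            .str.lower())
    out["email"] = out["email"].str.strip().str.lower()
    return out

combined = pd.concat([standardize(sales, "cust_name"),
                      standardize(marketing, "customer")],
                     ignore_index=True)
# Treat the cleaned e-mail address as the business key for deduplication.
deduped = combined.drop_duplicates(subset="email")
print(deduped)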
Unstable source systems
Data quality issues do not always originate in legacy systems. In fact, they can be even more damaging when a source system is comparatively new and has not fully stabilized by the time the data warehouse is developed. In some rare cases, data warehouses are built simultaneously with their source systems; the instability and vulnerability of the source systems then often wreck the overall development of the data warehouse and ruin its data quality, because any bug in a source system potentially injects unwarranted defects into the warehouse. In my opinion, any plan to build a data warehouse simultaneously with its source systems should be avoided whenever possible.
This is why creating a data warehouse is often easier for an organization with good master data management, relational source systems, and cross-trained, knowledgeable users.

Ensuring acceptable Performance

Prioritizing performance
Many designers and users forget about performance when they first conceive the plan to implement a data warehouse for their business. As is often the case, that oversight cripples the usability of the data warehouse once it is finally built. Indeed, little can be done to improve the performance of a data warehouse after go-live, because performance objectives are far easier to design for than to tune for. Data warehouses should be built for performance, not tuned for it afterward.
Setting realistic goal
Achieving performance objectives is not easy. In the first place, setting the objectives is itself a challenging task: an untrained user can easily drift toward performance goals that are unrealistic for a given data warehousing scenario. For the users of the data warehouse, it is generally safer to state performance goals in terms of practical usability requirements. A crude example: if a business user needs a specific report to be available at 9 a.m. daily, that should be the stated performance requirement, rather than something like "the report must not run for more than 15 minutes."
Performance by design
Once reasonable performance goals are set, the next task is finding ways to achieve them. People often believe that the performance of a system depends on its hardware and that adding hardware is a good way to boost performance. That understanding is incomplete. While better hardware will generally deliver better performance, performance is more fundamental than that: it depends directly on the complexity of the system, which in turn depends on the design. To give a relevant example, think of a join operation in a database. A nested-loop join has a worst-case complexity of O(n²), whereas a merge join can do the same work in O(n log n). For practical values of n (n being the number of rows), one of these join algorithms may run a thousand times faster than the other. If the design of your system allows the database to perform a merge join instead of a nested-loop join, that yields a performance benefit quite unachievable by augmenting hardware alone, since you cannot increase the hardware a thousandfold.
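To put rough numbers on that claim, here is a bit of illustrative arithmetic (operation counts only, not a benchmark) comparing the two join strategies on a ten-million-row table:

# Illustrative arithmetic only: compare worst-case operation counts for a
# nested-loop join (O(n^2)) vs. a merge join (O(n log n)).
import math

n = 10_000_000                      # number of rows
nested_loop_ops = n * n             # ~1e14 comparisons
merge_join_ops = n * math.log2(n)   # ~2.3e8 comparisons

print(f"nested loop : {nested_loop_ops:.2e}")
print(f"merge join  : {merge_join_ops:.2e}")
print(f"ratio       : {nested_loop_ops / merge_join_ops:,.0f}x")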
Performance is a consequence of design, so performance goals are best addressed at design time. If that is not done, meeting the performance criteria later can be an overwhelming challenge.
Like anything else in data warehousing, performance should be subjected to testing, commonly termed SPT (system performance testing).

Testing the data warehouse

Testing in data warehousing is a real challenge. The typical 20% time allocation for testing is simply not enough. One reason testing is tricky is that a top-level object in a data warehouse (e.g., a BI report) typically has a large number of dependencies. For example, a single cross-subject-area report built over a dimensional data warehouse may depend on data from many conformed dimensions and multiple fact tables, which themselves depend on data from the staging layer (if any) and multiple disparate source systems.
Test planning
Because of these deep dependencies, regression testing requires a lot of planning. Making the same data available for re-testing a component may not be possible, because a fresh data load often changes the surrogate keys of dimension tables and thereby breaks the referential integrity of the test data. Running fresh testing alongside regression testing therefore becomes impossible.
One solution is to plan testing activities in batches that are in line with the batches of data loading. This must be planned with the availability of data from dependent source systems in mind, since not every source system provides data with the same extraction frequencies and windows.
No automated testing
To date, there is no foolproof generic solution for test automation in data warehouses. Not that it is impossible, but it is very difficult given the lack of standardization in how metadata is defined and how design approaches are followed across data warehousing projects. A few commercial solutions rely on the warehouse's metadata, but they require considerable customization effort to be workable. The lack of automated testing also means the testing team needs the right skill set to perform these tasks by hand.
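Even without a generic product, teams commonly script basic checks themselves. Below is a small, hypothetical sketch; the table and column names (stg_customer, dim_customer, fact_sales, customer_key) are invented, and a real test suite would cover far more than row counts and orphaned keys.

# Hypothetical sketch of a scripted data warehouse test: compare row counts
# between a staging table and its target dimension, and look for fact rows
# whose surrogate key has no matching dimension row.
import sqlite3

def run_basic_checks(conn):
    cur = conn.cursor()
    staged = cur.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
    loaded = cur.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    orphans = cur.execute("""
        SELECT COUNT(*) FROM fact_sales f
        LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
        WHERE d.customer_key IS NULL""").fetchone()[0]
    assert staged == loaded, f"row count mismatch: {staged} staged vs {loaded} loaded"
    assert orphans == 0, f"{orphans} fact rows reference a missing customer_key"
    return "basic checks passed"

# Usage (assuming the tables exist in the connected database):
# print(run_basic_checks(sqlite3.connect("warehouse.db")))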
Read more about data warehouse testing here

Reconciliation of data

Reconciliation is the process of ensuring the correctness and consistency of data in a data warehouse. Unlike testing, which is predominantly part of the software development life cycle, reconciliation is a continuous process that must be carried out even after the development cycle is over.
Reconciliation is complex
Reconciliation is challenging for two reasons. The first is the complexity of the development itself. Generally, a few critical measures are chosen from the business for reconciliation; imagine the measure is “net sales amount”. This measure is calculated independently at the source-system end and at the data-warehouse end to check whether the two figures tally. To build this, one must imitate the entire transformation logic that the data warehouse applies to the measure. One can, of course, consult the existing logic in the developed ETL layers, but developing the reconciliation is nonetheless technically involved.
The second reason reconciliation is challenging is that the reconciliation process must also meet a performance requirement that is more stringent than usual. Here is why: reconciliation acts as a certificate of correctness for the loaded data, and a successful reconciliation gives users the confidence to trust the data for their business. It is therefore imperative that the reconciliation process finishes by the time business users intend to use the data. Since reconciliation can only start after data loading completes, and must finish before users start working with the data, very little time is left for execution. Yet even within that short window, the process must calculate functionally the same measures that are calculated in the full-blown ETL process of the data warehouse.
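A stripped-down sketch of such a reconciliation follows. The SQL, table names and the "net sales amount" formula are placeholders invented for illustration, and the connection objects are assumed to support conn.execute in the sqlite3 style; real systems would use their own drivers and far richer transformation logic.

# Hypothetical reconciliation sketch: compute "net sales amount" once from
# the source system and once from the data warehouse, then compare.

def net_sales_from_source(src_conn, business_date):
    # Placeholder formula; the real logic must mirror whatever the ETL applies.
    sql = ("SELECT SUM(amount - discount_amount - return_amount) "
           "FROM sales_transactions WHERE trade_date = ?")
    return src_conn.execute(sql, (business_date,)).fetchone()[0] or 0.0

def net_sales_from_warehouse(dw_conn, business_date):
    sql = ("SELECT SUM(net_sales_amount) FROM fact_sales f "
           "JOIN dim_date d ON f.date_key = d.date_key "
           "WHERE d.calendar_date = ?")
    return dw_conn.execute(sql, (business_date,)).fetchone()[0] or 0.0

def reconcile(src_conn, dw_conn, business_date, tolerance=0.01):
    source_value = net_sales_from_source(src_conn, business_date)
    dw_value = net_sales_from_warehouse(dw_conn, business_date)
    # The two independently computed figures should tally within a tolerance.
    return abs(source_value - dw_value) <= tolerance, source_value, dw_value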
Read more about reconciliation here

User Acceptance

Last but not least is the challenge of making a newly built data warehouse acceptable to its users. No matter how good you think your data warehouse is, unless the users accept it and use it wholeheartedly, the project will be considered a failure. In fact, most data warehouse projects fail in this phase alone.
Reluctant users
A new data warehouse brings with it a new set of processes and practices for its users. In many cases, business users must give up long-standing habits built around their legacy systems and adapt to the new processes. People are, by nature, not very comfortable adapting to change, especially if they do not see a strong value proposition for doing so. Their reluctance or lack of interest in a new kind of reporting system can render the data warehouse practically useless. The challenge is to get users to accept the data warehouse organically and seamlessly.
User training, simplification of processes and designs, and confidence-building measures such as reconciliation can all help users come to terms with the new system more easily.

Original Article

Mark Zielinski on Making Use of Big Data Now

Today, I have the pleasure of welcoming Mark Zielinski, co-founder and former director at Winning Research in Toronto. He writes about analyzing social network traffic to better understand patterns and derive knowledge from them. Thanks for your contribution, Mark.
Since late 2011, the market research industry, and market research technology in general, has been very focused on the coming rise of “big data” and what it can mean for professionals in market research. There has been plenty of speculation about how the analysis of organic and passive data “floating around” out there, such as Twitter, LinkedIn, Pinterest, and Facebook traffic, could change the way we work very soon. Companies looking to stay competitive can’t keep doing the same tired old things; they need to stay on top of trends, be resourceful, and come up with creative ideas.
The general consensus seems to be that “big data” is never going to replace traditional research – that is, specific research methods like surveys and focus groups that deal with particular topics will always be around.  These specific research methods answer the question of “what” – that is, they are concerned with empirical details.  For example, an online survey may indicate that compared to 2011, this year 5% of soda drinkers no longer drink Coca-Cola on a regular basis.  Where “big data” aims to change research is in the “why” – that is, the broad trends and underlying reasons why certain results have been obtained.  Using our previous example, “big data” may be able to tell us that key nutritional influencers have recently been saying that carbonated sugary drinks reduce life expectancy by an average of 4 years in healthy individuals.  By having both the “what” results and the “why” results, researchers can use this combination of data to have a much clearer picture of a particular situation, and potentially be able to advise their clients on how to act to obtain the results they wish to achieve.
Many research companies, especially the multitude of smaller research agencies, are in a tough situation.  They see these “big data” trends and recognize their importance, but being researchers and not technologists by trade, they wait for an emerging vendor that will fill in this technological gap for them and allow them to stay relevant.  While there are emerging solutions in the market that aim to provide a “one stop shop” for companies to get the big data they need, most are not tailored specifically to market research.  Some of the biggest demand for big data analytics comes (perhaps unsurprisingly) from the finance sector, where trending topics can mean instant influence in the stock market for investors and traders.  In many cases, subscription fees for these services are in the tens of thousands of dollars per user per month, and when you have clients who are still uncertain about the value of the service, it’s a tough expense to justify for many research agencies.
What’s the solution in the current no-man’s-land of big data for market research?  Should research companies just sit back and wait for an affordable, tailored solution to come their way?  Certainly, with enough time, it’s inevitable that somebody will fill that role.  However, at that point, the competitive advantage of being the “first” in that space will have evaporated.  Once there’s a solution that everybody is using, you will become yet another commodity among the other research houses or fieldwork agencies that use that tool.
If you still haven’t invested in a good developer at your company, now may be the time to do so, unless you have a technical, inquisitive mind yourself along with some extra time. There are fantastic resources out there to begin your own foray into analyzing the constant stream of data in the online world. Check out two books from O’Reilly that will get you thinking: 21 Recipes for Mining Twitter and Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites. Although they both deal with the Python programming language, and as such require at least some basic background in computer programming, they contain a multitude of ideas and concrete examples that are very much applicable to the market research industry today. What you start with may be very basic, such as simply finding the most popular trending topics in a given location, but even that is an example of “big data” you can use to enhance the results and analyses you deliver to your clients, however minor.
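As a taste of how basic that starting point can be, here is a tiny Python example of my own. It works on tweet texts you have already collected rather than calling the Twitter API, and the sample tweets are made up.

# Minimal example: count the most common hashtags in a batch of collected tweets.
import re
from collections import Counter

tweets = [
    "Loving the new #CocaCola ad #superbowl",
    "Is #CocaCola losing ground to water? #health",
    "#health tip: cut back on sugary drinks",
]

hashtags = Counter(tag.lower()
                   for text in tweets
                   for tag in re.findall(r"#\w+", text))
print(hashtags.most_common(3))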
It’s easy to throw up your hands, say “I don’t know what I’m doing”, and hope for someone to come along and give you what you want. But the data is out there already, every minute of every day, and even if the value you get from it at the beginning is very minor, you’ll be able to give your clients something more than they’ll get from every other research agency that’s resting on its laurels.
Go out there.  Try something.  If it fails, try it again, in a different way.  Get help when you need it, but don’t let the world gradually pass you by until one day you wake up and realize you’re irrelevant.

Original Article