About the BEANS software

The BEANS software is a new web-based tool, easy to install and maintain, for interactive distributed data analysis. It provides a clear interface for querying, filtering, aggregating, and plotting data from an arbitrary number of datasets and tables. Its main purpose is to simplify examining data and finding new relations in it using the powerful Pig Latin language.

Introduction

The amount of data delivered in science is growing faster than ever before. This applies to all fields of research, but physics and astrophysics appear to push the requirements for data storage and analysis to their limits. With existing missions and future projects like LSST or SKA, the need for reliable and scalable data storage and management is even greater.

In order to manage data and, more importantly, to gain new knowledge from it, it is crucial to have a proper toolbox. For many projects the amount of data is too large to process with simple bash or Python scripts. Tools that simplify data analysis are therefore essential in research today.

The BEANS software was initially created for easy data management and analysis of numerical simulations done with the MOCCA code. The MOCCA code is currently one of the most advanced codes for simulations of real-size globular star clusters (Hypki et al. 2013).

One MOCCA simulation produces on average around ten output files that take around 10-20 GB in total, depending on the initial conditions of the star cluster. A set of simulations covering various initial conditions can easily exceed hundreds of runs, and the data storage requirement grows to terabytes. Although this is much less than what large astrophysical surveys need today, it is crucial to have a proper set of tools to easily manage, query, and visualize data from so many datasets.

Although the BEANS software was created to manage simulations from the MOCCA code, it is written in a very general form. It can be used in any field of research (economics, physics, biology, etc.) or in other open source projects. The only requirement is that the data be in tabular form (FITS and HDF5 files are not currently supported).

Underlying technologies

This section presents the main technologies used in the BEANS software. Their value has already been well proven in industry (Google, Yahoo, Netflix, eBay, Twitter, Reddit, and many more). In my opinion, they are without doubt equally powerful for science.

The BEANS software works in a server/thin-client mode: all calculations are performed on a server or, preferably, on a cluster of servers. A thin client is either a browser or a command line interface, which is used to specify queries and plots and to import or export data. The whole analysis is performed on the server side, so a client can be any device equipped with a web browser (PC, tablet, phone, etc.).

The database used by the BEANS software is Apache Cassandra. It is a decentralized, automatically replicated, linearly scalable, fault-tolerant database with tunable consistency, written in Java. Unlike well-known relational databases such as MySQL or Microsoft SQL Server, it was designed from scratch for distributed environments. Apache Cassandra therefore solves many problems that relational databases struggle with (e.g. data replication), and one person can easily manage a medium-sized cluster of nodes.

Elasticsearch is the second major component of the BEANS software. It is a powerful open source search engine. The BEANS software uses it for full-text search of dataset, table, and plot metadata.

Elasticsearch and Apache Cassandra are examples of emerging NoSQL technologies. NoSQL in this context means that these databases do not demand a fully specified data schema, as relational databases do. Their data model is more flexible and can be altered much more easily in the future. This is ideal for the BEANS software: instead of defining numerous tables with different columns, one can have a single table that stores any type of data. This simplifies development and allows the BEANS software to store an arbitrary number of different tables.
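The single-table idea can be illustrated with a minimal in-memory sketch in Python. It is not the actual BEANS or Cassandra schema: the class GenericStore, its methods, and the sample dataset, table, and column names are all hypothetical and serve only to show how rows with different columns can share one container.

```python
from collections import defaultdict


class GenericStore:
    """Hypothetical stand-in for a single schema-free table.

    Each row is a free-form mapping of column name to value, grouped
    under a (dataset, table) key, so no per-table schema is declared.
    """

    def __init__(self):
        self._rows = defaultdict(list)

    def insert(self, dataset, table, **columns):
        # Any set of columns is accepted; nothing is validated up front.
        self._rows[(dataset, table)].append(dict(columns))

    def columns(self, dataset, table):
        # The column set is discovered from the stored rows themselves.
        rows = self._rows.get((dataset, table), [])
        return sorted({name for row in rows for name in row})

    def select(self, dataset, table, column):
        # Rows that lack the requested column are simply skipped.
        rows = self._rows.get((dataset, table), [])
        return [row[column] for row in rows if column in row]


store = GenericStore()
# Two logical tables with unrelated columns live in the same store.
store.insert("sim-001", "system", time=0.0, total_mass=1.0e5)
store.insert("sim-001", "system", time=10.0, total_mass=9.8e4)
store.insert("sim-001", "binaries", time=10.0, period=3.2, eccentricity=0.4)

masses = store.select("sim-001", "system", "total_mass")
binary_columns = store.columns("sim-001", "binaries")
```

The trade-off in such a design is that column names and types are only known from the data, so any validation has to happen in the application rather than in the database.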

One of the greatest advantages of using Apache Cassandra is its integration with the MapReduce paradigm (Dean et al.). MapReduce was designed to help process data in a distributed manner, and it consists of two stages. In the first stage, called map, data is read and transformed into new pieces of information. In the second stage, called reduce, the data produced in the first stage can be aggregated (sum, average, count, etc.). This approach is well suited to problems that are easy to parallelize; it scales linearly and works on commodity hardware. This last feature is very important because it makes it possible to build powerful computer clusters relatively cheaply.
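The two stages can be sketched in a few lines of single-process Python. This is only an illustration of the paradigm, not how BEANS, Cassandra, or Hadoop implement it: the function names, the grouping step written out by hand, and the sample records are assumptions made for the example. In a real framework the grouping and the distribution of work across machines are handled by the system.

```python
from collections import defaultdict


def map_stage(records):
    """Map: read each input record and emit a (key, value) pair."""
    for simulation_id, mass in records:
        yield simulation_id, mass


def group_by_key(pairs):
    """Collect emitted values per key (a framework does this between stages)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups):
    """Reduce: aggregate the values of each key into count, sum, and average."""
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "average": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical input: (simulation id, stellar mass) records.
records = [
    ("sim-a", 1.0),
    ("sim-b", 4.0),
    ("sim-a", 3.0),
    ("sim-b", 6.0),
    ("sim-a", 2.0),
]

result = reduce_stage(group_by_key(map_stage(records)))
```

Because each record is mapped independently and each key is reduced independently, both stages can be split across many machines, which is where the linear scalability comes from.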

MapReduce is a low-level solution. To use the power of distributed data analysis more efficiently, many higher-level languages and libraries have been created. These libraries internally transform queries into a sequence of MapReduce jobs. Among the most popular are Apache Pig, Hive, and Cascading, which represent different approaches to the same problem. For example, Hive is very similar in syntax to SQL. For the BEANS software, however, I chose the Apache Pig project, whose scripting language is called Pig Latin. The main reason is that, in my opinion, scripts written in Pig Latin are much more intuitive and clear than those in any other high-level language.

Additionally, the BEANS software is being integrated with a number of plots produced with the D3 library. The D3 library makes it possible to create unusual, very clear, and very powerful interactive plots.