GSoC/GCI Archive
Google Summer of Code 2015

HPCC Systems®

License: Apache License, 2.0

Web Page: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Ideas+List

Mailing List: http://hpccsystems.com/mailman/listinfo/hpcc-dev

Reed Elsevier is a world-leading provider of information solutions for professionals. It is a global organisation operating in 24 countries, is listed on several of the world’s major stock exchanges, and is a FTSE 100 and FT Global 500 company. We operate across several professional market segments including Scientific, Technical and Medical, Legal and Exhibitions. We also operate in the Risk Solutions and Business Information sector, where LexisNexis Risk Solutions is a leading provider of solutions that combine proprietary, public and third-party information with advanced technology and analytics.

We produce HPCC Systems® (High Performance Computing Cluster), an open source platform for solving ‘Big Data’ problems. It can process, analyse and find links and associations in high volumes of complex data at high speed and with incredible accuracy.

HPCC Systems® has been around for more than a decade, making it the most proven solution of its kind in existence. It is a battle-tested platform for manipulating, transforming, querying and warehousing Big Data. The data refinery and delivery clusters (Thor and Roxie, respectively) do not need specialised hardware to run, and are usually configured identically.

Thor is the data refinery cluster which consumes vast quantities of data. Here the data is transformed, linked and indexed (ETL). It’s a distributed file system with parallel processing power spread across the nodes. Thor clusters can scale from a single node to thousands.

Roxie is the query cluster, which provides separate high-performance online query processing and data warehousing. It is optimised for concurrent query processing and, like Thor, scales from a single node to thousands.

We have created our own programming language to run on these clusters, called ECL (Enterprise Control Language). It is a powerful programming language, ideally suited to the manipulation of Big Data for many reasons. It is a transparent and implicitly parallel language which is non-procedural and dataflow oriented. Its syntax is also modular, reusable and extensible. It combines data representation and algorithm implementation and is easily extended using C++ libraries. In fact, it is compiled into optimized C++.

As well as the clusters, an HPCC System incorporates common middleware components, an external communications layer, client interfaces for end user services and system management, and auxiliary components that support monitoring and facilitate loading and storing of filesystem data from external sources.

We have our own development environment for developing code (ECL IDE) and the interface we provide for the monitoring and management of HPCC Systems® environments is called ECL Watch.

LexisNexis® (a division of Reed Elsevier) consolidated its content management, document enhancement and mining systems onto HPCC Systems®. In doing this, it discovered that the elapsed time taken to perform an enrichment pass of the entire data collection dropped from 6 to 8 weeks to less than 1 day. This change is significant enough that the degree of enrichment can now be extended into capabilities that were previously out of reach.

Also within Reed Elsevier, HPCC Systems® has enhanced SciVal, an analytical solution used by research institutions to get access to facts which support decision making. SciVal users can now customize their own visualizations, generated in just a few seconds from the almost 40 terabytes of data underpinning the tool. Since HPCC Systems® can crunch both huge, predefined requests offline and smaller calculations in real time on customer-generated data slices, users of SciVal can get structured information immediately upon implementation and also have the flexibility to tailor their own view of the data.

It’s not just within Reed Elsevier that we hear of success stories about HPCC Systems®. Sandia National Laboratories found that HPCC Systems® met the challenge of sorting through petabytes of data to find correlations, and that it helped them to generate hypotheses.

The Office of the Medicaid Inspector General (OMIG) of a northeastern US state used HPCC Systems® to help pinpoint fraud, waste and abuse of the system. HPCC Systems® enabled them to identify the hidden relationships between million-dollar condo dwellers and their assets, providers, medical facilities or others providing care to the state’s Medicaid recipients. They were able to sort, link, join and analyse 50 terabytes of public data to find evidence of fraud committed by a group of state Medicaid recipients who were all living in the same high end condominium complex.

Other customers include government agencies, businesses, non-profits, academic institutions and consulting firms such as Comrise, Cognizant and Infosys. Case studies can be found here: http://hpccsystems.com/why-HPCC/case-studies.

Our technology has also been used as a resource to help find missing children. LexisNexis® has partnered with the National Center for Missing & Exploited Children (NCMEC) to develop the ADAM℠ Program (Automated Delivery of Alerts on Missing children), which uses technology to distribute missing child posters to police, news media, schools, businesses, medical centers and other recipients within a specific geographic search area. To date, 142 children have been reunited with their families as a result of the ADAM℠ Program.

Working on the HPCC Systems® project means that you get to contribute to success stories like these. Our developers are passionate about what they do and seek to improve the system so that we can provide a better service to all our customers. We have already achieved a lot but there is much more to be done. Come and join us and make your own contribution.

Projects

  • Expand the HPCC Systems Visualization Framework (Web Based)
    This project focuses on improving and extending the HPCC Systems visualization framework. The minimum set of deliverables is:
      1. A wrapped third-party visualization
      2. A custom-built visualization
      3. Both driven by data on the HPCC Platform
      4. Test cases demonstrating correct behavior and performance
      5. Supporting documentation
  • HPCC Systems - Add statistics to Linear and Logistic Regression Module
    This proposal is to provide automated statistics for the Linear and Logistic Regression modules in the ECL-ML project. The following statistics are added:
      For Linear Regression: 1) T statistic and P value for each beta, 2) Adjusted R-squared, 3) P value for the F statistic.
      For Logistic Regression: 1) Chi-squared, 2) P values, 3) Standard error, 4) Confidence levels for each beta.
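
As a rough illustration of two of the linear regression statistics named in the second project (the T statistic for each beta and adjusted R-squared), here is a minimal Python sketch for a single-predictor ordinary least squares fit. It is not the ECL-ML implementation: the function name `simple_linear_regression_stats` and the sample data are hypothetical, chosen only for this example. P values are left out because they need the Student's t distribution, which Python's standard library does not provide.

```python
import math


def simple_linear_regression_stats(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares and return summary statistics.

    Hypothetical helper for illustration only; not part of ECL-ML.
    """
    n = len(x)
    p = 1  # number of predictors, not counting the intercept

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Least squares estimates of slope (b1) and intercept (b0).
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # Residual and total sums of squares.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)

    r2 = 1 - sse / sst
    # Adjusted R-squared penalises R-squared for the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    # Residual variance estimate with n - p - 1 degrees of freedom.
    sigma2 = sse / (n - p - 1)
    se_b1 = math.sqrt(sigma2 / sxx)
    se_b0 = math.sqrt(sigma2 * (1 / n + mean_x ** 2 / sxx))

    return {
        "b0": b0,
        "b1": b1,
        "r2": r2,
        "adj_r2": adj_r2,
        # T statistic: estimate divided by its standard error.
        "t_b0": b0 / se_b0,
        "t_b1": b1 / se_b1,
    }


if __name__ == "__main__":
    # Made-up sample data, roughly following y = 2x.
    stats = simple_linear_regression_stats([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
    for name, value in stats.items():
        print(f"{name}: {value:.4f}")
```

A large absolute T statistic for a beta suggests that the coefficient differs from zero by much more than its estimated sampling error. Turning it into a P value requires the t distribution with n - p - 1 degrees of freedom.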