Apache Cassandra: distributed management of large databases
If you need to manage large amounts of data on the order of several terabytes or even petabytes, traditional database systems often reach their limits. In this case, you need special big data applications that are easily scalable, since it’s often difficult to predict the actual volume of data in advance. One of the most popular modern examples of such systems is Cassandra, an open-source solution originally developed at Facebook.
What is Apache Cassandra?
Apache Cassandra is an open-source database management system (DBMS) for very large yet structured databases. Because it scales easily, its data can be distributed across many servers (nodes) in a cluster, which is why Cassandra is not bound to a single server.
Cassandra is a wide-column NoSQL database, often loosely described as column-oriented. In this case, NoSQL means “Not only SQL” and not “no SQL”. When it comes to processing large amounts of data, NoSQL systems can offer significant advantages over typical SQL databases because they are not bound by the restrictions of the relational model and its query language SQL (Structured Query Language). Apache Cassandra has its own query language called Cassandra Query Language (CQL), which is similar to SQL but tailored to the special features of Cassandra.
As a NoSQL database, Cassandra relies on redundancy to ensure high resilience: each piece of data is stored on several nodes. By contrast, replicating data across many servers is often harder to achieve with traditional relational databases.
Cassandra was originally developed by Avinash Lakshman and Prashant Malik at Facebook and was first released in 2008. In 2009, the Apache Software Foundation, one of the most important open source developer communities, included the project as a sub-project in the Apache Incubator. In February 2010, Apache Cassandra graduated to a top-level project in the Apache Software Foundation, where it sits alongside other popular projects such as Apache HTTP Server, the Solr search server, the Kafka messaging platform, and OpenOffice.
Along with the original developers, other big companies such as IBM, Twitter, and Rackspace, one of the largest IT service providers in the United States, contribute to Cassandra. One major contributor to the project is DataStax, a company specializing in subscription-based support, installation assistance, and training courses in the Cassandra database. DataStax contributes 80% of Cassandra’s open-source releases and also offers DataStax Enterprise, a commercial database solution built on the freely available Cassandra system.
According to the DB-Engines Ranking, Apache Cassandra is currently the most popular wide-column database, ranking ahead of big competitors like Microsoft Azure Cosmos DB and Google Cloud Bigtable.
Cassandra: core functions
As a truly distributed system, Cassandra does not use a master. All nodes in a cluster have equal permissions and can process every database request, which avoids a single bottleneck and significantly increases performance. Data is distributed across the nodes. The system can also be easily scaled by simply adding more nodes. After installing Cassandra, all you have to do is distribute the configuration files to the new nodes. Cassandra provides tools for this.
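The way data can be spread over equal nodes without a master can be illustrated with a small hash ring. The following is a minimal sketch, not Cassandra's actual implementation: the class name Ring, the node names, and the number of virtual positions are hypothetical, and MD5 is used only because it ships with Python's standard library.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: a key belongs to the first ring position at or after its token."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Several virtual positions per node spread the load more evenly.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def node_for(self, key: str) -> str:
        # Find the next position clockwise, wrapping around at the end of the ring.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


keys = [f"user-{i}" for i in range(10_000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all of them to the new node")
```

In this sketch, adding a fourth node relocates only the keys that the new node takes over, while every other key stays where it was. That property is what makes adding nodes to a running cluster comparatively cheap.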
Apache Cassandra features a configurable replication system to ensure resilience and recovery of data in the event of a failure. The risk of data loss is minimized because the data is automatically replicated between the nodes. Failed nodes can be easily replaced, and the system remains available for requests while individual nodes are down.
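The effect of replication on resilience can be sketched in a few lines. The example below assumes a simple placement rule (each row is stored on the next few distinct nodes clockwise on a hash ring) and a hypothetical replication factor of 3; the function name replicas_for and the node names are made up for illustration and are not part of Cassandra.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor):
    """Return the first `replication_factor` distinct nodes clockwise from the key."""
    ring = sorted((token(n), n) for n in nodes)  # one position per node
    positions = [t for t, _ in ring]
    start = bisect.bisect_left(positions, token(key))
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
replication_factor = 3  # hypothetical setting for this sketch

placement = {
    f"row-{i}": replicas_for(f"row-{i}", nodes, replication_factor)
    for i in range(1000)
}

failed = {"node-b", "node-d"}  # simulate two nodes going down
readable = [k for k, reps in placement.items() if set(reps) - failed]
print(f"{len(readable)} of {len(placement)} rows still have a live replica")
# prints: 1000 of 1000 rows still have a live replica
```

Because every row lives on three different nodes, losing any two nodes still leaves at least one copy of each row reachable.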
Cassandra prioritizes high availability and partition tolerance. According to the CAP theorem in computer science, a distributed system cannot guarantee consistency, availability, and partition tolerance all at the same time. Consistency, meaning that all nodes see the same data at all times, has the lowest priority in many big data systems. After a failure, consistency can be quickly restored by synchronizing the replicas, whereas the other two properties must be ensured at all times.
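In practice, the trade-off between consistency and availability is a matter of degree: the more replicas that must answer a request, the stronger the consistency and the lower the tolerance for unavailable nodes. The sketch below works through the usual quorum arithmetic for a replication factor of 3. ONE, QUORUM, and ALL are consistency levels that Cassandra offers; the helper functions are hypothetical and only illustrate the calculation.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(write_acks: int, read_replicas: int, replication_factor: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return write_acks + read_replicas > replication_factor


rf = 3  # replication factor assumed for this example
levels = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}

for write_name, w in levels.items():
    for read_name, r in levels.items():
        overlap = read_overlaps_write(w, r, rf)
        print(f"write={write_name:<6} read={read_name:<6} overlap guaranteed: {overlap}")
```

With three replicas, writing and reading at QUORUM (two replicas each) guarantees that a read touches at least one replica holding the latest acknowledged write, while ONE/ONE favors availability and speed and gives no such guarantee.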
Cassandra databases support the MapReduce programming model developed by Google for calculations involving large amounts of data in distributed systems. Cassandra’s own query language, CQL (Cassandra Query Language), is designed especially for the data structures of Cassandra.
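For readers unfamiliar with MapReduce, the model boils down to three steps: map each record to key/value pairs, group the pairs by key, and reduce each group to a result. The following plain-Python sketch shows only the idea on a handful of made-up rows; it does not use Cassandra or any MapReduce framework, and all names in it are hypothetical.

```python
from collections import defaultdict

# Made-up sample rows; in a real job these would be read from the nodes in parallel.
rows = [
    {"user": "alice", "country": "DE"},
    {"user": "bob", "country": "US"},
    {"user": "carol", "country": "DE"},
    {"user": "dave", "country": "IN"},
]


def map_phase(row):
    """Emit one (key, value) pair per row."""
    yield row["country"], 1


def shuffle(pairs):
    """Group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values of one key into a single result."""
    return key, sum(values)


pairs = [pair for row in rows for pair in map_phase(row)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because the map and reduce steps work on independent pieces of data, they can run on many nodes at once, which is what makes the model a good fit for distributed systems.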
What are the benefits of Apache Cassandra?
One of the main advantages of Cassandra is that it provides easy scalability with very high resiliency – two fundamental requirements for big data applications. Cassandra is horizontally scalable, which means you can increase the capacity and performance of the system by adding more nodes. This is the opposite of vertical scaling, where you add more powerful CPUs and larger hard drives to a single database server when you need to increase performance or capacity. Horizontal scaling is the cheaper solution in most cases since you can use commercially available server hardware.
Cassandra’s data model is based on multidimensional hash tables where each row can have any number of columns. Unlike columns in a traditional database table, these columns do not have to be the same in every row. Apache Cassandra has also shown a speed advantage over other NoSQL databases in several benchmark analyses and real-life application scenarios.
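The idea of a multidimensional hash table can be pictured as a dictionary of dictionaries: the outer key identifies the row, and the inner keys are that row's columns. The snippet below is a conceptual sketch only, with hypothetical row keys and column names; it does not show how tables are declared in CQL.

```python
from collections import defaultdict

# Outer hash table: row key -> inner hash table of column name -> value.
table = defaultdict(dict)


def insert(row_key, **columns):
    """Add or overwrite columns for one row; other rows are unaffected."""
    table[row_key].update(columns)


insert("user:1", name="Ada", email="ada@example.org")
insert("user:2", name="Linus", city="Helsinki", last_login="2024-01-01")
insert("user:1", city="London")  # a row can gain columns later

for row_key, columns in table.items():
    print(row_key, sorted(columns))
# prints:
# user:1 ['city', 'email', 'name']
# user:2 ['city', 'last_login', 'name']
```

The two rows end up with different sets of columns, which a fixed relational table would only allow by padding the missing fields with empty values.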
Where is Apache Cassandra used?
One of the main goals in developing Cassandra was to help Facebook users to search their inboxes more easily. The corporate giant used a cluster of over 150 individual nodes to power this feature. It’s no coincidence that Cassandra, which resembles Amazon Dynamo and Google Bigtable in its basic structures, is now very popular with providers of large social networks in which vast amounts of data are shared between users. Along with Twitter, Instagram, and Spotify, other big-name customers include the social bookmarking website Digg and social news aggregator Reddit.
Facebook has now switched from Cassandra to a proprietary solution that combines the HBase database with the HDFS distributed file system, both components of the Apache Hadoop ecosystem.
Many other networks that handle large amounts of data use Cassandra both as a main database and as a secondary component for specific tasks. Examples include eBay, GitHub, Netflix, The Weather Channel, and the Large Hadron Collider at CERN, the European Organization for Nuclear Research (around 30,000 terabytes of data per year). Apple has one of the largest Cassandra installations, with 75,000 nodes.
Getting started with Apache Cassandra
Apache Cassandra runs on UNIX-like systems, preferably Linux servers. A Java Runtime Environment is also required because Cassandra is written in Java. Installation packages are available from the Apache servers as Debian or RPM packages. To install Cassandra, you add the corresponding repository. After installation, you create the usual data, cache, and log directories and configure the cassandra.yaml file.
Cassandra has its own command-line tools for administrative tasks. The most important utility is the Cassandra Query Language shell (cqlsh).
You can use the following command to view a list of all available command-line options; inside the shell, typing HELP lists the available shell commands:
cqlsh --help
DataStax offers OpsCenter, a web-based tool for visual management and monitoring of Cassandra systems.