Troubleshoot Analytics Database Issues
For supported software information, see Supported Software Information, below.
For both the DSE and Fusion Analytics platforms, the Analytics database for Analytics clusters is implemented in Apache Cassandra. For the DSE Analytics platform, the search engine is also implemented in Cassandra. Cassandra may switch to a Down state on a node when it encounters certain issues, such as a full disk, reachability problems between cluster nodes, or transient errors. This article describes how to troubleshoot a node when the database has switched to a Down state.
To troubleshoot an Analytics node when the database has switched to a Down state:
- Verify that the database is down.
- Identify and resolve disk full conditions.
- Identify and resolve reachability issues.
- Restart the database.
Verify that the Database Is Down
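To verify that the database is down, issue the nodetool status command from the shell. The first two characters of each node's line show its status and state, as described in the legend. The first character is the status (U for Up, D for Down), and the second is the state (N for Normal, L for Leaving, J for Joining, M for Moving). D indicates that the database is down. In the example below, UN indicates that the node is Up and in the Normal state.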
admin@Analytics$ nodetool status
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 127.0.0.1 9.16 GB 100.0% c155ed9e-b192-42df-bbc2-9c89f09cb6ad 0 RAC1
...
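If you check node status from a script, a small helper can extract the status code for a given node. The following is a minimal sketch, not a Versa-provided tool: the function name node_status_code is hypothetical, and it assumes that node lines have the layout shown above, with the two-letter code in the first column and the address in the second.

```shell
# node_status_code ADDRESS  (hypothetical helper, not part of the product)
# Reads `nodetool status` (or `vsh dbstatus`) output on stdin and prints the
# two-letter status/state code (for example, UN or DN) for ADDRESS.
# Returns 1 if ADDRESS does not appear in the output.
node_status_code() {
  awk -v addr="$1" '
    # Node lines start with a code such as UN or DN; the address is field 2.
    $1 ~ /^[UD][NLJM]$/ && $2 == addr { print $1; found = 1 }
    END { exit !found }
  '
}

# Example: report whether the local node's database is down.
#   code=$(nodetool status | node_status_code 127.0.0.1)
#   case "$code" in
#     D*) echo "Database is Down ($code)" ;;
#     U*) echo "Database is Up ($code)" ;;
#   esac
```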
You can view the /var/log/cassandra/system.log file to check for database errors.
admin@Analytics$ cat /var/log/cassandra/system.log
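Because system.log can be large, it is often quicker to filter it than to read it with cat. The following sketch assumes that log entries carry the usual Cassandra level keywords such as ERROR and WARN; the function name recent_db_errors is hypothetical.

```shell
# recent_db_errors [LOG_FILE] [COUNT]  (hypothetical helper)
# Prints the last COUNT lines (default 50) of the Cassandra system log that
# contain ERROR or WARN. LOG_FILE defaults to the path given above.
recent_db_errors() {
  local log_file="${1:-/var/log/cassandra/system.log}"
  local count="${2:-50}"
  grep -E 'ERROR|WARN' "$log_file" | tail -n "$count"
}

# Example: show the 20 most recent errors and warnings.
#   recent_db_errors /var/log/cassandra/system.log 20
```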
Identify Disk Full Conditions
The database's data is stored on the root filesystem, in subdirectories of the /var/lib/cassandra directory. This section describes how to determine root filesystem disk usage and the proportion of the root filesystem that the database uses.
To display the amount of disk used by the root filesystem, issue the df -kh / command. For example:
admin@Analytics$ df -kh /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/system-root 71G 31G 37G 46% /
In this example, the root filesystem is about 71 GB in size, of which 31 GB (46 percent) is used and 37 GB is available. If the output shows that 70 percent or more of the filesystem is in use, the database has likely failed during compaction. Compaction is a required operation that is performed periodically to optimize database tables, and it temporarily increases disk usage by a large percentage, sometimes 40 percent or more. To leave room for compaction, reduce the size of the database or increase disk storage.
Note: Keep disk usage at or below 60 percent to leave space for the database compaction process. For more information, see Troubleshoot Cassandra Database Filling Up Disk in Troubleshoot Analytics Disk Storage Issues.
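The thresholds above can be checked with a short script. This is a sketch under stated assumptions: root_use_percent and disk_usage_verdict are hypothetical names, and the script reads df -P (POSIX output format) rather than df -kh so that each filesystem is reported on a single line with the usage percentage in the fifth column.

```shell
# root_use_percent  (hypothetical helper)
# Reads `df -P` output on stdin and prints the usage percentage of the
# last filesystem line as a bare integer (for example, 46).
root_use_percent() {
  awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# disk_usage_verdict PERCENT  (hypothetical helper)
# Classifies usage against the 60 and 70 percent thresholds described above.
disk_usage_verdict() {
  local pct="$1"
  if [ "$pct" -ge 70 ]; then
    echo "critical: ${pct}% used; the database likely failed during compaction"
  elif [ "$pct" -gt 60 ]; then
    echo "warning: ${pct}% used; above the recommended 60%"
  else
    echo "ok: ${pct}% used"
  fi
}

# Example:
#   disk_usage_verdict "$(df -P / | root_use_percent)"
```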
To display the amount of disk used by the Analytics database, issue the vsh dbstatus command (see the Load column) or the du -sh /var/lib/cassandra/data command. For example:
admin@SDWAN-Versa-Analytics$ vsh dbstatus
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 127.0.0.1 9.22 GB 100.0% c155ed9e-b192-42df-bbc2-9c89f09cb6ad 0 RAC1
admin@SDWAN-Versa-Analytics$ du -sh /var/lib/cassandra/data
9.3G /var/lib/cassandra/data
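These commands report the database size but not its share of the root filesystem. As a worked example using the figures above, 9.3 GB out of about 71 GB is roughly 13 percent. The following sketch computes the same ratio. The name db_share_percent is hypothetical, both inputs must be in the same unit (here, 1 KB blocks from du -sk and df -Pk), and reading /var/lib/cassandra/data may require sudo.

```shell
# db_share_percent DB_KB FS_KB  (hypothetical helper)
# Prints the database size as a whole-number percentage of the root
# filesystem size. Integer division rounds the result down.
db_share_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Example: both sizes are in 1 KB blocks, so the units cancel.
#   db_kb=$(du -sk /var/lib/cassandra/data | awk '{ print $1 }')
#   fs_kb=$(df -Pk / | awk 'NR == 2 { print $2 }')
#   db_share_percent "$db_kb" "$fs_kb"
```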
To reduce the amount of disk used by the database, see Troubleshoot Cassandra Database Filling Up Disk in Troubleshoot Analytics Disk Storage Issues.
After you have reduced the percentage of disk used by the database, you can restart it, as described in Restart the Cassandra Database, below.
Troubleshoot Reachability Issues
To identify whether there are reachability issues between the nodes in the cluster, ping the listen address of each node and confirm that all required ports are open between the nodes. Listen addresses are used for internal communication between cluster nodes and are configured in the clustersetup.conf file during initial software configuration. For more information, see Set Up Analytics in Perform Initial Software Configuration. For a list of ports that must be open, see Firewall Requirements.
To verify that a port is reachable between nodes, issue the nc -zvw3 peer-node-listen-address port-number command. The following example shows that, from the current node, port 8983 at IP address 192.10.10.45 is reachable but port 9042 is not:
admin@Analytics$ nc -zvw3 192.10.10.45 8983
Connection to 192.10.10.45 8983 port [tcp/*] succeeded!
admin@Analytics$ nc -zvw3 192.10.10.45 9042
nc: connect to 192.10.10.45 port 9042 (tcp) failed: Connection refused
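To check several ports on one peer in a single pass, you can wrap the nc command in a loop. This is a sketch, not part of the product: check_peer_ports is a hypothetical name, and the ports to test should come from the Firewall Requirements list.

```shell
# check_peer_ports ADDRESS PORT...  (hypothetical helper)
# Tests each TCP PORT on a peer's listen ADDRESS with nc, using -z (scan
# without sending data) and -w3 (3-second timeout), and prints one line per
# port. Returns non-zero if any port is unreachable or nc is unavailable.
check_peer_ports() {
  local addr="$1" port status=0
  shift
  for port in "$@"; do
    if nc -z -w3 "$addr" "$port" >/dev/null 2>&1; then
      echo "$addr:$port reachable"
    else
      echo "$addr:$port unreachable"
      status=1
    fi
  done
  return "$status"
}

# Example, using the address and ports from the example above:
#   check_peer_ports 192.10.10.45 8983 9042
```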
Resolve any port connection problems, and then restart the database using the procedure in Restart the Cassandra Database, below.
Troubleshoot Transient Errors
If the disk is not full and there are no reachability issues, the database might have a transient error. In this case, restart the database as described below.
Restart the Cassandra Database
The commands you use to restart the Cassandra database depend on whether the cluster is using the DSE or Fusion Analytics platform. To determine which platform a node is running, issue the dse -v command. If the output shows dse-4.5.x or dse-4.8.x, the node is running DSE. If the command returns an error, the node is running Fusion.
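The platform check can be scripted as follows. This sketch simply encodes the rule just described; detect_platform is a hypothetical name.

```shell
# detect_platform  (hypothetical helper)
# If `dse -v` succeeds, the node runs DSE and the version string is shown;
# if it returns an error (including "command not found"), it runs Fusion.
detect_platform() {
  local version
  if version=$(dse -v 2>/dev/null); then
    echo "DSE $version"
  else
    echo "Fusion"
  fi
}

# Example:
#   case "$(detect_platform)" in
#     DSE*)   echo "Use the DSE restart procedure" ;;
#     Fusion) echo "Use vsh db-restart" ;;
#   esac
```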
To restart the Cassandra database:
- To restart the database in DSE:
admin@Analytics$ sudo service monit stop
admin@Analytics$ sudo service dse stop
admin@Analytics$ sudo service dse start
admin@Analytics$ sudo service monit start
(wait a few minutes)
admin@Analytics$ nodetool status
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 127.0.0.1 9.16 GB 100.0% c155ed9e-b192-42df-bbc2-9c89f09cb6ad 0 RAC1
...
The restart process can take up to 20 minutes. After the database restarts, the status should change to U.
- To restart the database in Fusion:
admin@Analytics$ vsh db-restart
(wait a few minutes)
admin@Analytics$ vsh dbstatus
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 127.0.0.1 9.16 GB 100.0% c155ed9e-b192-42df-bbc2-9c89f09cb6ad 0 RAC1
...
The restart process can take up to 20 minutes. After the database restarts, the status should change to U.
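- If you are unable to restart the database, the database commit log might be corrupted. Try to clear any pending database commits and the saved caches by issuing the following commands. Note that deleting the commit log discards any writes that have not yet been flushed to the data files.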
admin@Analytics$ sudo su
root@Analytics# rm -rf /var/lib/cassandra/commitlog/*
root@Analytics# rm -rf /var/lib/cassandra/saved_caches/*
root@Analytics# exit
admin@Analytics$
- After clearing the commits, restart the database again.
- If you are still unable to restart the database, contact Versa Networks Customer Support. When you open the case, attach the file /var/log/cassandra/system.log.
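Because the restart can take up to 20 minutes, you may prefer to poll the status instead of rerunning the command by hand. The following sketch repeats a status command until the given node's line starts with U. The function name wait_for_up and its arguments are hypothetical, and the polling interval is an arbitrary choice.

```shell
# wait_for_up ADDRESS TIMEOUT_SECONDS INTERVAL_SECONDS STATUS_COMMAND...
# (hypothetical helper) Runs STATUS_COMMAND every INTERVAL_SECONDS until the
# line for ADDRESS starts with U (Up) or TIMEOUT_SECONDS have elapsed.
# Returns 0 when the node is Up and 1 on timeout.
wait_for_up() {
  local addr="$1" timeout="$2" interval="$3"
  shift 3
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Match only the node line for ADDRESS, so other nodes do not count.
    if "$@" 2>/dev/null | awk -v addr="$addr" '
         $1 ~ /^U[NLJM]$/ && $2 == addr { up = 1 }
         END { exit !up }'; then
      echo "$addr is Up after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "$addr is not Up after ${timeout}s" >&2
  return 1
}

# Examples, allowing 20 minutes (1200 s) and polling every 30 s:
#   wait_for_up 127.0.0.1 1200 30 nodetool status    # DSE
#   wait_for_up 127.0.0.1 1200 30 vsh dbstatus       # Fusion
```

If the function times out, continue with the commit log step or contact Customer Support as described above.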
Supported Software Information
Releases 20.2 and later support all content described in this article.