Versa Networks

Troubleshoot Log Processing and Archiving Issues

For supported software information, click here.

Log collector nodes in an Analytics cluster receive, store, and process logs from Versa Operating SystemTM (VOSTM) branch and Controller devices. Log processing can be compromised when log files are processed too slowly or not at all, when log files are not archived, or when a node is unintentionally configured to drop logs for a specific feature. The node may accumulate a large volume of unprocessed log files, referred to as a backlog. Backlogs can overfill disk resources and cause the node to slow down, drop logs, or fail. The node may also drop logs based on the settings for log limits and log retention.

This article describes how to identify and resolve many backlog, log limit, data retention, and log archive issues.

Backlogs

The term backlog refers to log files that accumulate in the subdirectories in the /var/tmp/log/tenant-tenant-name directories on an Analytics node. These directories contain raw log files that are created by the log collector exporter (LCE) program. The Versa Analytics driver reads these files, identifies and prepares the data, and transfers it to the search engine and database nodes in the Analytics cluster. After log processing is complete, the Analytics driver moves the file to a backup directory. If this operation fails for any reason, the unprocessed files create a backlog.

A backlog can accumulate on an Analytics node for the following reasons:

  • Insufficient resources are allocated to an Analytics node, which slows down the operation of the Analytics driver.
  • The Versa Analytics driver is stuck or down.
  • Too many active LEF sessions are pinned to one log collector node, so the bulk of incoming logs are sent to the same node.
  • Incoming log volume is much higher than the rate at which the Analytics driver can process the logs.

Log Limits

You can set daily log limits on an Analytics cluster, for the whole cluster or by tenant. The log count period begins at 00:00 UTC and lasts 24 hours. If log limits are exceeded in the period, the cluster generates an alarm, and unprocessed and newly received logs are not stored in the search engine datastore. Normal log processing resumes when the next 24-hour period begins. By default, no log limits are set.
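Because the log count period is anchored to 00:00 UTC, it can be useful to know exactly when a tripped limit resets and normal processing resumes. A minimal sketch (the function name is illustrative, not part of the product):

```python
from datetime import datetime, timedelta, timezone

# Illustrative helper: compute the next 00:00 UTC boundary, which is when
# the daily log-count period resets and normal log processing resumes.
def next_reset(now=None):
    """Return the datetime (UTC) at which the current 24-hour period ends."""
    now = now or datetime.now(timezone.utc)
    tomorrow = (now + timedelta(days=1)).date()
    return datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=timezone.utc)
```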

Data Retention for Features

The Analytics cluster contains data configuration settings that determine whether logs for specific features or services are stored in the Analytics database and search engine datastores. If the data configuration is set to Off for a specific feature or service, any logs received for these items are dropped by the Analytics driver, and the charts and tables for these features have no data to display.

Troubleshoot Backlog Issues

To troubleshoot a backlog issue:

  1. Log in to the shell on the Analytics node. The default login name is "admin" and the password is "versa123". Backlog issues can slow shell response, and so you might notice a lag during the login process.
  2. For virtual machines (VMs) that are configured as log collector nodes, the underlying issue might be that not enough resources are allocated to process the incoming load. In this case, ensure the VM is configured as follows:
    • Verify that enough cores and memory are assigned to the VM, according to the requirements in Hardware and Software Requirements for Headend.
    • Ensure that hyperthreading is disabled on the host server on which the VM is activated.
    • The cores assigned to VMs should be dedicated (1:1), and there should be no oversubscription of cores on the server. For example, if the server has 32 cores and three VMs are each assigned 16 cores, the VMs require 48 cores in total, so cores are shared between VMs instead of being dedicated.
    • The disk should be SSD. Using HDD disks can result in slow database read/write issues.
    • The disk controller setting on the host server should be SCSI, not IDE.
  3. Display the contents of the Analytics driver log file. Compare the values of the logs and time fields. Typically, the Analytics driver clears out 3,000 logs in under 1 second. If you see larger times, for example, if 3,000 logs take 10 seconds to process, this can be a symptom of a backlog.
admin@Analytics$ tail /var/log/versa/versa-van-driver.log
2023-09-20 10:14:34,335 - vandriver - INFO - [Tenant1] Files processed [files=0, filesErrors=0, logs=0, time=0.001 sec, logsRate=0/sec]
2023-09-20 10:14:34,335 - vandriver - INFO - Cleaning cache [3 vsn]
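The rate check above can be scripted rather than eyeballed. The following sketch (a hypothetical helper, not a Versa tool) parses "Files processed" entries of the form shown in the example output and flags batches that fall below an expected rate of roughly 3,000 logs per second:

```python
import re

# Matches lines such as:
# ... [Tenant1] Files processed [files=0, filesErrors=0, logs=0, time=0.001 sec, logsRate=0/sec]
LINE_RE = re.compile(
    r"\[(?P<tenant>[^\]]+)\] Files processed \[.*?logs=(?P<logs>\d+), time=(?P<time>[\d.]+) sec"
)

def slow_batches(lines, min_rate=3000.0):
    """Return (tenant, logs, seconds) for batches processed below min_rate logs/sec."""
    slow = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        logs = int(m.group("logs"))
        secs = float(m.group("time"))
        # Skip empty batches (logs=0); they carry no rate information.
        if logs and secs and logs / secs < min_rate:
            slow.append((m.group("tenant"), logs, secs))
    return slow
```

Feed it the output of `tail -n 1000 /var/log/versa/versa-van-driver.log`; any entries it returns are candidates for further backlog investigation.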
  4. Switch privileges to user root, and then identify the amount of disk storage used by each tenant subdirectory in the /var/tmp/log directory. If the values displayed are close to or greater than 10 gigabytes, this indicates a significant backlog.
admin@Analytics$ sudo su root
root@Analytics# cd /var/tmp/log/*
root@Analytics# du -sBM *               
1M templateV0.txt
2M tenant-provider-org
2M tenant-Tenant1
2M tenant-Tenant10
1M tenant-Tenant11
1M tenant-Tenant12
2M tenant-Tenant2
2M tenant-Tenant3
1M tenant-Tenant4
2M tenant-Tenant5
1M tenant-Tenant6
1M tenant-Tenant7
2M tenant-Tenant8
2M tenant-Tenant9
  5. Identify directories with log accumulation for a specific branch by issuing the du -sBM command for the tenant subdirectories. The following example output shows the logs for Tenant5. If any branch subdirectories use more than 200 MB of storage, this indicates a significant backlog. If the same branch displays large usage for every tenant, this indicates that the branch is exporting a large volume of logs.
root@Analytics# du -sBM tenant-Tenant5/*
1M tenant-Tenant5/backup
1M tenant-Tenant5/VSN0-SDWAN-Branch1
1M tenant-Tenant5/VSN0-SDWAN-Branch2
1M tenant-Tenant5/VSN0-SDWAN-Branch4
1M tenant-Tenant5/VSN0-SDWAN-Branch5
1M tenant-Tenant5/VSN0-SDWAN-Controller1
1M tenant-Tenant5/VSN0-SDWAN-Controller2
1M tenant-Tenant5/VSN0-versa
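Across many tenants, the per-branch check can be scripted instead of running du by hand. This sketch (illustrative only; the function names are not Versa utilities) walks a tenant directory and reports subdirectories over a size threshold:

```python
import os

def dir_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may be moved or deleted while scanning
    return total

def flag_large_subdirs(tenant_dir, threshold_mb=200):
    """Return [(subdir, size_mb)] for branch subdirectories over threshold_mb."""
    large = []
    for entry in sorted(os.listdir(tenant_dir)):
        path = os.path.join(tenant_dir, entry)
        if os.path.isdir(path):
            size_mb = dir_size(path) / (1024 * 1024)
            if size_mb > threshold_mb:
                large.append((entry, round(size_mb, 1)))
    return large

# Example (hypothetical path):
# flag_large_subdirs("/var/tmp/log/tenant-Tenant5")
```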
  6. If the backup folder uses a large amount of space, log archiving might be failing. The log archive cron job should archive files to the /var/tmp/archive directory once an hour, so check the backup directory for log files older than 1 hour. For more information, see Troubleshoot Log Archive Issues, below.
root@Analytics# ls -lrth tenant-Tenant5/backup
  7. If the nonbackup directories for the tenant use a large amount of disk space, list the timestamps on the log files for the branch devices (Branch1 and Branch2 in the following example output).

root@Analytics# ls -lrth tenant-Tenant5/VSN0-SDWAN-Branch1 | less
total 4
-rw-r--r-- 1 root root 1409 Sep 19 13:52 20230919T135233.txt.tmp
root@Analytics# ls -lrth tenant-Tenant5/VSN0-SDWAN-Branch2 | less
total 8
-rw-r--r-- 1 root root 7117 Sep 19 13:52 20230919T134733.txt.tmp
root@Analytics# exit
admin@Analytics$ 

If the timestamp on the most recent files indicates that they are several hours or several days old, this indicates that log files are not being processed by the Analytics driver. Restart the driver.

admin@Analytics$ sudo service versa-analytics-driver restart
versa-analytics-driver stop/waiting
versa-analytics-driver start/running, process 3
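The timestamp check can likewise be scripted. This sketch (illustrative, not a Versa utility) lists files in a branch directory whose modification time exceeds a maximum age, which would suggest the driver has stopped processing them:

```python
import os
import time

def stale_files(branch_dir, max_age_hours=2):
    """Return names of files in branch_dir older than max_age_hours."""
    cutoff = time.time() - max_age_hours * 3600
    stale = []
    for name in sorted(os.listdir(branch_dir)):
        path = os.path.join(branch_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            stale.append(name)
    return stale

# Example (hypothetical path):
# stale_files("/var/tmp/log/tenant-Tenant5/VSN0-SDWAN-Branch1", max_age_hours=2)
```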
  8. Check the number of active LEF connections for all nodes in the cluster. Log in to each node, and then issue the following commands. The Clients Active field displays the number of LEF connections for the node.
admin@Analytics$ telnet 0 9100
LCED-DBG> show lced stats
Local Collectors
===========================
                 Local Collector : collector1(0)
                  Clients Active : 124
               Clients Connected : 124
     Log Template Received Count : 1219
         Log Data Received Count : 8175878
...
LCED-DBG> exit
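If you capture the show lced stats output from each node, a small parser can tabulate the per-collector connection counts for comparison across the cluster. A sketch, with parsing assumptions based on the example output above:

```python
import re

def active_clients(stats_text):
    """Return {collector_name: active_count} parsed from a captured stats dump."""
    counts = {}
    collector = None
    for line in stats_text.splitlines():
        m = re.search(r"Local Collector\s*:\s*(\S+)", line)
        if m:
            collector = m.group(1)
            continue
        m = re.search(r"Clients Active\s*:\s*(\d+)", line)
        if m and collector:
            counts[collector] = int(m.group(1))
    return counts
```

Run this against the saved output from each node; the node with a disproportionately high count is the one carrying the bulk of the LEF sessions.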
  9. If the current node is more heavily loaded with LEF connections than the other cluster nodes, restart the LCE service on the current node to redistribute the connections. Note that to load-balance LEF connections, you must configure the VOS devices to export logs to the ADC on a Controller node. If VOS devices export logs directly to the IP address of the log collector node, restarting the LCE service has no effect on the distribution of LEF connections.
admin@Analytics$ sudo service versa-lced restart
  10. You can wait for the Analytics driver to process the backlog, or you can delete the backlog manually. It can take hours for the Analytics driver to process a backlog; for backlogs of 100 GB or more, it commonly takes 48 to 72 hours, provided that there are no issues with the CPU resources available to the Versa Analytics driver.
    • To assist the Analytics driver in processing the backlog, continue with the next step in this procedure.
    • To manually delete the backlog, do not continue with this procedure. Instead, follow the procedure in Delete a Backlog, below.
  11. To assist in processing the backlog, temporarily reduce the maximum number of LEF connections to the local collector on the node. This allows the Analytics driver to catch up on processing the backlog. You can set the maximum number of connections to 0 to stop all incoming LEF connections.

    In the following example, the local collector is collector1 and the maximum number of connections is lowered to 100. (The default value is 512.)
admin@Analytics$ cli
admin@Analytics> configure
admin@Analytics% set log-collector-exporter local collectors collector1 max-connections 100
admin@Analytics% commit
admin@Analytics% exit

When the backlog clears, ensure that you restore the max-connections setting to its former value:

admin@Analytics% set log-collector-exporter local collectors collector1 max-connections 512
  12. To help prevent a log backlog in the future, check for flow log overload issues. See Troubleshoot Flow Log Overload Issues, below.

Delete a Backlog

Instead of waiting for the Versa Analytics driver to process a backlog, you can remove the backlog files manually. In normal operations, processed log files are moved to backup directories, where they are later archived automatically. Deleting the files bypasses this process.

Warning: If you delete backlog files manually, the logs in the deleted files are not automatically archived or incorporated into the database or search engine datastore in the cluster.

To delete a backlog:

  1. Log in to the shell on the Analytics node containing the backlog.
  2. Shut down the Analytics driver. This ensures that the Analytics driver does not attempt to modify the log files in the /var/tmp/log directory while you are deleting them.
    admin@Analytics$ vsh stop
  3. Issue the find command with the delete option. You can issue variations on the command depending on which backlog files you want to delete.

The following example deletes all the backlogs in the /var/tmp/log directory, for all organizations and VOS devices, for the year 2021. The find command typically takes several minutes to complete.

admin@Analytics$ sudo su root
root@Analytics# find /var/tmp/log/tenant-*/VSN0-* -name "2021*.txt" -delete
root@Analytics# exit
admin@Analytics$

The following example deletes the backlogs for organization Tenant1 for VOS device Branch1.

admin@Analytics$ sudo su root
root@Analytics# find /var/tmp/log/tenant-Tenant1/VSN0-Branch1 -name "2021*.txt" -delete
root@Analytics# exit
admin@Analytics$
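Because the -delete option is irreversible, it can be worth previewing what a pattern matches first. This sketch (illustrative only) counts the matching files and their total size as a dry run; the example pattern mirrors the find commands above:

```python
import glob
import os

def preview_backlog(pattern):
    """Return (file_count, total_size_mb) for files matching a glob pattern."""
    files = [f for f in glob.glob(pattern) if os.path.isfile(f)]
    total = sum(os.path.getsize(f) for f in files)
    return len(files), round(total / (1024 * 1024), 1)

# Example (hypothetical paths, matching the find command above):
# preview_backlog("/var/tmp/log/tenant-*/VSN0-*/2021*.txt")
```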
  4. Restart the Analytics driver.
admin@Analytics$ vsh start

Troubleshoot Flow Log Overload Issues

When branch devices are configured to export flow logs, access policy logs, or application statistics, large volumes of logs are forwarded to Analytics nodes. In this case, even normal branch traffic can cause a backlog. To reduce the volume of these logs, follow the flow log collection recommendations in Versa Analytics Scaling Recommendations.

Troubleshoot Issues with Exceeding Log Limits

When an Analytics cluster exceeds a daily log limit, globally or for a specific tenant, only critical search logs are ingested into the search engine datastore until the 24-hour time period resets at 00:00 UTC. If you increase the limit, the ingestion of non-critical logs resumes until the new limit is exceeded. You can configure log limits from the Analytics > Administration > Configurations > Settings > Data Configurations screen. See Configure Search Engine Log Storage Limits in Versa Analytics Scaling Recommendations.

When a daily log limit is exceeded, the Analytics cluster generates an alarm log. You can view the alarms at the Analytics > Administration > System Status > Alarms screen. Look for alarms containing "global daily limit" or "tenant limit" in the Description field.

Troubleshoot No Reports for Some Features

If charts or tables for a specific feature or service display no data even though branch devices are forwarding logs to the Analytics cluster, check the data configuration settings for the feature or service. When the data configuration value is set to Off, no data for the feature is stored in the database or search engine.

To verify that data for a specific feature or service does not display from the Director GUI:

  1. In Director view, select the Analytics tab.
  2. Select an Analytics cluster node. For Releases 22.1.1 and later, hover over the Analytics tab and then select a node. For Releases 21.2 and earlier, select a node in the horizontal menu bar. You can administer all cluster nodes from any node in the cluster.
  3. Look up a chart or table for the feature in the Analytics dashboards or log screens.

    In the following example, the user selects Dashboard > SD-WAN and then selects VOS device Bangalore-ECT-DC-Active from the second drop-down menu. The device details dashboard for the site displays. The user selects the SLA Metrics tab and notes that the charts contain no data.

    Empty_SLA_Metrics_Chart.png

To verify and change data configuration settings for a feature or service from the Director GUI:

  1. In Director view, select the Analytics tab.
  2. Select an Analytics cluster node. For Releases 22.1.1 and later, hover over the Analytics tab and then select a node. For Releases 21.2 and earlier, select a node in the horizontal menu bar.
  3. Select Administration > Configuration > Settings, and then select the Data Configurations tab.
  4. If the feature or service data is stored in the search engine, click Search Data Configurations to expand the screen and view the options. If it is stored in the database, click Analytics Data Configurations. For information about which data configuration parameter maps to which feature, see Apply Log Export Functionality.
  5. In the Active column, ensure that the feature is set to On so that logs are forwarded for storage in the database or search engine.

    Data_Configuration_Settings_On.png

Troubleshoot Log Archive Issues

The Versa Analytics driver processes incoming log files in subdirectories of the /var/tmp/log directory, after which it moves them to various backup directories. The log archive cron job periodically archives the files in the backup directories, compressing and moving them to the /var/tmp/archive directory. If the cron job is absent, misconfigured, or encounters an error, the files are never archived.

After the files are archived, there is no built-in mechanism to automatically delete the files in the /var/tmp/archive directory. If allowed to accumulate over long periods, archive files can occupy too much disk space and slow down normal operations on a log collector node. You can simply delete the files. However, some enterprises must retain log archives for auditing purposes. To retain archives, periodically move them to more permanent storage, such as an external server or backup medium.
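The periodic move to permanent storage can be scripted. This sketch (illustrative; the destination mount point is an assumption, substitute your own backup location) relocates archive files older than a retention period while preserving the directory layout:

```python
import os
import shutil
import time

def rotate_archives(src="/var/tmp/archive", dst="/mnt/backup/archive", retention_days=30):
    """Move files older than retention_days from src to dst, keeping subdirectories.

    The dst path is a hypothetical backup mount point, not a Versa default.
    Returns the relative paths of the files that were moved.
    """
    cutoff = time.time() - retention_days * 86400
    moved = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                rel = os.path.relpath(path, src)
                target = os.path.join(dst, rel)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.move(path, target)
                moved.append(rel)
    return moved
```

Running such a script from its own cron entry keeps /var/tmp/archive bounded while satisfying retention requirements.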

This section describes how to identify and resolve log archive issues.

Troubleshoot Log Archive Cron Job Issues

Backup directories retain logs for at most an hour before the log archive cron job moves them to the /var/tmp/archive directory. If the cron job is not running, or if there is an issue with archiving, files can accumulate in the backup directories.

To verify and troubleshoot the log archive cron job:

  1. Check that the log archive cron job has been configured.
admin@Analytics$ sudo cat /etc/cron.d/log-archive
# Every @hourly run the log  archive script
0 *   * * *   root python /opt/versa/scripts/van-scripts/log-archive.py --src /var/tmp/log --dst /var/tmp/archive --dir backup --job-name logarchive 

If the command returns "No such file or directory", the archive cron job does not exist. Continue with Step 2.

If the command returns the file contents as displayed in the example above, continue with Step 3.

The log-archive file contents should match the example output exactly, unless you have deliberately modified the cron job to run in different time increments or use a different source or destination directory.

  2. If the log-archive cron job file does not exist, create it:
admin@Analytics$ sudo /opt/versa/scripts/van-scripts/log-archive-start /var/tmp/log /var/tmp/archive hourly active log-archive

Log archiving should then begin taking place automatically every hour, starting at the next hour.

  3. The cron job creates a log file named versa-log-archive.log. Display the most recent entries:
admin@Analytics$ tail -20 /var/log/versa/versa-log-archive.log

If the log archive job is running but unable to archive, the log file contains "starting archive" entries every hour but does not contain entries that show branch folders being archived or "finished log-archiving" entries.
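That check can be automated by pairing up entries. This sketch (illustrative; it keys on the "starting archive" and "finished log-archiving" strings quoted above) counts archive runs that started but never finished:

```python
def unfinished_runs(lines):
    """Count 'starting archive' entries not followed by 'finished log-archiving'."""
    pending = 0    # 1 if a run has started and has not yet finished
    unmatched = 0  # completed count of starts that never finished
    for line in lines:
        if "starting archive" in line:
            if pending:
                unmatched += 1  # previous run never logged a finish
            pending = 1
        elif "finished log-archiving" in line:
            pending = 0
    return unmatched + pending
```

A nonzero result over a day's worth of log lines suggests the archive job is starting but failing partway through.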

Delete Log Archives

Note that deleting archive files does not affect the database or search engine datastores.

To delete log archives:

  1. If required, copy the files from the /var/tmp/archive directory and its subdirectories to a backup server or storage medium.
  2. Issue the following commands from the shell:
admin@Analytics$ sudo su
root@Analytics# cd /var/tmp/archive 
root@Analytics# rm -rf *
root@Analytics# exit
admin@Analytics$

Supported Software Information

Releases 20.2 and later support all content described in this article.
