adaptive service monitor

 

No two servers are alike, and no two will ever experience the same conditions. For those evolving servers, we have created the Adaptive Service Monitor™ ("asm"), a statistical monitor that collects performance data, analyzes it, and adapts the server to the changing needs of its users.

How it works

asm periodically collects samples from selected services at arbitrary intervals. Each sample is then compared against historical trends, with a focus on significant changes between samples. If the difference between samples is statistically significant (α = 0.05), further analysis is done to determine the cause of the bottleneck. asm analyzes process information, memory utilization, disk I/O, and network throughput, then adapts the server by lessening the burden of the conflicting service. After exploratory tuning is performed, server information is recorded and reanalyzed to determine a success rate, which is used for future decision making. As a result, the server stays healthy during peak hours and constantly retunes itself as Web sites grow.
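The record-and-reanalyze step can be pictured with a small sketch. This is not asm's code: the class name TuningHistory, the action labels, and the rule that a tuning attempt "succeeds" when the load average drops afterwards are all assumptions made for illustration.

```python
from collections import defaultdict


class TuningHistory:
    """Hypothetical bookkeeping for tuning attempts and their outcomes."""

    def __init__(self):
        self._attempts = defaultdict(int)
        self._successes = defaultdict(int)

    def record(self, action, load_before, load_after):
        # Assumption: an attempt counts as a success if load dropped.
        self._attempts[action] += 1
        if load_after < load_before:
            self._successes[action] += 1

    def success_rate(self, action):
        attempts = self._attempts.get(action, 0)
        if attempts == 0:
            return 0.0  # untried actions have no track record yet
        return self._successes.get(action, 0) / attempts

    def best_action(self, candidates):
        # Prefer the candidate with the best record so far.
        return max(candidates, key=self.success_rate)


history = TuningHistory()
history.record("shrink_mysql_cache", load_before=8.0, load_after=2.0)
history.record("shrink_mysql_cache", load_before=8.0, load_after=9.0)
history.record("toggle_keepalive", load_before=6.0, load_after=3.0)
print(history.best_action(["shrink_mysql_cache", "toggle_keepalive"]))
```

A real monitor would likely weigh how much the load changed and how recent each attempt was; the simple ratio here only shows how past outcomes can feed the next decision.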

An example

To better understand how asm operates, let us step through a brief example: a guided tour of the basic process in asm, given two variables.

  1. Consider a MySQL server and a Web server running in tandem on 2 GB of RAM. MySQL is consuming 1 GB of RAM and the Web server the remaining 1 GB. Any memory demand beyond those 2 GB is paged to disk through a swap file. Paging reduces server performance by shifting memory from the faster RAM to the slower hard disk.

  2. Under an extraordinary circumstance, a site suffers from the "Reddit Hug", meaning that it receives an influx of page requests beyond normal tolerance levels. The Web server scales up the number of concurrent connections in an attempt to cope with this drastic change.

  3. An average system would buckle under the pressure and become progressively slower during this timeframe. asm instead analyzes the server loads and discovers a spike in load averages. After exploratory research, it determines that the Web server has grown in memory utilization from 1 GB to 1.5 GB, with 0.5 GB of memory being paged to disk, significantly reducing performance.

  4. asm analyzes other services and notes that MySQL's query and table caches are underutilized: a chunk of memory is allocated to MySQL's buffers for caching that is not occurring and is unlikely to occur in the near future in a normal environment.

  5. MySQL is retuned to reduce the caching buffers, thereby freeing up memory to be used by other applications, such as the Web server. The Web server now has an extra 512 MB of RAM available, which it uses to eliminate paging by shifting memory processing back to the much faster RAM. Load averages subside to normal tolerance levels, and asm records the response for future decision making.
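The arithmetic in the steps above can be written out as a short sketch. The function names paged_mb and rebalance are hypothetical, and the model is deliberately simplified: it ignores the kernel's own memory use and treats every megabyte beyond physical RAM as paged to disk.

```python
def paged_mb(total_ram_mb, usage_mb):
    """Memory demand that does not fit in RAM and must be paged (MB)."""
    return max(0, sum(usage_mb.values()) - total_ram_mb)


def rebalance(total_ram_mb, usage_mb, donor, reclaimable_mb):
    """Shrink the donor service by just enough to cover the paging deficit.

    reclaimable_mb is an assumed figure: how much of the donor's cache
    is sitting idle and can be given back.
    """
    deficit = paged_mb(total_ram_mb, usage_mb)
    freed = min(deficit, reclaimable_mb)
    retuned = dict(usage_mb)
    retuned[donor] -= freed
    return retuned, paged_mb(total_ram_mb, retuned)


# Step 3: the Web server has grown from 1 GB to 1.5 GB on a 2 GB machine.
usage = {"mysql": 1024, "web": 1536}
print(paged_mb(2048, usage))  # 512 MB is being paged to disk

# Steps 4 and 5: 512 MB of idle MySQL cache is handed back.
retuned, still_paged = rebalance(2048, usage, donor="mysql", reclaimable_mb=512)
print(retuned, still_paged)
```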

Miscellaneous

What is asm in a nutshell?

An adaptive service monitor capable of analyzing server performance and tuning the server so that peak throughput is sustained.

Who developed asm?

asm is developed by Apis Networks for use on Apis Networks servers.

What type of statistics are used?

We use a one-sided t-test to compare data. Data is collected on a per-server basis; that is to say, samples are compared only against the history of the server on which asm runs, not against other servers. Significance is evaluated at α = 0.05 once adequate data has been collected.
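As an illustration of such a comparison, here is a minimal sketch of a one-sample, one-sided t-test at α = 0.05 that asks whether recent load samples sit significantly above the server's historical mean. The exact test asm runs is not described here, so the function names, the choice of a one-sample test, and the sample data are assumptions. The critical values are the standard one-sided Student's t values for α = 0.05.

```python
import math
import statistics

# One-sided critical values of Student's t at alpha = 0.05,
# keyed by degrees of freedom.
T_CRIT_05 = {
    1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
    6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812,
    15: 1.753, 20: 1.725, 30: 1.697,
}


def critical_value(df):
    """Look up the critical value, rounding df down to a tabulated entry.

    Rounding down is conservative, because the critical value shrinks
    as the degrees of freedom grow.
    """
    if df < 1:
        raise ValueError("need at least one degree of freedom")
    return T_CRIT_05[max(k for k in T_CRIT_05 if k <= df)]


def significantly_higher(recent, historical_mean):
    """True if the mean of recent is significantly above historical_mean."""
    n = len(recent)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(recent)
    sd = statistics.stdev(recent)
    if sd == 0:
        return mean > historical_mean  # no variance: compare directly
    t = (mean - historical_mean) / (sd / math.sqrt(n))
    return t > critical_value(n - 1)


# Hypothetical load averages sampled on one server.
print(significantly_higher([2.9, 3.1, 3.4, 3.0, 3.2], historical_mean=1.0))
print(significantly_higher([0.9, 1.1, 1.0, 1.2, 0.8], historical_mean=1.0))
```

The first call reports a significant increase and the second does not. A lookup table keeps the sketch dependency-free; a statistics library would compute an exact p-value instead.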

Which services are monitored?

asm is capable of monitoring and tuning kernel-level metrics such as disk I/O and swaps, per-process CPU utilization, process counts, Web server throughput, sendmail, MySQL, and PostgreSQL. asm is able to monitor and restart additional services as the need arises, but cannot retune them on demand.

What are some examples of dynamic retuning?

asm can switch kernel elevators (2.6), modify table/query cache allowances in MySQL, adjust WAL and page costs in PostgreSQL, toggle keepalives in Apache, change readahead rates on drives, and even tweak swappiness.
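For the kernel-level items, Linux exposes the relevant knobs as files: the elevator at /sys/block/<device>/queue/scheduler, readahead at /sys/block/<device>/queue/read_ahead_kb, and swappiness at /proc/sys/vm/swappiness. The sketch below only builds and prints a plan of writes. How asm applies its changes is not described here, so the function names, the validation rules, and the example values are assumptions.

```python
# Elevators shipped with 2.6 kernels.
KNOWN_SCHEDULERS = {"noop", "anticipatory", "deadline", "cfq"}


def plan_kernel_tuning(device, scheduler, read_ahead_kb, swappiness):
    """Return (path, value) pairs describing the requested changes."""
    if scheduler not in KNOWN_SCHEDULERS:
        raise ValueError(f"unknown elevator: {scheduler}")
    if read_ahead_kb < 0:
        raise ValueError("read_ahead_kb must not be negative")
    # Assumption: the classic 0-100 swappiness range.
    if not 0 <= swappiness <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    queue = f"/sys/block/{device}/queue"
    return [
        (f"{queue}/scheduler", scheduler),
        (f"{queue}/read_ahead_kb", str(read_ahead_kb)),
        ("/proc/sys/vm/swappiness", str(swappiness)),
    ]


def apply_plan(plan, dry_run=True):
    """Describe each write; perform it only when dry_run is False (needs root)."""
    described = []
    for path, value in plan:
        described.append(f"{value} > {path}")
        if not dry_run:
            with open(path, "w") as handle:
                handle.write(value)
    return described


plan = plan_kernel_tuning("sda", "deadline", read_ahead_kb=256, swappiness=10)
for line in apply_plan(plan):
    print(line)
```

Defaulting to a dry run keeps the example safe to execute on any machine; nothing is written unless dry_run is turned off explicitly.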

Is asm guaranteed to keep the server up?

asm runs as a service on top of the Linux kernel. Under rare, unavoidable circumstances that force an underlying kernel panic, asm is unable to fully recover the server. However, in most cases, where load averages gradually increase to levels that the server can still operate under, asm will respond by reducing server loads.

When will asm be available on the servers?

Parts of asm are already live on our servers. How do you think we keep such a solid and high uptime? asm is built around a modularized framework, which allows us to progressively deliver new changes to enhance overall system stability. Originally, asm performed basic threshold checks, but today it has evolved from a series of range checks into proactive screenings that prevent even the rarest of problems from recurring; once is enough.