Nexpose can spider Web sites to discover their directory structures, default directories, the files and applications on their servers, broken links, inaccessible links, and other information.
The application then analyzes this data for evidence of security flaws, such as SQL injection, cross-site scripting (CSS/XSS), backup script files, readable CGI scripts, insecure password use, and other issues resulting from software defects or configuration errors.
Some built-in scan templates use the Web spider by default:
You can adjust the settings in these templates. You can also configure Web spidering settings in a custom template. The spider examines links within each Web page to determine which pages have been scanned. In many Web sites, pages that have not yet been scanned appear in the address bar as a base URL followed by a question mark and one or more parameters.
For example, in the address www.exampleinc.com/index.html?id=6, the ?id=6 parameter probably refers to the content that should be delivered to the browser. If you enable the setting to include query strings, the spider checks the full string www.exampleinc.com/index.html?id=6 against all URLs that have already been retrieved to determine whether this page has been analyzed.
If you do not enable the setting, the spider will only check the base URL without the ?id=6 parameter.
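The difference can be illustrated with a short sketch. This is not the application's actual logic; it only shows how including or excluding the query string changes whether a URL with a different ?id value is treated as a new page or as one that has already been analyzed.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url, include_query_strings):
    """Reduce a URL to the form used for duplicate checking (illustrative only)."""
    parts = urlsplit(url)
    query = parts.query if include_query_strings else ""
    # The fragment is dropped in either case; it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

candidate = "http://www.exampleinc.com/index.html?id=6"

# With query strings included, ?id=6 differs from the previously retrieved ?id=5,
# so the page is treated as new and spidered.
visited = {normalize("http://www.exampleinc.com/index.html?id=5", True)}
print(normalize(candidate, True) in visited)        # False -> page is spidered

# Without query strings, only the base URL is compared, so the page looks
# as if it has already been analyzed and is skipped.
visited_base = {normalize("http://www.exampleinc.com/index.html?id=5", False)}
print(normalize(candidate, False) in visited_base)  # True -> page is skipped
```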
To gain access to a Web site for scanning, the application makes itself appear to the Web server application as a popular Web browser. It does this by sending the server a Web page request as a browser would. The request includes pieces of information called headers. One of the headers, called User-Agent, defines the characteristics of a user’s browser, such as its version number and the Web application technologies it supports. User-Agent represents the application to the Web site as a specific browser, because some Web sites will refuse HTTP requests from browsers that they do not support. The default User-Agent string represents the application to the target Web site as Internet Explorer 7.
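As a rough illustration of the kind of request involved, the following sketch sends a page request with an explicit User-Agent header. The URL and the header value are examples only, not the application's exact default string.

```python
import urllib.request

# Example User-Agent value only; the application's actual default string
# represents it to the site as Internet Explorer 7.
headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"}

request = urllib.request.Request("http://www.exampleinc.com/index.html", headers=headers)
with urllib.request.urlopen(request, timeout=10) as response:
    page = response.read()
    print(response.status, len(page), "bytes received")
```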
Configure general Web spider settings:
Note: Selecting the Include query strings with Web spidering check box causes the spider to make many more requests to the Web server. This increases overall scan time and may affect the Web server's performance for legitimate users.
This test helps to reduce the risk of dangerous attacks via malicious code stored on Web servers. Enabling it may increase Web spider scan times.
Note: Changing the default user agent setting may alter the content that the application receives from the Web site.
If you are unsure of what to enter for the User-Agent string, consult your Web site developer.
(Optional) Enable the Web spider to check for the use of weak credentials:
As the Web spider discovers logon forms during a scan, it can determine if any of these forms accept commonly used user names or passwords, which would make them vulnerable to automated attacks that exploit this practice. To perform the check, the Web spider attempts to log on through these forms with commonly used credentials. Any successful attempt counts as a vulnerability.
Note: This check may cause authentication services with certain security policies to lock out accounts with these commonly used credentials.
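To illustrate the idea behind this check, the sketch below submits a short list of commonly used credential pairs to a hypothetical logon form on a site you are authorized to scan. The URL, field names, and success heuristic are all assumptions for illustration, not the application's actual behavior.

```python
import urllib.parse
import urllib.request

# Hypothetical test form and field names -- adjust for the form being tested.
LOGIN_URL = "http://www.exampleinc.com/login"
COMMON_CREDENTIALS = [("admin", "admin"), ("admin", "password"), ("guest", "guest")]

def login_succeeds(username, password):
    """Return True if the form appears to accept the credentials (naive heuristic)."""
    data = urllib.parse.urlencode({"username": username, "password": password}).encode()
    with urllib.request.urlopen(LOGIN_URL, data=data, timeout=10) as response:
        body = response.read().decode(errors="replace")
    # Assumed heuristic: a failed logon re-displays the form with an error message.
    return "invalid" not in body.lower()

for user, pwd in COMMON_CREDENTIALS:
    if login_succeeds(user, pwd):
        print(f"Commonly used credentials accepted: {user}/{pwd}")  # counted as a vulnerability
```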
Configure Web spider performance settings:
This option sets the maximum number of unique host names that the spider may resolve. Resolving many host names adds substantial time to the spidering process, especially with large Web sites, because of the frequent cross-link checking involved. The acceptable host range is 1 to 500.
Limiting directory depth can save significant time, especially with large sites. For unlimited directory traversal, type 0 in the field. The default value is 6.
Note: If you run recurring scheduled scans with a time limit, portions of the target site may remain unscanned at the end of the time limit. Subsequent scans will not resume where the Web spider left off, so it is possible that the target Web site may never be scanned in its entirety.
A time limit prevents scans from taking longer than the time windows allotted for scan jobs, which is especially useful with large target Web sites. If you leave the default value of 0, no time limit is applied. The acceptable range is 1 to 500.
This is a time-saving measure for large sites. The acceptable range is 1 to 1,000,000 pages.
Note: If you set both a time limit and a page limit, the Web spider will stop scanning the target Web site when the first limit is reached.
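A minimal sketch of how these limits can interact in a crawl loop is shown below. The queue handling and parameter names are assumptions for illustration, not the application's implementation; the point is that the crawl stops as soon as either the time limit or the page limit is reached, and that a value of 0 means "unlimited" for the time and depth fields, mirroring the template settings.

```python
import time
from collections import deque

def crawl(start_url, fetch_links, max_pages, max_minutes=0, max_depth=6):
    """Breadth-first crawl that stops at whichever limit is hit first.

    fetch_links(url) should return the links found on the page at url.
    max_minutes=0 or max_depth=0 means no limit for that setting.
    """
    deadline = (time.monotonic() + max_minutes * 60) if max_minutes else None
    queue = deque([(start_url, 0)])
    visited = set()

    while queue and len(visited) < max_pages:
        if deadline is not None and time.monotonic() >= deadline:
            break  # time limit reached before the page limit
        url, depth = queue.popleft()
        if url in visited or (max_depth and depth > max_depth):
            continue
        visited.add(url)
        for link in fetch_links(url):
            queue.append((link, depth + 1))
    return visited
```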
Configure Web spider settings related to regular expressions:
The application reports field names that are designated as sensitive under the vulnerability Form action submits sensitive data in the clear. Any field name that matches the regular expression is considered a sensitive data field name.
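For example, a regular expression along the lines of the hypothetical pattern below would mark common password- and account-related field names as sensitive; it is not the application's actual default pattern.

```python
import re

# Hypothetical pattern for illustration; tailor it to the field names your site uses.
SENSITIVE_FIELD_PATTERN = re.compile(r"(passw(or)?d|ssn|creditcard|account)", re.IGNORECASE)

for field_name in ("username", "Password", "ssn_last4", "comment"):
    if SENSITIVE_FIELD_PATTERN.search(field_name):
        print(f"{field_name}: treated as a sensitive data field name")
```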
Configure Web spider settings related to directory paths:
Robots.txt is a convention that prevents spiders and other Web robots from accessing all or part of a Web site that is otherwise publicly viewable.
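The sketch below uses Python's standard robots.txt parser to show how such a file restricts what a spider may fetch; the file contents and URLs are invented for the example.

```python
import urllib.robotparser

# Invented robots.txt contents for illustration.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt)

print(parser.can_fetch("*", "http://www.exampleinc.com/index.html"))      # True
print(parser.can_fetch("*", "http://www.exampleinc.com/private/a.html"))  # False
```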
Note: Scan coverage of any included bootstrap paths is subject to time and page limits that you set in the Web spider configuration. If the scan reaches your specified time or page limit before scanning bootstrap paths, it will not scan those paths.
Example: /myapp. Separate multiple entries with commas. If you leave the field blank, the spider does not include bootstrap paths in the scan.
If you specify excluded paths, the application does not attempt to spider those URLs or discover any vulnerabilities or files associated with them. If you leave the field blank, the spider does not exclude any paths from the scan.
Configure any other scan template settings as desired. When you have finished configuring the scan template, click Save.
The Web spider crawls Web servers to determine the complete layout of Web sites. It is a thorough process, which makes it valuable for protecting Web sites. Most Web application vulnerability tests are dependent on Web spidering.
Nexpose uses spider data to evaluate custom Web applications for common problems such as SQL injection, cross-site scripting (CSS/XSS), backup script files, readable CGI scripts, insecure use of passwords, and many other issues resulting from custom software defects or incorrect configurations.
By default, the Web spider crawls a site using three threads and a per-request delay of 20 ms. The amount of traffic that this generates depends on the amount of discovered, linked site content. If you’re running the application on a multiple-processor system, increase the number of spider threads to three per processor.
A complete Web spider scan will take slightly less than 90 seconds against a responsive server hosting 500 pages, assuming the target asset can serve one page on average per 150 ms. A scan against the same server hosting 10,000 pages would take approximately 28 minutes.
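As a rough way to reason about these estimates, the sketch below models scan time as the per-page response time plus the per-request delay, fetched sequentially. It is a back-of-the-envelope model under those assumptions, not the application's scheduling logic.

```python
def estimated_scan_seconds(pages, response_ms=150, delay_ms=20, threads=1):
    """Rough spider duration estimate assuming sequential fetching per thread."""
    return pages * (response_ms + delay_ms) / 1000 / threads

print(estimated_scan_seconds(500))          # ~85 s, in line with "slightly less than 90 seconds"
print(estimated_scan_seconds(10_000) / 60)  # ~28 minutes
```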
When you configure a scan template for Web spidering, enter the maximum directory depth as well as the maximum number of pages to crawl per Web site. These values can limit the amount of time that Web spidering takes. By default, the spider ignores cross-site links and stays only on the endpoint it is scanning.
If your asset inventory doesn’t include Web sites, be sure to turn this feature off. It can be very time consuming.