Recursive Link Checker (or Web Crawler)

At a previous company where I worked in QA, we needed to quickly identify problems with deployments and page changes. The department had only two people, myself and one other, so checking this by hand ate up a lot of our time and still left gaps that went unnoticed. Since we had SilkTest and Silk Performer, I developed a small script that recursively loops through each link in our site. The script does this by looking at the raw HTML and finding all of the links present on the page. It loads them into an array and starts with the first link by clicking it. It then finds all the links on that page, adds them to the array, and so forth.

I simply provided it with a starting point, excluded some links from the test (such as the one that logs the user out), and away it went. The script logs each link it visits along with the page title on a successful hit. It also logs any errors it encounters along with the link it was attempting to visit. It is very simple, and it still yields results for every release.
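Before the Silk Performer script itself, here is the idea in miniature: a depth-first walk that keeps a visited list and skips excluded links. This Python sketch is not the BDL script. It assumes a hypothetical in-memory SITE dictionary in place of real page loads, and the names SITE, EXCLUDED_FRAGMENTS, and crawl are made up for illustration.

```python
# Hypothetical stand-in for a real site: page URL -> links found in its HTML.
SITE = {
    "/home": ["/reports", "/help", "/logout.ssp"],
    "/reports": ["/home", "/reports/item.xls", "/missing"],
    "/help": ["/home"],
}

# Assumed exclusion fragments, similar in spirit to skipping the logout link.
EXCLUDED_FRAGMENTS = ("logout", ".xls")


def crawl(start):
    """Depth-first walk of SITE from start; returns (visited, errors)."""
    visited = []  # doubles as the "already visited" list
    errors = []   # links whose page could not be loaded

    def visit(url):
        visited.append(url)
        if url not in SITE:
            errors.append(url)  # treat an unknown page as a failed load
            return
        for link in SITE[url]:
            if link in visited:
                continue  # never click the same link twice
            if any(fragment in link for fragment in EXCLUDED_FRAGMENTS):
                continue  # skip links we were told to avoid
            visit(link)   # recurse into the newly found page

    visit(start)
    return visited, errors
```

Calling crawl("/home") walks every reachable page once and collects the links that failed to load, which is roughly what the log output of the real script gives you.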

use "WebAPI.bdh"
use "kernel.bdh"

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Global Parameters and Variables
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  psSiteAdminStaffColum1: Staff ID to login to site
//  aDoNotVisit: array that contains links not to visit or already visited.
//  sSiteId: Site ID used to login to the site for testing.
//  sUrlPostBin: Base domain name for testing
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

dclparam
  aDoNotVisit: array [999999] of string;
  sSiteId: string;
  sUrlPostBin: string;

dclfunc

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  CheckTitle – Function to check for errors loading page or if a page was loaded that is an error page.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  nPageNotFound: Flag that is set if the error "Page Not Found" is encountered.
//    1  :  Page not found encountered.
//    0  :  Page not found not encountered.
//  nAccessDenied: Flag that is set if the error "Access not Authorized" is encountered.
//    1  :  Access not authorized encountered.
//    0  :  Access not authorized not encountered.
//  nError
//    1  :  Error page encountered (Sorry for the Inconvenience).
//    0  :  Error not encountered.
//  bURL: Boolean value for whether the page loaded.
//        TRUE :  Page loaded successfully.
//        FALSE:  Page load unsuccessful.
//  sLink: Link to visit
//  bError: Boolean to track if a known error is encountered.
//        TRUE :  Known error encountered.
//        FALSE:  No known error encountered.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

function CheckTitle ( nPageNotFound : number ; nAccessDenied : number ; nError : number; bURL : boolean ; sLink : string )
  var
    bError: boolean;
  begin
    //Reset error encountered for page to false.
    bError := FALSE;

    //Check to see if the page was not found based on text pulled from the page.
    if nPageNotFound <> 0 then
      bError := TRUE;
      writeln;
      writedata (" -> ERROR (Page Not Found) on " + sLink);
      writeln;
    end;

    //Check to see if the page encountered an error.
    if nError <> 0 then
      bError := TRUE;
      writeln;
      writedata (" -> ERROR (Error Encountered) on " + sLink);
      writeln;
    end;

    //Check to see if the page was denied based on text pulled from the page.
    if nAccessDenied <> 0 then
      bError := TRUE;
      writeln;
      writedata (" -> ERROR (Access denied) on " + sLink);
      writeln;
    end;

    //Check to see if bURL was flagged as false (error loading page) and bError was
    // flagged as false (other error conditions did not catch this)
    if (bURL = FALSE) and (bError = FALSE) then
      writeln;
      writedata (" -> ERROR (UNKNOWN) on " + sLink);
      writeln;
    end;

  end CheckTitle;

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  CheckLinkForExclusion – Checks current link against links not to visit and links
//    already visited.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  sLink: Link to visit.
//  bExclude: Boolean to determine if the link should be skipped.
//         TRUE :  Link is excluded and should not be visited.
//         FALSE :  Link is not excluded and may be visited.
//  nDoNotVisit: Ending counter to loop through all existing aDoNotVisit links.
//  i: Beginning counter to loop through all existing aDoNotVisit links.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

function CheckLinkForExclusion(sLink : string) : boolean
  var
    bExclude: boolean;
    nDoNotVisit: number;
    i: number;
  begin
    nDoNotVisit  := 1;
    bExclude := FALSE;

    //Count up links that are in the "Do Not Visit" list.
    while (aDoNotVisit[nDoNotVisit] <> "") do
      nDoNotVisit := nDoNotVisit + 1;
    end;

    //Check link against the "Do Not Visit" list.
    for i := 1 to nDoNotVisit do
      if sLink = aDoNotVisit[i] then
        bExclude := TRUE;
      end;
    end;

    //Check link against logout.
    if (WebPageQueryLink(sLink) = WebPageQueryLink("logout.ssp")) then
      bExclude := TRUE;
    end;

    //Check link against PDF button.
    if (WebPageQueryLink(sLink) = WebPageQueryLink(".sap")) then
      bExclude := TRUE;
    end;

    //Check link against mailto button.
    if (WebPageQueryLink(sLink) = WebPageQueryLink("mailto")) then
      bExclude := TRUE;
    end;

    if (WebPageQueryLink(sLink) = WebPageQueryLink("item.xls")) then
      bExclude := TRUE;
    end;

    //Check link against the documentation main files.
    //(we don't want to navigate each link inside of the document)
    if ((WebPageQueryLink(sLink) = WebPageQueryLink("PSHelp")) or (WebPageQueryLink(sLink) = WebPageQueryLink("docsPerformance"))) then
      bExclude := TRUE;
    end;

    //Check link against the documentation main files.
    //(we don't want to navigate each link inside of the document)
    if ((WebPageQueryLink(sLink) = WebPageQueryLink("ASHelp")) or (WebPageQueryLink(sLink) = WebPageQueryLink("docsAchievement"))) then
      bExclude := TRUE;
    end;

    //Check link for an Excel file. Silk Performer does not seem to handle opening
    // multiple of these well.
    if ((WebPageQueryLink(sLink) = WebPageQueryLink(".xls")) OR (WebPageQueryLink(sLink) = WebPageQueryLink(".csv"))) then
      bExclude := TRUE;
    end;

    //Check link for a PDF or RTF file. Silk Performer does not seem to handle opening
    // multiple of these well.
    if ((WebPageQueryLink(sLink) = WebPageQueryLink(".pdf")) OR (WebPageQueryLink(sLink) = WebPageQueryLink(".rtf"))) then
      bExclude := TRUE;
    end;

    CheckLinkForExclusion := bExclude;
  end CheckLinkForExclusion;

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  VisitLink – Log the link you are visiting and click the link.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  sLink: Link to visit.
//  nPageNotFound: Flag that is set if the error "Page Not Found" is encountered.
//    1  :  Page not found encountered.
//    0  :  Page not found not encountered.
//  nAccessDenied: Flag that is set if the error "Access not Authorized" is encountered.
//    1  :  Access not authorized encountered.
//    0  :  Access not authorized not encountered.
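//  nError: Flag that is set if the error page "Sorry for the Inconvenience" is encountered.
//    1  :  Error page encountered.
//    0  :  Error page not encountered.
//  bURL: Boolean value for whether the page loaded.
//    TRUE :  Page loaded successfully.
//    FALSE:  Page load unsuccessful.
//  sPageTitle: Title parsed from the loaded page.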
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

function VisitLink ( sLink : string )
 var
   nPageNotFound : number;
   nAccessDenied : number;
   nError : number;
   bURL : boolean;
   sPageTitle : string;
 begin

   print ("Visit Link");
   //Check for Page Not Found text on page.
   WebVerifyHtml("The page cannot be found", NULL, WEB_FLAG_IGNORE_WHITE_SPACE | WEB_FLAG_EQUAL | WEB_FLAG_CASE_SENSITIVE, NULL,
     SEVERITY_INFORMATIONAL, nPageNotFound);
   //Check for Access Not Authorized text on page.
   WebVerifyHtml("Access Not Authorized", NULL, WEB_FLAG_IGNORE_WHITE_SPACE | WEB_FLAG_EQUAL | WEB_FLAG_CASE_SENSITIVE, NULL,
     SEVERITY_INFORMATIONAL, nAccessDenied);
   //Check for Error text on page.
   WebVerifyHtml("Sorry for the Inconvenience", NULL, WEB_FLAG_IGNORE_WHITE_SPACE | WEB_FLAG_EQUAL | WEB_FLAG_CASE_SENSITIVE, NULL,
     SEVERITY_INFORMATIONAL, nError);

   //Get Web Page Title
   WebParseHtmlTitle(sPageTitle, STRING_COMPLETE);

   //Click on link, store result in bURL (TRUE = page load successful,
   // FALSE = page load unsuccessful)
   bURL := WebPageURL ( sLink );

   CheckTitle ( nPageNotFound, nAccessDenied, nError, bURL, sLink );

   writeln;
   writedata (" -> Visiting link " + sLink + " titled " + sPageTitle);
   writeln;
 end VisitLink;

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  GetLinksOnPage – Query page for all links that can be clicked.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//  sReturnLink : Link to return to once all links on page are visited.
//  i,j : Loop counter.
//  nLinksOnPage : Total number of links found on page.
//  sBaseHref : String used to filter out links not inside the domain.
//  sLink : Link to visit.
//  bExclude : Boolean to determine if the link should be skipped.
//    TRUE  :  Link is excluded and should not be visited.
//    FALSE :  Link is not excluded and may be visited.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

function GetLinksOnPage ( sReturnLink : string optional ): number
  var
    i : number;
    j : number;
    nLinksOnPage : number;
    sBaseHref : string;
    sLink : string;
    bExclude : boolean;
  begin
    sBaseHref                           :=  "yourbaseurlhere";
    j                                   :=  1;

    print ("GetLinksOnPage");
    //Count all links on page.
    while  (WebPageQueryLink( sBaseHref, nLinksOnPage ) > 0) do
      nLinksOnPage := nLinksOnPage + 1;
    end;
    print ("LinksOnPage: " + String(nLinksOnPage));

    if nLinksOnPage > 1 then
      for i := 1 to nLinksOnPage do
        //Get Next Link Name and URL.
        WebPageQueryLink( sBaseHref, i , NULL , NULL , sLink );

        //Check to see if we have visited this link before.
        bExclude := CheckLinkForExclusion ( sLink );
        //If the link is not on the Do Not Visit list, click the link,
        //add it to the list and get the next set of links.
        if bExclude = FALSE then
          //Determine how many links are on the Do Not Visit list.
          while ( aDoNotVisit[j] <> "" ) do
            j := j + 1;
          end;
          //Click the link
          VisitLink( sLink );
          aDoNotVisit[j] := sLink;
          GetLinksOnPage( sLink );
          //Return to the calling page once the sub-links have been
          // visited to resume verifying links.
          if sReturnLink <> "" then
            VisitLink ( sReturnLink );
          end;
        end;
      end;
    end;
  end GetLinksOnPage;

dcluser
  user
    VUser
  transactions
    TInit : begin;
    TMain : 1;
    TShutdown : end;
  var

dcltrans
  transaction  TInit
  var
    hSiteAdminStaff1 : number;
    nUser : number;
    sUser : string;
    sUser2 : string;

  begin
    //QA
    sSiteId :=  "login identifier";
    sUrlPostBin := "your link here";

    WebSetBrowser(WEB_BROWSER_MSIE6);
    WebModifyHttpHeader("Accept-Language", "en-gb");  

    //////////////////////////////////////////////////////////////////////////////
    //Retrieve the number of the current user based on user name (ie: VirtualUser_1).
    //////////////////////////////////////////////////////////////////////////////  

    sUser := Strchr(GetUser(), ord('_'));
    if sUser = "" then
      RaiseError(0, "Unable to determine user id: (" + sUser + ")", SEVERITY_TRANS_EXIT);
    end;
    nUser := number(Substr (sUser, sUser2, 2, 99));

  end  TInit;

  transaction  TMain
  var
    i : number;
    nAccessDenied : number;
    nPageNotFound : number;
    nError : number;
    bURL : boolean;
    bDoNotVisit : boolean;

  begin
    print ("Main");
    //Links NOT to visit
    aDoNotVisit[1]  :=  sUrlPostBin + "/location/change.ssp";
    aDoNotVisit[2]  :=  sUrlPostBin + "/help/terms.ssp";    

    WebCookieSet("SiteCodeCookie=" + sSiteId + "; domain=.yourdomainhere.com; path=/; expires=Sat, 25 Jul 2009 21:05:09 GMT",
      sUrlPostBin);
    WebPageUrl(sUrlPostBin, "Administrative Login");

    WebPageSubmit("Login", LOGIN001, "Home");

    bURL := WebPageUrl(sURLPostBin);
    Print (string(bUrl));

    CheckTitle ( nPageNotFound, nAccessDenied, nError, bURL, sURLPostBin );

    Print (sURLPostBin);
    GetLinksOnPage( sURLPostBin );

    i := 1;

    while aDoNotVisit[i] <> "" do
      writeln;writedata ("[" + string(i) + "] - " + aDoNotVisit[i]);
      i := i + 1;
    end;

  end TMain;

  transaction TShutdown
  begin
  end TShutdown;

dclform
  LOGIN001:

    "JavaScriptTest" := "yes", // hidden, changed(!)
    "SiteCode" := sSiteId , // unchanged,
    "Username" := "yourusernamehere",
    "Password" := "yourpasswordhere", // changed
    "_PageAction" := ""  ; // hidden, unchanged, value: "o"
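
The exclusion checks in CheckLinkForExclusion appear intended to ask one question: is this link already on the Do Not Visit list, or does it contain one of a handful of fragments? A minimal Python sketch of that idea (again, not BDL) is below. The function name is_excluded and the sample URLs are hypothetical; the fragment list is copied from the script above.

```python
# Fragments taken from the checks in CheckLinkForExclusion.
EXCLUDED_FRAGMENTS = [
    "logout.ssp", ".sap", "mailto", "PSHelp", "docsPerformance",
    "ASHelp", "docsAchievement", ".xls", ".csv", ".pdf", ".rtf",
]


def is_excluded(link, do_not_visit):
    """Return True if the link should be skipped."""
    # Already visited, or explicitly listed as off limits.
    if link in do_not_visit:
        return True
    # Contains a fragment we never want to follow (logout, downloads, help docs).
    return any(fragment in link for fragment in EXCLUDED_FRAGMENTS)
```

For example, with a do-not-visit list of ["/help/terms.ssp"], a link such as "/account/logout.ssp" or "/reports/summary.pdf" is skipped, while "/reports/index.ssp" is still visited.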

About Mike

I am a Software Quality Assurance Professional who recently graduated college with a Bachelor of Science in Computer Information Systems.
This entry was posted to the following categories: Silk Performer.
