root/Oni2/Validate External Links/validate_external_links.sh

Comparing Validate External Links/validate_external_links.sh (file contents):
Revision 1069 by iritscen, Wed Aug 2 04:26:48 2017 UTC vs.
Revision 1070 by iritscen, Tue Oct 3 03:01:32 2017 UTC

# Line 17 | Line 17 | IFS="
17   LINKS_URL=""        # use 'curl' to download file with links from this location (can be file://)
18   EXCEPT_URL=""       # ditto above for file with exceptions to NG results
19   OUTPUT_DIR=""       # place reports and all other output in a folder inside this existing folder
20 < RECORD_OK_LINKS=0   # record response code to the log whether it's a value in OK_CODES or NG_CODES
20 > RECORD_OK_LINKS=0   # record response code to the log even when it's a value in OK_CODES
21   SUGGEST_SNAPSHOTS=0 # query the Internet Archive for a possible snapshot URL for each NG page
22   TAKE_PAGE_SHOT=0    # take a screenshot of each OK page
23 + CHROME_PATH=""      # path to a copy of Google Chrome that has the command-line screenshot feature
24   URL_START=1         # start at this URL in LINKS_FILE (1 by default)
25   URL_LIMIT=0         # if non-zero, stop at this URL in LINKS_FILE
26   UPLOAD_INFO=""      # path to a file on your hard drive with the login info needed to upload a report
27  
28   # Fixed strings -- see the occurrences of these variables to learn their purpose
29 < AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0"
29 > AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0"
30   ARCHIVE_API="http://archive.org/wayback/available"
31   ARCHIVE_GENERIC="https://web.archive.org/web/*"
32   ARCHIVE_OK_CODES="statuscodes=200&statuscodes=203&statuscodes=206"
32 CHROME="/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary"
33   CHROME_SCREENSHOT="screenshot.png"
34   CURL_CODES="http://iritscen.oni2.net/val/curl_codes.txt"
35   EXPECT_SCRIPT_NAME="val_expect_sftp.txt"
# Line 44 | Line 44 | declare -a NS_IDS=(-2 -1 0 1 2 3 4 5 6 7
44   declare -a NS_NAMES=("Media" "Special" "Main" "Talk" "User" "User_talk" "OniGalore" "OniGalore_talk" "File" "File_talk" "MediaWiki" "MediaWiki_talk" "Template" "Template_talk" "Help" "Help_talk" "Category" "Category_talk" "BSL" "BSL_talk" "OBD" "OBD_talk" "AE" "AE_talk" "Oni2" "Oni2_talk" "XML" "XML_talk")
45  
46   # These arrays tell the script which suffixes at the ends of URLs represent files and which are pages.
47 < # This determines whether the script tries to take a screenshot of the page or just gets its HTTP code.
47 > # This determines whether the script tries to take a screenshot of the URL or just gets its HTTP code.
48   declare -a HTTP_FILES=(txt zip wmv jpg png m4a bsl rar oni mp3 mov ONWC vbs TRMA mp4 doc avi log gif pdf dmg exe cpp tga 7z wav east BINA xml dll dae xaf fbx 3ds blend flv csv)
49   declare -a HTTP_TLDS_AND_PAGES=(com net org uk ru de it htm html php pl asp aspx shtml pgi cgi php3 x jsp phtml cfm css action stm js)
50  
# Line 59 | Line 59 | declare -a NG_CODES=(000 403 404 410 500
59   # transcluded text, and if the transclusion fails, then the braces show up in the URL
60   ILLEGAL_CHARS="{ }"
61  
62 + # The shortest URL possible, used for sanity-checking some URLs: http://a.co
63 + MIN_URL_LENGTH=11
64 +
65   # These are parallel arrays giving the prefixes that can be used in place of normal external links to
66   # some wikis and other sites
67 < declare -a INTERWIKI_PREFIXES=(metawikipedia wikipedia wikiquote wiktionary)
68 < declare -a INTERWIKI_DOMAINS=(meta.wikipedia.org wikipedia.org wikiquote.org wiktionary.org)
67 > declare -a INTERWIKI_PREFIXES=(commons metawikimedia mw wikibooks wikidata wikimedia wikinews wikiquote wikisource wikispecies wikiversity wikivoyage wikt wp)
68 > declare -a INTERWIKI_DOMAINS=(commons.wikimedia.org meta.wikimedia.org mediawiki.org wikibooks.org wikidata.org wikimediafoundation.org wikinews.org wikiquote.org wikisource.org species.wikimedia.org wikiversity.org wikivoyage.org wiktionary.org wikipedia.org)
69  
70   # Variables for keeping track of main loop progress and findings
71   LINK_NUM=0
72 + EI_LINKS=0
73 + IW_LINKS=0
74   OK_LINKS=0
75   RD_LINKS=0
71 IW_LINKS=0
76   NG_LINKS=0
77   SKIP_UNK_NS=0
78   SKIP_JS_PAGE=0
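
The INTERWIKI_PREFIXES and INTERWIKI_DOMAINS declarations above are parallel arrays: the prefix at index i belongs to the domain at index i, which is what lets the matching loop further down in this diff turn a matched domain straight into wiki markup. A minimal standalone sketch of that correspondence, with shortened arrays and a made-up URL:

   #!/bin/bash
   # Parallel arrays: PREFIXES[i] is the interwiki prefix for DOMAINS[i]
   declare -a PREFIXES=(commons wikiquote wp)
   declare -a DOMAINS=(commons.wikimedia.org wikiquote.org wikipedia.org)

   URL="https://en.wikipedia.org/wiki/Oni_(video_game)"   # hypothetical external link

   for ((i = 0; i < ${#DOMAINS[@]}; ++i)); do
      if [[ $URL == *${DOMAINS[$i]}* ]]; then
         echo "Matched ${DOMAINS[$i]}; suggest prefix \"${PREFIXES[$i]}:\""
         break
      fi
   done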
# Line 76 | Line 80 | SKIP_BAD_URL=0
80   SKIP_NON_ASCII=0
81   SKIP_UNK_SUFFIX=0
82   SKIP_UNK_CODE=0
83 < SKIP_EXCEPT=0
83 > SKIP_EXPECT_NG=0
84 > SKIP_EXPECT_EI=0
85 > SKIP_EXPECT_IW=0
86   FILE_LINKS=0
87   PAGE_LINKS=0
88   SKIPPED_HEADER_ROW=0
# Line 95 | Line 101 | NAME
101  
102   SYNOPSIS
103         validate_external_links.sh --help
104 <       validate_external_links.sh --links URL --output PATH [--exceptions FILE]
105 <          [--record-ok-links] [--suggest-snapshots] [--take-screenshots]
106 <          [--start-url NUM] [--end-url NUM] [--upload PATH]
104 >       validate_external_links.sh --links URL --output DIR [--exceptions URL]
105 >          [--record-ok-links] [--suggest-snapshots] [--take-screenshots DIR]
106 >          [--start-url NUM] [--end-url NUM] [--upload FILE]
107  
108   DESCRIPTION
109         This script parses a list of external links found in the OniGalore wiki
110         (which is dumped by the Oni2.net domain periodically in a particular
111         format), validates them using the Unix tool 'curl', and produces a report
112 <       of which links were OK (responded positively to an HTTP query), which
113 <       were RD (responded with a 3xx redirect code), which could be IW (inter-
114 <       wiki) links, and which were NG (no good; a negative response to the
112 >       of which links were "OK" (responded positively to an HTTP query), which
113 >       were "RD" (responded with a 3xx redirect code), which could be "IW"
114 >       (interwiki) links, which are "EI" (external internal) links and could be
115 >       intrawiki links, and which were "NG" (no good; a negative response to the
116         query). This report can then be automatically uploaded to the location of
117         your choice. The script can also suggest Internet Archive snapshots for
118 <       NG links, and take screenshots of OK links for visual verification by the
119 <       reader that the page in question is the one intended to be displayed.
118 >       "NG" links, and take screenshots of "OK" links for visual verification by
119 >       the reader that the page in question is the one intended to be displayed.
120  
121         You must pass this script the URL at which the list of links is found
122 <       (--links) and the path where logs should be outputted (--output). All
123 <       other arguments are optional.
122 >       (--links) and the path where the directory of logs should be outputted
123 >       (--output). All other arguments are optional.
124  
125   OPTIONS
126 <       --help              Show this page
127 <       --links URL         URL from which to download file with external links
128 <                           (note that this can be a local file if you use the
129 <                           file:// protocol) (required)
130 <       --output DIR        Place the folder which will contain the reports and
131 <                           optional screenshots at this path (required)
132 <       --exceptions URL    In order to remove links from the list which show as
133 <                           NG but which you regard as OK, prepare a plain-text
134 <                           file where each line contains a response code being
135 <                           returned and the URL returning it, separated by a
136 <                           comma, e.g. "403,http://www.example.com" (note that
137 <                           this can be a local file if you use the
138 <                           file:// protocol)
139 <       --record-ok-links   Log a link in the report even if its response code is
140 <                           OK
141 <       --suggest-snapshots Query the Internet Archive for a possible snapshot
142 <                           URL for each NG page
143 <       --take-screenshots  Save screenshots of each OK page (requires Google
144 <                           Chrome to be found at the path in CHROME)
145 <       --start-url NUM     Start at this link in the links file
146 <       --end-url NUM       Stop at this link in the links file
147 <       --upload FILE       Upload report using info in this local file
126 >       --help                 Show this page.
127 >       --links URL            (required) URL from which to download the CSV file
128 >                              with external links. Note that this URL can be a
129 >                              local file if you supply a file:// path.
130 >       --output DIR           (required) Place the folder which will contain the
131 >                              reports and optional screenshots at this (Unix-
132 >                              format) path.
133 >       --exceptions URL       In order to remove links from the report which Val
134 >                              finds an issue with, but which you regard as OK,
135 >                              list those desired exceptions in this file. See
136 >                              the sample file exceptions.txt for details. Note
137 >                              that this text file can be a local file if you
138 >                              supply a file:// path.
139 >       --record-ok-links      Log a link in the report even if its response code
140 >                              is "OK".
141 >       --suggest-snapshots    Query the Internet Archive for a possible snapshot
142 >                              URL for each "NG" page.
143 >       --take-screenshots DIR Use the copy of Google Chrome at this path to take
144 >                              screenshots of each "OK" page.
145 >       --start-url NUM        Start at this link in the link dump CSV file.
146 >       --end-url NUM          Stop at this link in the link dump CSV file.
147 >       --upload FILE          Upload report using the credentials in this local
148 >                              text file. See sftp_login.txt for example.
149  
150   BUGS
151         The script cannot properly parse any line in the external links file
# Line 157 | Line 165 | fi
165   # Parse arguments as long as there are more arguments to process
166   while (( "$#" )); do
167     case "$1" in
168 <      --links )             LINKS_URL="$2";      shift 2;;
169 <      --exceptions )        EXCEPT_URL="$2";     shift 2;;
170 <      --output )            OUTPUT_DIR="$2";     shift 2;;
171 <      --record-ok-links )   RECORD_OK_LINKS=1;   shift;;
172 <      --suggest-snapshots ) SUGGEST_SNAPSHOTS=1; shift;;
173 <      --take-screenshots )  TAKE_PAGE_SHOT=1;    shift;;
174 <      --start-url )         URL_START=$2;        shift 2;;
175 <      --end-url )           URL_LIMIT=$2;        shift 2;;
176 <      --upload )            UPLOAD_INFO=$2;      shift 2;;
168 >      --links )             LINKS_URL="$2";                     shift 2;;
169 >      --exceptions )        EXCEPT_URL="$2";                    shift 2;;
170 >      --output )            OUTPUT_DIR="$2";                    shift 2;;
171 >      --record-ok-links )   RECORD_OK_LINKS=1;                  shift;;
172 >      --suggest-snapshots ) SUGGEST_SNAPSHOTS=1;                shift;;
173 >      --take-screenshots )  TAKE_PAGE_SHOT=1; CHROME_PATH="$2"; shift 2;;
174 >      --start-url )         URL_START=$2;                       shift 2;;
175 >      --end-url )           URL_LIMIT=$2;                       shift 2;;
176 >      --upload )            UPLOAD_INFO=$2;                     shift 2;;
177        * )                   echo "Invalid argument $1 detected. Aborting."; exit 1;;
178    esac
179   done
180  
181   # If the required arguments were not supplied, print help page and quit
182   if [ -z $LINKS_URL ] || [ -z $OUTPUT_DIR ]; then
183 <   printHelp
176 <   echo "Error: I did not receive one or both required arguments."
183 >   echo "Error: I did not receive one or both required arguments. Run me with the \"--help\" argument for documentation."
184     exit 2
185   fi
186  
187 + # If user wants screenshots, make sure path to Chrome was passed in and is valid
188 + if [ $TAKE_PAGE_SHOT -eq 1 ]; then
189 +   if [ ! -f "$CHROME_PATH" ]; then
190 +      echo "Error: You need to supply a path to the Google Chrome application in order to take screenshots."
191 +      exit 3
192 +   fi
193 + fi
194 +
195   # Check that UPLOAD_INFO exists, if this argument was supplied
196   if [ ! -z $UPLOAD_INFO ] && [ ! -f "$UPLOAD_INFO" ]; then
197     echo "Error: The file $UPLOAD_INFO supplied by the --upload argument does not appear to exist. Aborting."
198 <   exit 3
198 >   exit 4
199   fi
200  
201   # Check that OUTPUT_DIR is a directory
202   if [ ! -d "$OUTPUT_DIR" ]; then
203     echo "Error: The path $OUTPUT_DIR supplied by the --output argument does not appear to be a directory. Aborting."
204 <   exit 4
204 >   exit 5
205   fi
206  
207   # Make timestamped folder inside OUTPUT_DIR for this session's log and screenshots
# Line 207 | Line 222 | fi
222   # Check that 'mkdir' succeeded
223   if [ ! -d "$OUTPUT_PATH" ]; then
224     echo "Error: I could not create the folder \"$OUTPUT_FOLDER\" inside the directory $OUTPUT_PATH. Aborting."
225 <   exit 5
225 >   exit 6
226   fi
227  
228   # Get date on the file at LINKS_URL and print to log
229   LINKS_DATE=$(curl --silent --head $LINKS_URL | grep "Last-Modified")
230   if [ -z "$LINKS_DATE" ]; then
231     echo "Error: I could not find the external links file at the path \"$LINKS_URL\" supplied by the --links argument. Aborting."
232 <   exit 6
232 >   exit 7
233   fi
234   LINKS_DATE=${LINKS_DATE#Last-Modified: }
235  
# Line 277 | Line 292 | function printHTMfooter()
292   }
293  
294   # The central logging function. The first parameter is a string composed of one or more characters that
295 < # indicates which output to use: 'c' means console, 't' means the TXT log, 'r' means the RTF log, and
295 > # indicate which output to use: 'c' means console, 't' means the TXT log, 'r' means the RTF log, and
296   # 'h' means the HTML log. 'n' means "Don't print a newline at the end of the line." 'w' means "Don't
297   # pass console output through 'fmt'" ("fmt" fits the output to an 80-column CLI but can break special
298   # formatting and the 'n' option).
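
The valPrint function itself is outside this diff, but the flag convention the corrected comment describes can be illustrated with a hypothetical stand-in; the real function also handles RTF/HTML formatting, 'fmt' wrapping, and the 'n' and 'w' modifiers:

   #!/bin/bash
   # Hypothetical stand-in for valPrint, only to illustrate the c/t/r/h flags
   TXT_LOG=/tmp/val.txt; RTF_LOG=/tmp/val.rtf; HTM_LOG=/tmp/val.htm
   function demoPrint()
   {
      local FLAGS="$1"; shift
      [[ $FLAGS == *c* ]] && echo "$@"                # 'c' = console
      [[ $FLAGS == *t* ]] && echo "$@" >> "$TXT_LOG"  # 't' = TXT log
      [[ $FLAGS == *r* ]] && echo "$@" >> "$RTF_LOG"  # 'r' = RTF log
      [[ $FLAGS == *h* ]] && echo "$@" >> "$HTM_LOG"  # 'h' = HTML log
   }
   demoPrint ctrh "Goes to the console and to all three logs."
   demoPrint trh "Goes only to the three log files."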
# Line 416 | Line 431 | function wrapupAndExit()
431     valPrint ctrh "I skipped $LINKS_SKIPPED $(pluralCheckNoun link $LINKS_SKIPPED), and found $FILE_LINKS $(pluralCheckNoun file $FILE_LINKS) and $PAGE_LINKS $(pluralCheckNoun page $PAGE_LINKS)."
432     if [ $LINKS_SKIPPED -gt 0 ]; then valPrint ctrh "Skip breakdown: "; fi
433     if [ $SKIP_UNK_NS -gt 0 ]; then valPrint ctrh "- $SKIP_UNK_NS unknown $(pluralCheckNoun namespace $SKIP_UNK_NS)"; fi
434 <   if [ $SKIP_JS_PAGE -gt 0 ]; then valPrint ctrh "- $SKIP_JS_PAGE links on JavaScript $(pluralCheckNoun page $SKIP_JS_PAGE)"; fi
434 >   if [ $SKIP_JS_PAGE -gt 0 ]; then valPrint ctrh "- $SKIP_JS_PAGE $(pluralCheckNoun link $SKIP_JS_PAGE) on $(pluralCheckA $SKIP_JS_PAGE)JavaScript $(pluralCheckNoun page $SKIP_JS_PAGE)"; fi
435     if [ $SKIP_BAD_URL -gt 0 ]; then valPrint ctrh "- $SKIP_BAD_URL illegal $(pluralCheckNoun URL $SKIP_BAD_URL)"; fi
436     if [ $SKIP_NON_ASCII -gt 0 ]; then valPrint ctrh "- $SKIP_NON_ASCII non-ASCII $(pluralCheckNoun URL $SKIP_NON_ASCII)"; fi
437     if [ $SKIP_UNK_SUFFIX -gt 0 ]; then valPrint ctrh "- $SKIP_UNK_SUFFIX unknown URL $(pluralCheckNoun suffix $SKIP_UNK_SUFFIX)"; fi
438     if [ $SKIP_UNK_CODE -gt 0 ]; then valPrint ctrh "- $SKIP_UNK_CODE unknown response $(pluralCheckNoun code $SKIP_UNK_CODE)"; fi
439 <   valPrint ctrh "Out of the $LINKS_CHECKED links checked, $IW_LINKS could be $(pluralCheckAn $IW_LINKS)interwiki $(pluralCheckNoun link $IW_LINKS), $OK_LINKS $(pluralCheckWas $OK_LINKS) OK, $RD_LINKS $(pluralCheckWas $RD_LINKS) $(pluralCheckA $RD_LINKS)redirection $(pluralCheckNoun notice $RD_LINKS), and $NG_LINKS $(pluralCheckWas $NG_LINKS) NG."
440 <   if [ $SKIP_EXCEPT -gt 0 ]; then
441 <      valPrint ctrh "$SKIP_EXCEPT/$NG_LINKS NG $(pluralCheckNoun link $NG_LINKS) went unlisted due to being found in the exceptions file."
439 >   valPrint ctrh "Out of the $LINKS_CHECKED links checked, $EI_LINKS could be $(pluralCheckAn $EI_LINKS)intrawiki $(pluralCheckNoun link $EI_LINKS), $IW_LINKS could be $(pluralCheckAn $IW_LINKS)interwiki $(pluralCheckNoun link $IW_LINKS), $OK_LINKS $(pluralCheckWas $OK_LINKS) OK, $RD_LINKS $(pluralCheckWas $RD_LINKS) $(pluralCheckA $RD_LINKS)redirection $(pluralCheckNoun notice $RD_LINKS), and $NG_LINKS $(pluralCheckWas $NG_LINKS) NG."
440 >   if [ $SKIP_EXPECT_NG -gt 0 ]; then
441 >      valPrint ctrh "$SKIP_EXPECT_NG/$NG_LINKS NG $(pluralCheckNoun link $NG_LINKS) went unlisted due to being found in the exceptions file."
442 >   fi
443 >   if [ $SKIP_EXPECT_EI -gt 0 ]; then
444 >      valPrint ctrh "$SKIP_EXPECT_EI/$EI_LINKS external internal $(pluralCheckNoun link $EI_LINKS) went unlisted due to being found in the exceptions file."
445     fi
446 +   if [ $SKIP_EXPECT_IW -gt 0 ]; then
447 +      valPrint ctrh "$SKIP_EXPECT_IW/$IW_LINKS potential intrawiki $(pluralCheckNoun link $IW_LINKS) went unlisted due to being found in the exceptions file."
448 +   fi
449 +   valPrint trh "ValExtLinks says goodbye."
450     printRTFfooter
451     printHTMfooter
452  
# Line 459 | Line 481 | fi
481  
482   # Attempt to download file at EXCEPT_URL, then check that it succeeded
483   if [ ! -z $EXCEPT_URL ]; then
484 <   valPrint cwtrh "Downloading list of NG exceptions from $EXCEPT_URL."
484 >   valPrint cwtrh "Downloading list of reporting exceptions from $EXCEPT_URL."
485     EXCEPT_FILE_NAME=$(echo "$EXCEPT_URL" | sed 's/.*\///')
486     EXCEPT_FILE="$OUTPUT_PATH/$EXCEPT_FILE_NAME"
487     curl --silent -o "$EXCEPT_FILE" $EXCEPT_URL
# Line 486 | Line 508 | else
508   fi
509  
510   # Print settings to console and log
511 < declare -a SETTINGS_MSG=(I will be checking the response code of each link "and will" take a screenshot of each page. Pages that are OK will "also" be logged. I "will" ask the Internet Archive for a suggested snapshot URL for each NG page. "I will not print NG links that are listed in the exceptions file.")
511 > declare -a SETTINGS_MSG=(I will be checking the response code of each link "and will" take a screenshot of each page. Pages that are OK will "also" be logged. I "will" ask the Internet Archive for a suggested snapshot URL for each NG page. "I will not report links that are listed in the exceptions file.")
512   if [ $TAKE_PAGE_SHOT -eq 0 ]; then SETTINGS_MSG[10]="but will not"; fi
513   if [ $RECORD_OK_LINKS -eq 0 ]; then SETTINGS_MSG[22]="not"; fi
514   if [ $SUGGEST_SNAPSHOTS -eq 0 ]; then SETTINGS_MSG[26]="will not"; fi
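
The SETTINGS_MSG trick above works because an unquoted 'declare -a' splits the sentence into one array element per word, while each quoted phrase ("and will", "also", "will", and the final sentence about the exceptions file) stays a single element, so a flag check can flip the wording by overwriting one index, as the three 'if' lines above do. A stripped-down illustration of the same technique with made-up wording:

   #!/bin/bash
   # Element 1 ("will indeed") is quoted so its two words stay one replaceable element.
   declare -a MSG=(I "will indeed" take screenshots.)
   TAKE_PAGE_SHOT=0
   if [ $TAKE_PAGE_SHOT -eq 0 ]; then MSG[1]="will not"; fi
   echo "${MSG[@]}"   # prints: I will not take screenshots.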
# Line 504 | Line 526 | valPrint hn "<h3>Legend</h3>"
526   valPrint trh "OK = URL seems to be working."
527   valPrint trh "NG = URL no longer seems to work. You should click each URL marked as NG before attempting to fix it, because false negatives will occur from time to time due to hiccups in the Internet. Please report any persistent false negatives or other issues to Iritscen. An NG link should be followed by a link to the Internet Archive's Wayback Machine which may help you repair the link. If the link cannot be repaired, you can disable it on the wiki (which prevents it from showing up in future ValExtLinks reports) by wrapping it in nowiki tags."
528   valPrint trh "RD = The server responding to this URL is saying that the page moved and you should instead use the supplied new URL. Some RD links represent minor adjustments in the organization of a web site, and some are soft 404s (the file/page has been removed and you are being redirected to something like the main page of the web site). You will have to look at the new URL yourself to determine if it represents an OK link and the link on the wiki should be updated to this one, or if the desired file/page is actually gone and we need to replace the wiki link with an Internet Archive snapshot link -- or disable the URL if it has not been archived."
529 < valPrint trh "IW = URL is working but should be converted to interwiki link using the suggested markup."
529 > valPrint trh "EI = URL is an external link to an internal page and should be converted to an intrawiki link using the suggested markup."
530 > valPrint trh "IW = URL is an external link to a fellow wiki and should be converted to an interwiki link using the suggested markup."
531   valPrint t "(xxx) = Unix tool 'curl' obtained this HTTP response status code (see here for code reference: $HTTP_CODES)."
532   valPrint r "(xxx) = Unix tool 'curl' obtained this HTTP response status code (see {\field{\*\fldinst{HYPERLINK \"$HTTP_CODES\"}}{\fldrslt here}} for code reference)."
533   valPrint h "(xxx) = Unix tool 'curl' obtained this HTTP response status code (see <a href=\"$HTTP_CODES\" target=\"_blank\">here</a> for code reference)."
# Line 512 | Line 535 | valPrint t "(000-xx) = 'curl' did not ge
535   valPrint r "(000-xx) = 'curl' did not get an HTTP response code, but returned this exit code (see {\field{\*\fldinst{HYPERLINK \"$CURL_CODES\"}}{\fldrslt here}} for code reference)."
536   valPrint h "(000-xx) = 'curl' did not get an HTTP response code, but returned this exit code (see <a href=\"$CURL_CODES\" target=\"_blank\">here</a> for code reference)."
537   valPrint trh "IA suggests = Last available snapshot suggested by the Internet Archive."
538 < valPrint trh "Try browsing = The Archive occasionally fails to return a snapshot URL even when one exists, so you will need to check for a snapshot manually using the Wayback Machine before concluding that a site has not been archived."
538 > valPrint trh "Try browsing = The Archive occasionally fails to return a snapshot URL even when one exists, so you will need to check for a snapshot manually using this link to the Wayback Machine before concluding that a site has not been archived."
539   valPrint trh ""
540  
541  
# Line 584 | Line 607 | for LINE in `cat "$LINKS_FILE"`; do
607        continue
608     fi
609  
610 +   # Build longer wiki page URLs from namespace and page names
611 +   FULL_PAGE_PATH=http://$WIKI_PATH/$NS_NAME:$PAGE_NAME
612 +   LOCAL_PAGE_PATH=$NS_NAME:$PAGE_NAME
613 +   # Namespace "Main:" cannot be a part of the path; it's an implicit namespace, and naming it
614 +   # explicitly breaks the link
615 +   if [ $NS_ID -eq 0 ]; then
616 +      FULL_PAGE_PATH=http://$WIKI_PATH/$PAGE_NAME
617 +      LOCAL_PAGE_PATH=$PAGE_NAME
618 +   fi
619 +
620     # The URL being linked to is everything after the previous two fields (this allows commas to be in
621     # the URLs, but a comma in the previous field, the page name, will break this)
622     URL=${LINE#$NS_ID,$PAGE_NAME,}
# Line 600 | Line 633 | for LINE in `cat "$LINKS_FILE"`; do
633     HAS_SUFFIX=0
634  
635     # If the URL ends in something like ".php?foo=bar", strip everything from the '?' onward
636 <   SAN_URL=${URL%%\?*}
636 >   CLEAN_URL=${URL%%\?*}
637  
638     # If the URL ends in something like "#section_15", strip everything from the '#' onward
639 <   SAN_URL=${SAN_URL%%\#*}
639 >   CLEAN_URL=${CLEAN_URL%%\#*}
640  
641     # 'sed' cannot handle Unicode in my Bash shell, so skip this URL and make user check it
642 <   if [[ $SAN_URL == *[![:ascii:]]* ]]; then
642 >   if [[ $CLEAN_URL == *[![:ascii:]]* ]]; then
643        valPrint tr "Skipping URL $URL (found on page $PAGE_NAME) because I cannot handle non-ASCII characters."
644        let SKIP_NON_ASCII+=1
645        continue
646     fi
647  
648     # Isolate the characters after the last period and after the last slash
649 <   POST_DOT=$(echo "$SAN_URL" | sed 's/.*\.//')
650 <   POST_SLASH=$(echo "$SAN_URL" | sed 's/.*\///')
649 >   POST_DOT=$(echo "$CLEAN_URL" | sed 's/.*\.//')
650 >   POST_SLASH=$(echo "$CLEAN_URL" | sed 's/.*\///')
651  
652     # If the last period comes after the last slash, then the URL ends in a suffix
653     POST_DOT_LENGTH=$(echo | awk -v input=$POST_DOT '{print length(input)}')
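
The renamed CLEAN_URL steps are plain parameter expansions plus two 'sed' one-liners; run standalone against a sample URL (made up for illustration), they behave like this:

   #!/bin/bash
   URL="http://www.example.com/downloads/archive.zip?session=42#mirror_2"

   CLEAN_URL=${URL%%\?*}          # strip from the first '?' onward
   CLEAN_URL=${CLEAN_URL%%\#*}    # strip from the first '#' onward
   echo "$CLEAN_URL"              # http://www.example.com/downloads/archive.zip

   POST_DOT=$(echo "$CLEAN_URL" | sed 's/.*\.//')    # zip (text after the last period)
   POST_SLASH=$(echo "$CLEAN_URL" | sed 's/.*\///')  # archive.zip (text after the last slash)
   echo "$POST_DOT $POST_SLASH"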
# Line 692 | Line 725 | for LINE in `cat "$LINKS_FILE"`; do
725        CURL_RESULT="$CURL_RESULT-$CURL_ERR"
726     fi
727  
728 <   # Determine our status code for this URL (IW, OK, RD, or NG)
728 >   # Begin to determine our status code for this URL (EI, IW, OK, RD, or NG)
729     STATUS="??"
730     NEW_URL=""
731     INTERWIKI_INDEX=-1
732 <   # First check if this is a link to a domain that we have an interwiki prefix for
733 <   for ((i = 0; i < ${#INTERWIKI_DOMAINS[@]}; ++i)); do
734 <      if [[ $URL == *${INTERWIKI_DOMAINS[$i]}* ]]; then
735 <         STATUS="IW"
736 <         let IW_LINKS+=1
737 <         INTERWIKI_INDEX=$i
738 <         break
739 <      fi
740 <   done
732 >
733 >   # First make sure that this isn't an "external internal" link to our own wiki that can be replaced
734 >   # by "[[page_name]]". If it uses a special access URL beginning with "/w/", let it pass, as it
735 >   # probably cannot be replaced by "[[ ]]" markup
736 >   if [[ $URL == *$WIKI_PATH* ]] && [[ $URL != *$WIKI_PATH/w/* ]]; then
737 >      STATUS="EI"
738 >      let EI_LINKS+=1
739 >   fi
740 >
741 >   # If it's not, check if this is a link to a domain that we have an interwiki prefix for
742 >   if [ $STATUS == "??" ]; then
743 >      for ((i = 0; i < ${#INTERWIKI_DOMAINS[@]}; ++i)); do
744 >         if [[ $URL == *${INTERWIKI_DOMAINS[$i]}* ]] && [[ $URL != *${INTERWIKI_DOMAINS[$i]}/w/* ]]; then
745 >            STATUS="IW"
746 >            let IW_LINKS+=1
747 >            INTERWIKI_INDEX=$i
748 >            break
749 >         fi
750 >      done
751 >   fi
752  
753     # If we didn't match an interwiki domain, see if the status code is in our "OK" codes list
754     if [ $STATUS == "??" ]; then
# Line 724 | Line 768 | for LINE in `cat "$LINKS_FILE"`; do
768              # Get URL header again in order to retrieve the URL we are being redirected to
769              NEW_URL=$(curl -o /dev/null --silent --insecure --head --user-agent '"$AGENT"' --max-time 10 --write-out '%{redirect_url}\n' $URL)
770  
771 <            # Check if the redirect URL is just the original URL with https:// instead of http://
772 <            # (this happens a lot and is not an important correction to us); if so, just make it "OK"
773 <            URL_NO_PROTOCOL=${URL#*://}
774 <            NEW_URL_NO_PROTOCOL=${NEW_URL#*://}
771 >            # Filter out cases where the redirect URL is just the original URL with https:// instead of
772 >            # http://, or with an added '/' at the end. These corrections happen a lot and are not
773 >            # important to us.
774 >            URL_NO_PROTOCOL=${URL#http://}
775 >            URL_NO_PROTOCOL=${URL_NO_PROTOCOL%/}
776 >            NEW_URL_NO_PROTOCOL=${NEW_URL#https://}
777 >            NEW_URL_NO_PROTOCOL=${NEW_URL_NO_PROTOCOL%/}
778 >
779 >            # Sometimes 'curl' fails to get the redirect_url due to time-out or bad web site config
780 >            NEW_URL_LENGTH=$(echo | awk -v input=$NEW_URL_NO_PROTOCOL '{print length(input)}')
781 >            if [ $NEW_URL_LENGTH -lt $MIN_URL_LENGTH ]; then
782 >               NEW_URL_NO_PROTOCOL="[new URL not retrieved]"
783 >            fi
784 >
785 >            # If the URLs match after the above filters were applied, then the link is OK
786              if [ $URL_NO_PROTOCOL == $NEW_URL_NO_PROTOCOL ]; then
787                 STATUS="OK"
788                 let OK_LINKS+=1
# Line 758 | Line 813 | for LINE in `cat "$LINKS_FILE"`; do
813        continue
814     fi
815  
816 <   # If link is "NG" and there is an exceptions file, compare URL against the list before logging it
817 <   if [ $STATUS == "NG" ] && [ ! -z $EXCEPT_URL ]; then
816 >   # Check problem links against exceptions file before proceeding
817 >   if [ $STATUS != "OK" ] && [ ! -z $EXCEPT_URL ]; then
818 >      # The code we expect to find in the exceptions file is either the 'curl' result or "EI"/"IW"
819 >      EXPECT_CODE="$CURL_RESULT"
820 >      if [ $STATUS == "EI" ]; then
821 >         EXPECT_CODE="EI"
822 >      elif [ $STATUS == "IW" ]; then
823 >         EXPECT_CODE="IW"
824 >      fi
825 >
826 >      # Look for link in exceptions file and make sure its listed result code and wiki page also match
827        GREP_RESULT=$(grep --max-count=1 "$URL" "$EXCEPT_FILE")
828 <      EXCEPT_CODE=${GREP_RESULT%%,*}
829 <      if [ "$EXCEPT_CODE" == $CURL_RESULT ]; then
830 <         valPrint tr "Skipping URL $URL (found on page $PAGE_NAME) because its status code, $CURL_RESULT, is listed in the exceptions file."
831 <         let SKIP_EXCEPT+=1
832 <         continue
828 >      EXCEPT_PAGE=${GREP_RESULT##*,}
829 >      if [ "$EXCEPT_PAGE" == "*" ] || [ "$EXCEPT_PAGE" == $LOCAL_PAGE_PATH ]; then
830 >         EXCEPT_CODE=${GREP_RESULT%%,*}
831 >         if [ "$EXCEPT_CODE" == "$EXPECT_CODE" ]; then
832 >            valPrint tr "Skipping URL $URL (found on page $PAGE_NAME) because its expected result, $EXPECT_CODE, is listed in the exceptions file."
833 >            if [ $STATUS == "EI" ]; then
834 >               let SKIP_EXPECT_EI+=1
835 >            elif [ $STATUS == "IW" ]; then
836 >               let SKIP_EXPECT_IW+=1
837 >            else
838 >               let SKIP_EXPECT_NG+=1
839 >            fi
840 >            continue
841 >         fi
842        fi
843     fi
844  
845     # If appropriate, record this link to the log, with clickable URLs when possible
846     if [ $STATUS != "OK" ] || [ $RECORD_OK_LINKS -eq 1 ]; then
847 <      FULL_PAGE_PATH=http://$WIKI_PATH/$NS_NAME:$PAGE_NAME
848 <      LOCAL_PAGE_PATH=$NS_NAME:$PAGE_NAME
776 <      # Namespace "Main:" cannot be a part of the path; it's an implicit namespace, and naming it explicitly breaks the link
777 <      if [ $NS_ID -eq 0 ]; then
778 <         FULL_PAGE_PATH=http://$WIKI_PATH/$PAGE_NAME
779 <         LOCAL_PAGE_PATH=$PAGE_NAME
780 <      fi
781 <
782 <      # Stupid hack since the text "IW" is narrower than "OK", "RD", or "NG" and it takes an extra tab
783 <      # to get to the desired level of indentation in the RTF log
847 >      # Stupid hack since the strings "IW" and "EI" are narrower than "OK", "RD", or "NG" and it takes
848 >      # an extra tab to get to the desired level of indentation in the RTF log
849        RTF_TABS="        "
850 <      if [ $STATUS == "IW" ]; then
850 >      if [ $STATUS == "IW" ] || [ $STATUS == "EI" ]; then
851           RTF_TABS="             "
852        fi
853        
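
The extraction above (EXCEPT_CODE from before the first comma, EXCEPT_PAGE from after the last comma) implies each exceptions line carries three comma-separated fields: the expected result ("EI", "IW", or a 'curl' result such as 404), the URL, and the wiki page it appears on, with "*" apparently standing for any page. The sample exceptions.txt referenced in the help text is not part of this diff, so these lines are illustrative guesses at the format rather than copies of it:

   404,http://www.example.com/old_page.html,Main_Page
   EI,http://wiki.example.net/Some_Page,*
   IW,https://en.wikipedia.org/wiki/Oni,OBD:Some_Article

And the corresponding field extraction, run standalone:

   #!/bin/bash
   GREP_RESULT="404,http://www.example.com/old_page.html,Main_Page"
   EXCEPT_CODE=${GREP_RESULT%%,*}   # 404       (everything before the first comma)
   EXCEPT_PAGE=${GREP_RESULT##*,}   # Main_Page (everything after the last comma)
   echo "$EXCEPT_CODE $EXCEPT_PAGE"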
# Line 801 | Line 866 | for LINE in `cat "$LINKS_FILE"`; do
866           valPrint hn "<tr><td colspan=\"2\" align=\"right\">Server suggests</td><td><a href=\"$NEW_URL\" target=\"_blank\">$NEW_URL</a></td></tr>"
867        fi
868  
869 +      # Get everything after domain name in URL for use in EI and IW listings
870 +      POST_DOMAIN=${URL#*://*/}
871 +
872 +      # Notify reader if we can use an intrawiki link for this URL
873 +      if [ $STATUS == "EI" ]; then
874 +         valPrint t "  Just use [[$POST_DOMAIN]]"
875 +         valPrint r "           Just use [[$POST_DOMAIN]]"
876 +         valPrint hn "<tr><td colspan=\"2\" align=\"right\">Just use</td><td>[[$POST_DOMAIN]]</td></tr>"
877 +      fi
878 +
879        # Notify reader if we can use an interwiki prefix for this URL
880        if [ $STATUS == "IW" ]; then
881 <         valPrint t "  You can use [[${INTERWIKI_PREFIXES[$INTERWIKI_INDEX]}:$POST_SLASH]]"
882 <         valPrint r "           You can use [[${INTERWIKI_PREFIXES[$INTERWIKI_INDEX]}:$POST_SLASH]]"
883 <         valPrint hn "<tr><td colspan=\"2\" align=\"right\">You can use</td><td>[[${INTERWIKI_PREFIXES[$INTERWIKI_INDEX]}:$POST_SLASH]]</td></tr>"
881 >         valPrint t "  You can use [[${INTERWIKI_PREFIXES[$INTERWIKI_INDEX]}:$POST_DOMAIN]]"
882 >         valPrint r "           You can use [[${INTERWIKI_PREFIXES[$INTERWIKI_INDEX]}:$POST_DOMAIN]]"
883 >         valPrint hn "<tr><td colspan=\"2\" align=\"right\">You can use</td><td>[[${INTERWIKI_PREFIXES[$INTERWIKI_INDEX]}:$POST_DOMAIN]]</td></tr>"
884        fi
885  
886        # Query Internet Archive for latest "OK" snapshot for "NG" page
# Line 835 | Line 910 | for LINE in `cat "$LINKS_FILE"`; do
910  
911        # Don't take screenshot if we already encountered this page and screenshotted it
912        if [ ! -f "$SHOT_FILE" ]; then
913 <         "$CHROME" --headless --disable-gpu --screenshot --window-size=1500,900 $URL > /dev/null 2>&1
913 >         "$CHROME_PATH" --headless --disable-gpu --screenshot --window-size=1500,900 $URL > /dev/null 2>&1
914           if [ -f "$WORKING_DIR/$CHROME_SCREENSHOT" ]; then
915              mv -n "$WORKING_DIR/$CHROME_SCREENSHOT" "$SHOT_FILE"
916           else
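
Because --take-screenshots now accepts any Chrome binary, it can be worth confirming by hand that the chosen copy supports headless capture before a long run; a quick manual test using the same flags the script uses (the Chrome path and URL are placeholders):

   CHROME_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
   cd /tmp
   "$CHROME_PATH" --headless --disable-gpu --screenshot --window-size=1500,900 "http://example.com" > /dev/null 2>&1
   ls -l screenshot.png    # headless Chrome writes screenshot.png into the working directory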

Diff Legend

Removed lines
+ Added lines
< Changed lines (old)
> Changed lines (new)