The internet never forgets! The Internet Archive can be used “as a trusted citation” for future references and is a valuable and powerful research tool – a digital time machine – the memory of the ephemeral internet (an electronic hippocampus).
Explore more than 343 billion web pages saved over time
The following plugin is useful for WordPress users who want to submit their site (and all external links) silently and periodically to the Internet Archive: en-gb.wordpress.org/plugins/post-archival/
“This entry describes the history of the internet archive from its founding in 1996 to its two billion page crawl in 2007. it describes the key individuals and organizations involved in the archive’s work and the technological innovations that make the archive possible, such as the arc file format, heritrix, and the wayback machine. the focus of this entry is primarily the internet archive’s web archiving activities and collections, but it also briefly discusses the archive’s other activities and the impact it has had on the fields of library and information science and the public in general.”
Roberts, J. R., & Drost, C. A.. (2008). Internet Archive. College & Research Libraries News
, 69(5), 286–287.
Show/hide publication abstract
“The article reviews the web site internet archive, available at www.archive.org.”
Rogers, R.. (2017). Doing Web history with the Internet Archive: screencast documentaries. Internet Histories
“Among the conceptual and methodological opportunities afforded by the internet archive, and more specifically, the wayback machine, is the capacity to capture and ‘play back’ the history a web page, most notably a website’s homepage. these playbacks could be construed as ‘website histories’, distinctive at least in principle from other uses put to the internet archive such as ‘digital history’ and ‘internet history’. in the following, common use cases for web archives are put forward in a discussion of digital source criticism. thereafter, i situate website history within traditions in web historiography. the particular approach to website history introduced here is called ‘screencast documentaries’. building upon jon udell’s pioneering screencapturing work retelling the edit history of a wikipedia page, i discuss overarching strategies for narrating screencast documentaries of websites, namely histories of the web as seen through the changes to a single page, media histories as negotiations between new and old media as well as digital histories made from scrutinising changes to the list of priorities at a tone-setting institution such as whitehouse.gov.”
This excellent open-source tool allows parallel search within a large number of documents using regular expressions. It is very useful if you look for specific information withing, say, a large number of PDF, HTML, and/or Word documents. By using (Perl) regular expressions you can use highly specific Boolean search query combinations…
// e1 is a case sensitive Perl regular expression:
// since Perl is the default option there's no need to explicitly specify the syntax used here:
boost::regex e1(my_expression);
// e2 a case insensitive Perl regular expression:
boost::regex e2(my_expression, boost::regex::perl|boost::regex::icase);
“grep” is a command-line utility for searching text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p, which has the same effect: doing a global search with the regular expression and printing all matching lines. Grep was originally developed for the Unix operating system, but later available for all Unix-like systems. More at Wikipedia
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$.
But you can do much more with regular expressions. You could use the regular expression \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b to search for an email address. Any email address, to be exact. A very similar regular expression (replace the first \b with ^ and the last one with $) can be used by a programmer to check whether the user entered a properly formatted email address. In just one line of code, whether that code is written in Perl, PHP, Java, a .NET language, or a multitude of other languages.
a dot matches any character. Searching for t.t will match tat as well as tut.
+
matches the previous expression one or more times, but at least once. Searching for spel+ing will find all words like speling or spelling but not speing since the l must be matched at least once.
*
matches the previous expression zero or more times. Searching for spel*ing will find all words like speling or spelling and also speing since the l can be matched zero times, which means it doesn’t have to be there.
\
the backslash escapes special characters that would otherwise be treated specially. Searching for a double dot in your text with .. would not work since the dot matches any character. To search for a double dot you have to escape the dot chars like this: \.\..
\Q..\E
in case you need to search for a literal string that has a lot of special characters in it, you can use the \Q..\E sequence. Searching for *.* would match everything unless you escape every single char like this: \*\.\*. For such search strings it’s easier to just put them inside the \Q..\E sequence like this: \Q*.*\E.
[]
With square brackets you can specify so called character classes. Such a class matches all chars that are specified between the brackets. Searching for [-+0-9]+ will find any string that contains the chars ‘-‘, ‘+’ and all chars between 0 and 9, but no other chars. It will match -123, +123 or 123, but not testword. There are a few default character classes defined so you don’t have to create one yourself. You can find a list of those classes here. The most used ones are \d which matches all digits, \w which matches all word chars and \s which matches all whitespace chars.
^, $
the caret matches the beginning of a line, and the string char $ matches the end of a line. Searching for ^title$ will only find lines that only consist of the word title, but no places where the word title is inside a line. Searching for ^// will find all lines that start with two slashes, but not lines where two slashes are not at the very beginning of a line. Searching for goodbye\.$ will find lines that end with goodbye., but not if goodbye. is somewhere inside a line.
\b
\b matches word boundaries. Searching for \bword\b finds word, but not subwords or words.
()
parenthesis pairs define a group. Grouping is useful for more advanced regex searching, but also for use when replacing text. Each group that matches part of the full matching string can be referenced later in the replace string.
|
The | char is used as an OR operator. Searching for cat|dog will match either cat or dog. Note that the OR operator uses everything left and right of the operator. If you want to limit the reach of the operator, you have to use brackets to group them. Searching for (cat|dog)food finds catfood and dogfood.