WHAT PEOPLE WRITE AS QUERIES IN THE WEB'S SEARCH ENGINES: A New Tool to Investigate the Search Behavior

Jacques Lajoie

Département de Psychologie
Université du Québec à Montréal
Montréal, Québec, Canada H3C 3P8
E-mail: lajoie.jacques@uqam.ca
http://www.psycho.uqam.ca

The central pivot of the Internet revolution may be the search engine found on the World Wide Web. It is a fascinating instrument that allows fast exploration and discovery. The search engine can also be seen as a mirror of the human soul. The availability of powerful search engines on the World Wide Web, and the impression of relative privacy in their use, may reveal behaviors and attitudes normally difficult to observe in human beings. That is, a person writing a query in a search engine may manifest a deeply personal interest not usually shared even with friends or a psychotherapist. To verify this hypothesis, a computer text analysis was done on tens of thousands of queries coming from search engines. One part of the data (about 100,000 words) was kindly provided by Louis Monier, who created the Alta Vista search engine. Another part (about 35,000 words) was extracted from a public on-line window giving access to WebCrawler search engine queries. Both databases were obtained in November 1997. A frequency analysis showed surprising results. The most popular words searched on the two search engines were few and very similar. They concerned the Internet, audio-visual and computer technology. As expected, many of the most commonly searched words were of a sexual nature (15% of the 200 most popular words, but only 3.6% of all words). Other queries concerned the availability of free software, both legal and illegal (wares). Queries of a violent nature were almost nonexistent. Finally, most of the words in queries were unique (15%), double (10%) or of low frequency: frequencies from 1 to 9 amounted to more than 50% of the total. These findings revealed a diversity of needs rarely found in computer text analysis research on authors' works. This gives a picture of Internet behavior as being totally different from traditional television viewing behavior, which is more conformist.

It was clear from these findings that the Internet revolution is deeply affecting the attitudes and behaviors of the persons using it. It may significantly increase the contribution of exploration and personal initiative in learning, socialization and affective development. Analysis of search engine queries revealed not only the richness of human curiosity and the exploration of knowledge but also the wide range of the affective needs of the individual. It is a new and fascinating type of data, of interest to psychologists, sociologists and all information scientists, and it gives strong arguments to those who present the Internet revolution as being useful and healthy.

1. INTRODUCTION

The Internet revolution constitutes a cultural event without precedent in the history of humanity. The development of gigantic computer systems easily accessible from microcomputers has created technological applications in all sectors of human activity (Khan, 1997; Maddux et al., 1997): communication without barriers of distance between persons anywhere on earth, commercial transactions of all kinds, access to artistic creations, access to learning in all areas of knowledge, and others. The main contribution of these networks is to give everyone able to use a microcomputer access to a huge quantity and an almost unlimited diversity of information sources. Far more than with traditional media such as television, it is the user who takes the initiative to explore and choose information according to his/her needs.

As Starr (1997) rightly noted about distance teaching, the World Wide Web possesses particularities that make it an outstanding educational tool. These include:

• Hyper-text and hyper-links give the control to the user when acquiring information,

• the multimedia presence allows the enhancement of the text by image, film and interactive demonstrations,

• the interactivity between the user and the web site server creates a direct communication between the author and the reader,

• the hyper-text programming language common to all Web sites renders it accessible to all users independent of platform and/or operating system (for example Windows or Macintosh), and

• the expansion and the improvement of the site can be made easily and rapidly without compelling the user to obtain an upgrade (Starr, 1997).

We may anticipate more important repercussions than those that accompanied the appearance of other technological novelties such as the telephone or the television. These will affect individuals as much in their work and leisure as in their family life, and in their cognitive, affective and social development. It is crucial that researchers and decision makers in the human sciences and education realize the importance of the transformations brought by the communications revolution (Riveted et al., 1997), especially to solve problems raised by the adaptation of the child, the family and the school to these changes (Allen and Thompson, 1995).

2. SOCIAL COMMUNICATION AND THE FAMILY

The microcomputer appeared less than twenty years ago and was at first used mainly as a programming tool or for word and data processing. After a while, word and data processing became more popular than programming. Now, the microcomputer is increasingly used as a tool facilitating social communication (Danet, 1995; Lajoie et al., 1997; Parks and Floyd, 1995; Parks, 1997; Walther et al., 1994) and the exploration of information and knowledge. According to Seymour Papert (1996), the computer is increasingly used in the family home. Indeed, more and more homes in the United States and Canada are connecting to networks. Some consequences are to be expected because the more gratifying learning by exploration is moving from the school to the home. The quality of family relationships may improve because the computer is used by the whole family and children can help their parents solve technical problems. However, this could worsen children's perception of school. Indeed, the Internet may lessen a child's dependence on adults (parents and/or teachers) for information or knowledge. It may also reinforce the conditions that increase the initiative to explore and the pleasure of discovery (Papert, 1996).

Here is how Papert explains this important phenomenon. At home, the curiosity of a small child provokes the exploration of his/her familiar surroundings. Soon, this small universe is explored extensively. The child looks, listens, and touches. The child's learning is essentially non-verbal and experiential: "home-style learning" rather than "school-style learning" or "learning by being told". The child gradually realizes that there exists a universe far larger than the house, one that cannot be reached directly by exploration. How then is this child to find answers to his/her questions? According to Papert, there are essentially three ways to cope with this problem: think of an answer, ask someone, or just wait in the hope that one day someone, in person or on television, will give the answer. The longer-term consequence is less spontaneity: a loss of the pleasure of doing it oneself, more dependence on other persons, and thus more frustration and anger (Papert, 1996, pp. 2-3). The child uses everything within arm's reach to learn. Now, he/she has or will soon have access to a home computer. Encyclopedias on compact discs and the use of the Internet considerably increase a child's freedom of choice in searching for answers to his/her questions. Papert's visionary desire to give children access to the "knowledge machine" (Papert, 1993) seems to be fulfilled more rapidly than anticipated.

OBSERVE THE USE OF THE SEARCH ENGINE ON THE NET/WEB

The central pivot of this Internet revolution is the search engine found on the World Wide Web. Without the search engine, it would be impossible to navigate the Web unless one knew one's destination. It is the main instrument of exploration (though still imperfect) of the mass of continuously changing knowledge that resides on the Web, and the only way to discover the exact information sought by adequately sorting the relevant elements from the poor or useless ones. The search engine permits one to find the desired Web sites by using chosen keywords. Search engines on the Web have proliferated fast in the last few years. Some of the more popular ones include WebCrawler, Lycos, Yahoo (1994), Alta Vista, Excite, InfoSeek (1995), Hotbot (1996), and NorthernLight (1997).

3. TECHNICAL DESCRIPTIONS AND CLASSIFICATION OF SEARCH ENGINES

Technical descriptions and classifications of search engines can be found in many sources (Cheong, 1996; Maze et al., 1997; Sonnenreich et al., 1998; see also the Web site of Sullivan, 1997). Let us mention only that search engines function automatically (unlike the Yahoo catalog, which needs constant human intervention to record or remove sites) and consist of three distinct programs:

• The first explores millions of Web sites per day (20 million at Alta Vista) and extracts some of their content,

• the second reads, records and classifies their textual content in the memory of a very powerful computer (100 million pages indexed at Alta Vista), and

• the third performs on-line searches directly on this continually renewed database.
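
The division of labor between the second and third programs can be illustrated with a toy example. The following Python sketch is only an illustration under stated assumptions: the page addresses and texts are invented, the crawler is replaced by a fixed dictionary of pages, and the function names build_index and search are hypothetical. It does not describe how Alta Vista or any other real search engine is implemented.

```python
from collections import defaultdict

# Hypothetical stand-in for the pages fetched by the first (exploring) program.
PAGES = {
    "http://example.org/a": "free software for the Macintosh",
    "http://example.org/b": "search engines on the World Wide Web",
    "http://example.org/c": "free search tools for Windows",
}

def build_index(pages):
    """Second program: record, for each word, the pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Third program: return the pages containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(PAGES)
print(sorted(search(INDEX, "free search")))
```

The index is built once and then queried many times, which is why the third program can answer on-line without revisiting the Web sites themselves.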

This instrument will become, in the next decade, at least in North America and in Europe, as common as the telephone or the television. Already, millions of persons use search engines daily. Observation of children and adolescents who have been using the Internet at home for a while shows that this behavior may be totally integrated into their daily activities (Lajoie, 1998).

Instead of describing what can be found on the Internet (we are already flooded with information on this subject), we will observe what people are looking for. The search engine can then be seen as a mirror of the human mind and soul. Rather than using traditional approaches such as polls and interviews to answer this question, we propose to examine directly what people write in search engines. The availability of powerful search engines on the World Wide Web, and the impression of relative privacy in their use, may reveal behaviors and attitudes normally difficult to observe in human beings. That is, a person writing a query in a search engine may manifest a deeply personal interest not usually shared even with friends or a psychotherapist.

4. ANALYSIS OF QUERIES

There are many questions about queries made on the Internet, such as:

• What are the most popular queries?

• Are queries similar in different languages?

• Is the Web built adequately to answer the queries?

• Do the queries reflect the variety of the human imagination?

• Do the queries reflect the culture of the dominant media? For example, what is the real popularity of violence and pornography on the Internet?

It is difficult to obtain answers to these questions by asking users, because respondents will often answer in a way that pleases the researcher. Direct observation of behavior may be used instead.

Many users undertake their explorations in a setting that ensures their privacy (Lee, 1996). This can be at home, at work, at school and even in cafes connected to the Web. According to a poll from RISQ (December 1997), a majority of users (at least in Quebec) prefer anonymity when they undertake a search on the Web.

4.1. ANALYSIS OF QUERIES

The present investigation had access to queries without any information on their origin, thus preserving the anonymity of the users.

Most of the data presented here were queries from the Alta Vista search engine, kindly provided by Louis Monier, Technical Director of Alta Vista and the inventor of the Alta Vista search engine. The list consisted of 34,461 queries with a total of 92,962 words. The Alta Vista queries were extracted on November 6, 1997. The data were also compared to queries extracted on-line from the WebCrawler search engine. The WebCrawler list consisted of 15,862 queries with a total of 32,793 words and was extracted on November 27, 1997 (between 11:22 am and 11:23 am). The extraction window was (and still is) available at the URL:

http://webcrawler.com/cgi.bin/SearchTicker. Table 1 presents a list of 90 queries from WebCrawler without any modification of their order or content. Typographic and spelling errors were not corrected. During the extraction, we submitted a query (---toto---) in another window of the Netscape browser to check the validity of the observation window. We were pleased to see our query appear half a second later in the observation window.

This sample is representative of the WebCrawler database. WebCrawler is one of the oldest search engines (1994) and was the official search engine of America Online in 1995 and 1996. It was bought by Excite in November 1996 and is maintained as an independent search engine. Queries are of various types and show a high level of precision. The diversity of queries is so large that it is impossible to discern tendencies from such a small sample. A detailed frequency analysis of these WebCrawler queries was made by Lajoie (1998) using the Concorder computer text analysis program (Rand, 1997). Here, the results of a similar analysis made on the larger Alta Vista database will be presented.

Table 2 shows the 200 most frequent significant words used in queries, classified in a few large categories. The list is very similar to that obtained from the WebCrawler searches. The most popular words concerned the Internet, audio-visual and computer technology. Many of them (15% of the 200 most popular words, but 3.6% of all words) were of a sexual nature. In fact, the most popular word is SEX, with a frequency of 948. However, it is important to mention that its proportion of all 98,962 words was less than 1%. Frequent queries concerned the availability of free software, legal and illegal (wares). Many words were related to locations, such as names of countries and country abbreviations found in Web addresses, like DK, UK and SE. Surprisingly, none of the 200 most frequent words concerned violence, while 20 of them were of a sexual nature, with a total of 3392 occurrences (see Table 2). Some of these words, like TEEN or ADULT, were not sexual in themselves, but the contexts of the queries containing them were. Violence-related words were searched for through the whole database and about 300 of them were found. Here are the most frequently used words related to violence: GUN (38), RAPE (30), EXPLO(SION)/(SIVE) (24), DEATH (23), KILL (23), SLAVE (22), DRAGON (20), VIOL(ENCE) (20), GANG (19), WAR (19), and BLOOD (15). It is not surprising, at least for a psychologist, that sex pictures are popular on the Internet. This reflects the natural curiosity of people looking through the back door in a neo-Victorian culture that has imposed strong standards of political correctness. Freud would have been delighted. However, the violence data are very surprising. This is an important finding that goes against the popular perception of American culture as liking violence. The WebCrawler data did not show any violence either.
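
The kind of word frequency count behind such a table can be sketched in a few lines of Python. This is only a minimal illustration and not the Concorder program used in the study: the sample queries are invented, the function names word_frequencies and category_share are hypothetical, and, unlike Table 2, the sketch does not filter out non-significant words.

```python
from collections import Counter

# Invented sample queries, for illustration only (not taken from the data).
SAMPLE_QUERIES = [
    "free software",
    "free games download",
    "weather montreal",
    "software download",
]

def word_frequencies(queries):
    """Count how often each word occurs across all queries (case-insensitive)."""
    counts = Counter()
    for query in queries:
        counts.update(query.upper().split())
    return counts

def category_share(counts, category_words):
    """Return the fraction of all word occurrences that belong to a category."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[word] for word in category_words) / total

COUNTS = word_frequencies(SAMPLE_QUERIES)
SHARE = category_share(COUNTS, {"FREE", "DOWNLOAD"})
print(COUNTS.most_common(3), round(SHARE, 3))
```

The same two steps, counting every word and then summing the counts of a chosen category, are what allow a statement such as "20 words account for a given share of all words" to be checked against the full list.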

The fact that some words are more popular than others is an interesting result, especially for webmasters who want to analyze the commercial market. However, Table 2 shows that the 200 most popular words accounted for less than 25% of all the words. A rank analysis was made of the distribution of the 11,906 frequencies, and the truncated results are presented in Table 3. The table ranks the frequencies from the largest (954: SEX) down to the lowest frequency of 1. It shows that most of the words in queries were unique (15%), double (10%) or of low frequency. Frequencies from 1 to 9 amounted to more than 50% of the total. These results were also unexpected. Computer text analysis has usually been done on authors' works (novels and poetry), where the variety of words is relatively small (see Stubbs, 1996). Our findings revealed a huge diversity of queries and of words. Louis Monier from Alta Vista confirmed that even with large monthly databases, a surprising proportion of about 30% of the queries are distinct. One reason may be the diversity of languages found in queries. Another may be the variety of people and of their motives for using a search engine on the Internet.
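
A rank analysis of this kind can be approximated by counting how many distinct words occur once, twice, and so on. The Python sketch below is an assumption about the general technique, not a reconstruction of Table 3: the word counts are invented, the function names are hypothetical, and because the text does not state whether the percentages refer to distinct words or to all word occurrences, the sketch reports both.

```python
from collections import Counter

# Invented word counts, for illustration only (not taken from the data).
WORD_COUNTS = Counter({"FREE": 5, "SOFTWARE": 3, "MUSIC": 2, "TOTO": 1, "UQAM": 1})

def frequency_spectrum(counts):
    """Map each frequency f to the number of distinct words occurring f times."""
    return Counter(counts.values())

def low_frequency_share(counts, max_freq):
    """Return two shares for words whose frequency is at most max_freq:
    their share of the distinct words, and their share of all occurrences."""
    n_distinct = len(counts)
    n_occurrences = sum(counts.values())
    if n_distinct == 0:
        return 0.0, 0.0
    low = [f for f in counts.values() if f <= max_freq]
    return len(low) / n_distinct, sum(low) / n_occurrences

SPECTRUM = frequency_spectrum(WORD_COUNTS)
UNIQUE_SHARES = low_frequency_share(WORD_COUNTS, 1)
print(dict(SPECTRUM), UNIQUE_SHARES)
```

In this invented example the words used only once are 40% of the distinct words but a much smaller part of all occurrences, which shows why the two ways of computing the percentage should be kept apart.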

5. CONCLUSIONS

Two conclusions emerge clearly from this analysis of the queries. The first is that although "sexual words" are among the most popular words, they accounted for only a small portion of the total queries. The second is that the principal characteristic of the queries is their diversity and originality. This gives a picture of Internet behavior as being totally different from traditional television viewing behavior, which is more conformist. In fact, studies of television viewing behavior draw their objectives principally from advertisers, who check the popularity (the ratings) of programs to ensure the effectiveness of advertising. Internet behavior reflects more cooperative and community values. Further analysis needs to be done to verify these hypotheses.

It was clear from these findings that the Internet revolution is deeply affecting the attitudes and behaviors of the persons using it. It may significantly increase the contribution of exploration and personal initiative in learning, socialization and affective development. Analysis of search engine queries revealed not only the richness of human curiosity and the exploration of knowledge but also the wide range of the most intimate affective needs of the individual. It is a new and fascinating type of data, of interest to psychologists, sociologists and all information scientists, and it may help to elucidate the nature of the intimate needs of human beings. It also gives strong arguments to those who present the Internet revolution as being useful and healthy.



REFERENCES

Allen, G. and Thompson, A. (1995). Analysis of the effect of networking on computer-assisted collaborative writing in a fifth grade classroom. Journal of Educational Computing Research, 12: 65-75.

Cheong, F.-C. (1996). Internet Agents: Spiders, Wanderers, Brokers and Bots. Indianapolis, IN: New Riders Publishing.

Danet, B. (1995). Play and performance in computer-mediated communications: General introduction. Journal of Computer Mediated Communication, 1 (2): Electronic file. (http://jcmc.huji.ac.il/vol1/issue2/genintro.html).

Khan, B.H., ed. (1997). Web-based Instruction. Englewood Cliffs, NJ: Educational Technology Publications.

Lajoie, J. (1998). Les moteurs de recherche du réseau Internet comme indicateurs des besoins intimes. Submitted for publication.

Lajoie, J., Archambault, M.-A., Légaré, C. and Trudeau, J.-F. (1997). Psychologie et Internet: Communication électronique et développement de liens sociaux. Affiche présentée au congrès de l'ACFAS.

Lee, G. B. (1996). Addressing anonymous messages in cyberspace. Journal of Computer Mediated Communication, 2 (1): Electronic file. (http://jcmc.huji.ac.il/vol2/issue1/anon.html)

Maddux, C. D., Johnson, D. L., & Willis, J. W. (1997). Educational Computing: Learning with Tomorrow's Technologies. 2nd. ed. Boston: Allyn and Bacon.

Maze, S., Moxley, D. & Smith, D.J. (1997). The Neal-Schuman Authoritative Guide to Web Search Engines. New York: Neal Schuman Publishers.

Papert, S. (1993). The Children's Machine: Rethinking School in the Age of the Computer. Basic Books.

Papert, S. (1996). The Connected Family: Bridging the Digital Generation Gap. Atlanta, GA: Longstreet Press.

Parks, M. R. (1997). Communication networks and relationship life cycles. In S. Duck ed., Handbook of Personal Relationships. 2nd ed. Chichester, UK: John Wiley & Sons. pp. 351-372.

Parks, M. R. & Floyd, K. (1995). Making friends in cyberspace. Journal of Computer Mediated Communication, 1 (4): Electronic file. (http://jcmc.huji.ac.il/vol1/issue4/parks.html).

Rand, D. (1997). Concorder: Concordance software for the Macintosh. Montréal: Les Publications CRM. See also the web site: http://www.crm.umontreal.ca/~rand/Concorder.html

RISQ: Réseau Interordinateurs Scientifique Québécois. http://www.risq.net/enquete/4/

Sonnenreich, W. & MacInta, T. (1998). Web Developer .Com Guide to Search Engines. New York: Wiley.

Starr, R. M. (1997). Delivering instruction on the World Wide Web: Overview and basic design principles. Educational Technology, 37 (3): 7-15.

Stubbs, M. (1996). Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture. Lake Oswego, OR: Blackwell Publishers.

Sullivan, D. (1997). http://searchenginewatch.com/

Walther, J. B., Anderson, J. F. & Park, D. W. (1994). Interpersonal effects in computer mediated interaction, a meta-analysis of social and antisocial communication. Communication Research, 21 (4): 460-487.