HathiTrust Research Center

What is the HathiTrust Research Center?

The HathiTrust Research Center (HTRC) is the research arm of the HathiTrust Digital Library. Through the HTRC, you can perform "non-consumptive" research on the public domain materials in the HathiTrust Digital Library, that is, computational analysis of the texts rather than reading or displaying them in full.

Using the HTRC, you can:

  • browse and make worksets (datasets),
  • run a variety of algorithms (topic modeling, word clouds, classification, etc.) on your worksets from within the portal,
  • explore temporal trends of the corpus,
  • and more

You can use the tabs along the top of this guide to learn more about these functions.

Creating an account

To use the HTRC, you will need to create an account.  The HTRC does not use the same login information as the HathiTrust Digital Library.

Click on "sign up" in the top right-hand corner of the HTRC home page, and enter your information.

screen shot of HTRC home page with login

Worksets are collections of digitized volumes and their metadata. With a workset, you can:

  • Use faceted search to build a manageable corpus of materials
  • Organize materials of interest into one place
  • Delimit the scope of your analysis
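
To make the idea of a workset concrete, here is a minimal sketch in Python that treats a workset as a list of volume identifiers selected from a catalog by facet values. Everything in it is an assumption for illustration: the `Volume` record, the `build_workset` function, and the placeholder IDs are hypothetical and do not reflect how the HTRC portal stores or builds worksets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """Hypothetical, simplified metadata record for one digitized volume."""
    volume_id: str
    title: str
    author: str
    year: int
    language: str


def build_workset(catalog, **facets):
    """Return the IDs of volumes whose metadata matches every given facet."""
    return [
        volume.volume_id
        for volume in catalog
        if all(getattr(volume, name) == value for name, value in facets.items())
    ]


# Made-up catalog entries; the IDs are placeholders, not real HathiTrust IDs.
catalog = [
    Volume("vol-001", "Little Dorrit", "Charles Dickens", 1857, "eng"),
    Volume("vol-002", "Bleak House", "Charles Dickens", 1853, "eng"),
    Volume("vol-003", "Les Misérables", "Victor Hugo", 1862, "fre"),
]

# Narrow the catalog to a manageable corpus using two facets.
dickens_workset = build_workset(catalog, author="Charles Dickens", language="eng")
print(dickens_workset)
```

The resulting list of IDs is what delimits the scope of an analysis: any algorithm run "on the workset" only sees those volumes.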

 

Creating a workset in the workset builder

The workset builder is a search interface where you can assemble a collection of texts to run algorithms on.

From the home page of the HTRC portal, click on "create a workset." You will be asked to log in again.

HTRC home page showing where to create a workset

Searching for and selecting items for your workset

  1. Search: The workset builder features an interface somewhat like a library catalogue.  Search using your preferred criteria to create a custom workset.  The results page features a number of faceted limits where you can further refine your results.

    Tip: Because the metadata in the HathiTrust collection is not always accurate, searching by full text is recommended.
     
  2. Select: Tick the boxes beside the items you want, select all items on the page, or select all items retrieved by the search.
    • Click on "selected items" in the top right-hand corner of the screen.
    • The resulting screen should show a list of your selected items.
    • Click on "Create/Update Workset."
    • You can continue adding items to, or removing items from, your workset as you search.

Screenshot of HTRC workset builder showing features

Your new workset is ready for analysis.  To learn how to run the HTRC's built-in algorithms against your workset, click on the "algorithms" tab at the top of this guide.

 

Using HTRC's built-in algorithms

Once you have created a workset, you must return to the main HTRC portal to run analyses of your worksets.

By clicking on the "algorithms" link on the top of the main HTRC page, you will be directed to a list of currently available algorithms, with a description of each algorithm's function.  

Click on the name of your chosen algorithm to execute it. You will be prompted to choose the workset you want to analyze and to name your job. Once you click on "submit," the algorithm will begin to run. You will be taken to the job staging screen, where you can track the progress of your job. Most jobs will take some time to complete.

Once your job is complete, it will move to the "completed jobs" section. Click on the blue link to view the results.

HTRC job results

This image shows topic modeling results on the novel Little Dorrit, by Charles Dickens.  "Topics" are defined as groups of words that are more likely to occur in the same vicinity within a given workset.
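
The intuition behind "words that occur in the same vicinity" can be sketched in a few lines of Python. This is not topic modeling itself and is not the algorithm the HTRC runs; it only counts how often pairs of words appear close together, which is the kind of signal a topic model builds on. The function name, the window size, and the sample sentence are all assumptions made up for this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=3):
    """Count unordered pairs of different words that fall within `window` tokens."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Compare each token with the tokens that follow it inside the window.
        for right in tokens[i + 1:i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


# A made-up token sequence, not a quotation from any novel.
tokens = "prison debt prison father debt marshalsea prison".split()
counts = cooccurrence_counts(tokens)

for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs with high counts, such as "debt" and "prison" here, are candidates for belonging to the same topic.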

 

HathiTrust Digital Library Bookworm

Bookworm allows for the visual exploration of lexical trends.  You can search for word(s) within the public domain component of the HathiTrust Digital Library, and display the results on a graph.

bookworm image
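
As a rough illustration of what a lexical trend is, the sketch below computes the share of all words in each year that match a search term, which is the kind of value a trend graph plots over time. It is a simplified assumption about the calculation, not Bookworm's actual implementation; the `word_trend` function and the tiny sample corpus are invented for the example.

```python
from collections import defaultdict


def word_trend(documents, word):
    """Return {year: fraction of that year's tokens equal to `word`}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in documents:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    # Years with no tokens are skipped to avoid dividing by zero.
    return {
        year: hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Made-up (year, text) pairs standing in for dated volumes.
documents = [
    (1850, "the railway opened and the railway grew"),
    (1850, "a quiet year"),
    (1900, "railway and telephone lines"),
]

trend = word_trend(documents, "railway")
print(trend)
```

Dividing by the total number of words per year matters because the number of digitized volumes varies from year to year; raw counts alone would mostly reflect the size of the collection.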

Advanced features

Users with more technical skills can use the HTRC in other ways:

  • Feature extraction dataset: When the HTRC's built-in algorithms do not suffice, a non-consumptive dataset of "features" is available for download. These features include part-of-speech tagged token counts, line and sentence counts, beginning- and end-of-line character counts, header/footer identification, and hyphenation rejoining.
  • HTRC Data Capsule: In the data capsule, you can create a secure virtual machine to perform custom analyses of worksets. There are restrictions on the types of research that can be carried out in the capsule, to ensure that results are "non-consumptive."
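
To show what page-level "features" look like, here is a minimal Python sketch that reduces a page of text to counts: tokens, lines, and the characters that begin and end each line. It is an illustration under stated assumptions, not the HTRC's extraction code: the `page_features` function and its field names are hypothetical and do not match the actual dataset's schema, and part-of-speech tagging is left out because it requires a separate tagger.

```python
import re
from collections import Counter


def page_features(page_text):
    """Reduce one page of text to counts (hypothetical field names)."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    tokens = re.findall(r"[a-z']+", page_text.lower())
    return {
        "line_count": len(lines),
        "token_counts": Counter(tokens),
        # First and last character of every non-empty line.
        "begin_line_chars": Counter(line[0] for line in lines),
        "end_line_chars": Counter(line[-1] for line in lines),
    }


page = "It was the best of times,\nit was the worst of times."
features = page_features(page)

print(features["line_count"])
print(features["token_counts"].most_common(3))
```

Because only counts are kept, the word order of the page is lost and the original text cannot be read back from the features, which is what makes this kind of dataset non-consumptive.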