Digital Scholarship Center: Acrobat Pro DC: Optical Character Recognition (OCR) for Research, Learning, and Accessibility

Information about the resources in the Digital Scholarship Center

Acrobat Pro DC: Optical Character Recognition (OCR) for Research, Learning, and Accessibility

This video tutorial, by the Rowan University Libraries Digital Scholarship Center, introduces users to the OCR (Optical Character Recognition) feature in Acrobat Pro DC. The software can be used to quickly convert scanned documents to text, increasing the speed of research and learning, and making the files more ADA accessible for those using screen readers.

Closed captions are available through the video's closed-caption button. The transcript appears below.

Transcript:

00:00:04.560 --> 00:00:08.240
Hello my name is Mike Benson
and I'm the Coordinator of the Rowan

00:00:08.240 --> 00:00:13.280
University Libraries
Digital Scholarship Center. In this video

00:00:13.280 --> 00:00:18.320
tutorial we will learn how to use
Acrobat Pro to create a searchable PDF

00:00:18.320 --> 00:00:22.800
using Acrobat's built-in optical
character recognition,

00:00:22.800 --> 00:00:28.800
also known as OCR. OCR
is an amazingly useful feature which can

00:00:28.800 --> 00:00:33.520
convert text in a photograph
or a scanned document in order to make

00:00:33.520 --> 00:00:37.520
the text
searchable. In Acrobat Pro you can also

00:00:37.520 --> 00:00:42.079
edit the text. Taking the extra minute or
two to create

00:00:42.079 --> 00:00:46.559
OCR PDFs can help to increase the speed
of research

00:00:46.559 --> 00:00:51.120
and learning. It should be used for all
PDFs

00:00:51.120 --> 00:00:54.239
used for online or blended learning
courses

00:00:54.239 --> 00:00:59.120
especially since OCR helps make files
more ADA accessible

00:00:59.120 --> 00:01:03.120
for those using a screen reader. I should
mention

00:01:03.120 --> 00:01:10.240
that you can only create OCR PDFs
using the Acrobat Pro DC software.

00:01:10.240 --> 00:01:13.680
You can't use Acrobat Reader to create
these files

00:01:13.680 --> 00:01:17.600
but of course you can open the files and
search for text within

00:01:17.600 --> 00:01:21.680
the Acrobat Reader software. At Rowan
University

00:01:21.680 --> 00:01:27.439
Acrobat Pro DC is available at no
charge to students and staff through

00:01:27.439 --> 00:01:32.320
Rowan University's
Virtual Desktop service. To learn more

00:01:32.320 --> 00:01:38.560
search Rowan University Virtual Desktop
or watch our tutorial using the Rowan

00:01:38.560 --> 00:01:41.840
Virtual Desktop service.

00:01:43.920 --> 00:01:47.680
As an example of OCR we have this
historical document

00:01:47.680 --> 00:01:51.600
related to the Glassboro Summit.

00:01:51.600 --> 00:01:55.439
Within the document we can search to see if the word Glassboro

00:01:55.439 --> 00:01:58.960
was used. On a Mac I'm selecting the
Command key

00:01:58.960 --> 00:02:03.200
and the letter F to open the Find
tool.

00:02:03.200 --> 00:02:08.399
On Windows you select the Control key
and the letter F.

00:02:08.399 --> 00:02:14.879
As we can see the word Glassboro appears in several locations.

00:02:19.680 --> 00:02:24.879
Using OCR and the search or find tool
can help us quickly locate relevant

00:02:24.879 --> 00:02:29.040
information in the document.
This can help speed up your research and

00:02:29.040 --> 00:02:33.440
also locate
previously unknown information.

00:02:33.440 --> 00:02:36.640
Though converting scanned documents
using OCR

00:02:36.640 --> 00:02:42.640
is very helpful, it is not foolproof.
The software doesn't always recognize

00:02:42.640 --> 00:02:45.680
all the text
so keep that in mind when using this

00:02:45.680 --> 00:02:49.680
feature.
Just remember the cleaner the scan or

00:02:49.680 --> 00:02:53.440
the cleaner the text,
the easier it is for the software to

00:02:53.440 --> 00:02:58.319
accurately convert the text
using OCR.
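
Because OCR can misread characters, an exact search may miss some occurrences of a word. The following is a minimal Python sketch, not part of Acrobat, of one way to look for near matches in text copied out of an OCR file. The function name `find_near_matches`, the 0.8 cutoff, and the sample misreading "Glassb0ro" are all illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def find_near_matches(text, term, cutoff=0.8):
    """Return words in `text` whose similarity to `term` is at least `cutoff`.

    Hypothetical helper for illustration only: it compares each word to the
    search term with difflib's SequenceMatcher, ignoring case.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    target = term.lower()
    matches = []
    for word in words:
        score = SequenceMatcher(None, target, word.lower()).ratio()
        if score >= cutoff:
            matches.append(word)
    return matches


if __name__ == "__main__":
    # Invented sample text with one OCR-style misreading (0 instead of o).
    sample = "Survey of Glassboro glass works. The Glassb0ro factory."
    print(find_near_matches(sample, "Glassboro"))
```

A lower cutoff finds more misread words but also more false matches, so the value would need tuning for a real document.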

00:02:58.400 --> 00:03:02.159
To show you the steps in creating an OCR
document

00:03:02.159 --> 00:03:07.519
we'll use this scanned historical survey
of Southern New Jersey's glass making

00:03:07.519 --> 00:03:11.599
factories,
which is a little over 500 pages. This

00:03:11.599 --> 00:03:15.280
document includes
text that reads left to right, and

00:03:15.280 --> 00:03:18.879
some pages are rotated 90 degrees, which
Acrobat

00:03:18.879 --> 00:03:24.239
will process. The document also includes
photographs,

00:03:24.239 --> 00:03:29.440
maps, diagrams, and tables of text.

00:03:30.879 --> 00:03:34.400
Some of the text is very small and in
some areas

00:03:34.400 --> 00:03:39.599
the OCR software won't recognize
and convert the text. But for the most

00:03:39.599 --> 00:03:45.440
part it will do a good job.
To convert the document to OCR we're

00:03:45.440 --> 00:03:49.599
going to
select the Scan and OCR button located

00:03:49.599 --> 00:03:53.680
on the right
within the Tools Pane.

00:03:53.680 --> 00:03:58.239
If you don't see the Tools Pane go to
the Top Menu

00:03:58.239 --> 00:04:05.439
and select View. From here select
Show/Hide and then Tools Pane.

00:04:05.439 --> 00:04:09.280
You can also access the Scan and OCR
tool

00:04:09.280 --> 00:04:17.359
by going to the View Menu then Tools
then Scan and OCR, and open.

00:04:17.359 --> 00:04:23.680
Once the Scan and OCR tools open,
the toolbar near the top will change

00:04:23.680 --> 00:04:29.120
providing access to additional
Scan and OCR options. Select

00:04:29.120 --> 00:04:35.600
Recognize Text and then In This File.
Confirm that you want to convert all

00:04:35.600 --> 00:04:40.240
pages and be sure to select the correct
language for the document.

00:04:40.240 --> 00:04:43.680
For example, if the document is in US
English,

00:04:43.680 --> 00:04:47.680
select that option. If the document is in
Japanese,

00:04:47.680 --> 00:04:53.280
select that option. The software does not
translate to other languages. It will

00:04:53.280 --> 00:04:59.840
only convert the existing language.
We also recommend selecting the Settings

00:04:59.840 --> 00:05:03.680
option, where you can downsample the resolution.

00:05:03.680 --> 00:05:06.960
For example, if the file will be shared
online

00:05:06.960 --> 00:05:10.800
you might consider downsampling it to
150

00:05:10.800 --> 00:05:16.080
or 72 dpi. Of course,
before uploading the document check the

00:05:16.080 --> 00:05:21.120
image quality.
For online courses I recommend 600 or

00:05:21.120 --> 00:05:26.400
300 dpi. Once you have selected your
options

00:05:26.400 --> 00:05:32.639
press the Recognize Text button. This
document is about 519 pages

00:05:32.639 --> 00:05:36.080
plus it was scanned at a fairly high
resolution

00:05:36.080 --> 00:05:42.160
so it will take several minutes to
process. You can see the status here.
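
To give a rough sense of how the dpi setting affects the amount of image data, here is a small Python sketch of the arithmetic. It assumes a US Letter page (8.5 by 11 inches), which the video does not state, and the function names are hypothetical.

```python
def page_pixels(width_in, height_in, dpi):
    """Return the (width, height) of a page in pixels at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(original_dpi, new_dpi):
    """Return the fraction of pixels kept after downsampling.

    Pixel count scales with the square of the resolution, because both
    the width and the height shrink by the same factor.
    """
    return (new_dpi / original_dpi) ** 2


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    for dpi in (600, 300, 150, 72):
        print(dpi, page_pixels(8.5, 11, dpi))
    print(pixel_ratio(600, 150))
```

Under these assumptions, going from 600 dpi to 150 dpi keeps one sixteenth of the pixels per page, which is why downsampled files are much smaller but should be checked for legibility.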

00:05:45.759 --> 00:05:48.880
I should also mention that the OCR
software

00:05:48.880 --> 00:05:53.840
will automatically straighten your pages.

00:05:56.960 --> 00:06:00.080
Converting the text and processing the
pages

00:06:00.080 --> 00:06:03.280
can use a lot of RAM, so you might want
to take a break

00:06:03.280 --> 00:06:08.639
while it's processing.
Once your document has finished

00:06:08.639 --> 00:06:14.319
processing you can save your file
but be sure to save your OCR file with a

00:06:14.319 --> 00:06:17.440
different file name
so that you don't overwrite your

00:06:17.440 --> 00:06:21.919
original file.
Mistakes can happen so it's best to not

00:06:21.919 --> 00:06:34.560
save
over your original file.
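
The advice about saving under a different file name can be sketched in Python as follows. This is only an illustration of a naming convention, not something Acrobat does for you. The `_ocr` suffix and the function name `ocr_output_path` are assumptions.

```python
from pathlib import Path


def ocr_output_path(original, suffix="_ocr"):
    """Return a new path next to `original` with `suffix` added to the name.

    Hypothetical helper: "survey.pdf" becomes "survey_ocr.pdf", so the
    original scan is never the save target.
    """
    original = Path(original)
    return original.with_name(original.stem + suffix + original.suffix)


if __name__ == "__main__":
    target = ocr_output_path("scans/survey.pdf")
    print(target.name)
    # A real workflow might also check target.exists() before saving.
```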

00:06:34.560 --> 00:06:38.080
After saving the file, let's search the
word Glassboro

00:06:38.080 --> 00:06:46.840
to see if Glassboro had any glass making

00:06:46.840 --> 00:06:50.639
factories.
As you can see Glassboro had an

00:06:50.639 --> 00:06:54.400
extensive and early history in glass making.

00:06:54.400 --> 00:06:58.560
To learn more about this topic,
check out our Glassboro Memory Mapping

00:06:58.560 --> 00:07:01.120
project.
The link is available in the description

00:07:01.120 --> 00:07:03.840
below.

00:07:04.000 --> 00:07:07.919
Creating OCR PDFs can help you with your
research

00:07:07.919 --> 00:07:12.800
by allowing you to search the text
within a document.

00:07:12.800 --> 00:07:16.080
Making your documents OCR compatible
will

00:07:16.080 --> 00:07:21.120
also make your PDFs more accessible to
those who are visually impaired

00:07:21.120 --> 00:07:24.880
and need a screen reader to access the
information.

00:07:24.880 --> 00:07:30.240
Though OCR PDFs take a little
extra time to create,

00:07:30.240 --> 00:07:34.400
doing so can be extraordinarily
beneficial for your students

00:07:34.400 --> 00:07:38.800
and future researchers. In our next
Acrobat tutorial

00:07:38.800 --> 00:07:44.000
we will explore how to use Acrobat to
make your files more accessible. Thank

00:07:44.000 --> 00:07:49.120
you for watching
and don't forget to subscribe to our

00:07:56.840 --> 00:07:59.840
channel.