So What Does a Consensus Transcription Look Like, Exactly?

Providing an accurate transcription of the Thomas T. Eckert Papers is one of the primary goals of Decoding the Civil War. It’s why we applied to the National Historical Publications and Records Commission (NHPRC) for a grant, and it’s why our thousands of volunteers have put so much time and effort into this project. With more than 4,000 subjects, or pages, now retired, the folks at Zooniverse have begun the process of establishing the consensus transcriptions, and we are excited and pleased with the results.

Each page, whether in the telegram ledgers or the codebooks, is seen by multiple people. The pages that have been “retired” are those that have been seen and classified by a sufficient number of people. To find the consensus transcription, an algorithm is run that compares every word and finds the most frequently used one. This doesn’t guarantee that the transcription is accurate, and we are allowing for corrections in the future, but it gives us a version of the text that most people agree on.
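
For readers who like to see the idea in code, here is a minimal sketch of word-by-word majority voting. It is not the Zooniverse algorithm itself: the function name consensus_line and the sample data are made up for illustration, and the sketch assumes the volunteers' words have already been lined up position by position, with None marking a word someone skipped.

```python
from collections import Counter


def consensus_line(transcriptions):
    """Pick the most frequently transcribed word at each position.

    Assumption: every transcription is a list of words already aligned
    by position, with None where a volunteer skipped a word.
    Returns a list of (word, number_of_votes) pairs.
    """
    consensus = []
    for column in zip(*transcriptions):
        # Count only the words that volunteers actually typed.
        votes = Counter(word for word in column if word is not None)
        if not votes:
            continue  # nobody transcribed this position
        word, count = votes.most_common(1)[0]
        consensus.append((word, count))
    return consensus


# Invented volunteer input for one line of the telegram.
volunteers = [
    ["they", "had", "better", "not", "be", "sent"],
    ["they", "had", "better", "not", "be", "sent"],
    ["they", "had", "bitter", "not", "be", None],
    ["they", "has", "better", "not", "be", "sent"],
]

result = consensus_line(volunteers)
print(" ".join(word for word, _ in result))
```

Keeping the vote count next to each word is a deliberate choice here, since those counts are what make it possible to judge later how much agreement stands behind each word.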

Let’s look, for example, at this telegram from the top of page 8, ledger 1:

mssec_01_010 top telegram.jpg

The consensus lines and box are fairly straightforward – they were made by averaging the locations drawn by everyone who classified this page. As you can see, there are some quirks, such as the weirdly short top line, which we have seen consistently throughout the reviewed data: some people underlined the entire top line, while others split it into two smaller lines. The lack of underlining with the second line of the telegram is a bit harder to parse out, but most people transcribed “sent”, so the effect on the transcription was minimal. The box comes from that last step of “boxing” the telegram. It will prove very useful in Phase 2 when we parse out individual telegrams, as we have done in the example above.
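
The averaging step can be sketched in a few lines. This is an illustration only, under two assumptions that are not from the project's own code: each box is stored as (x, y, width, height) in pixels, and every box in the list was drawn around the same telegram. The function name consensus_box and the coordinates are hypothetical.

```python
from statistics import mean


def consensus_box(boxes):
    """Average a list of (x, y, width, height) boxes, coordinate by coordinate.

    Assumption: all boxes were drawn around the same telegram, so no
    grouping of overlapping boxes is needed first.
    """
    if not boxes:
        raise ValueError("need at least one box")
    # zip(*boxes) groups all x values, all y values, and so on.
    return tuple(mean(values) for values in zip(*boxes))


# Invented boxes from three volunteers.
drawn = [(10, 20, 400, 150), (14, 18, 390, 160), (12, 22, 410, 155)]
box = consensus_box(drawn)
print(box)
```

A plain average like this is sensitive to one wildly misplaced box, so a real pipeline might prefer a median or discard outliers first.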

So, what does the consensus transcription look like for this telegram?! Without further ado….

Louisville 4′ Recd Feb Feb 4 ’62
Col Colburn asst adjt General Ocean
they had better not be sent
I may want them soon if
they are ready for service Alvord

Huzzah! With the exception of that duplicated “Feb”, this seems to be spot on.

Here’s a closer look at the breakdown of the responses:

mssec_01_010 top message consensus.jpg

The numbers underneath each word indicate the number of people who transcribed it that way. The Zooniverse team uses these numbers to calculate the reliability of each line and each page. This message comes from a page with a reliability value of 0.8658, an excellent value on a scale from 0.0 to 1.0. We are currently working on determining what is an acceptable base level, or floor, of reliability. Pages whose reliability is lower than that base level will have to be reviewed further, or placed back into the transcription queue.
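
To make the idea of a reliability floor concrete, here is a rough sketch. The formula is an assumption, not the one Zooniverse uses: it treats a line's reliability as the average share of volunteers who agreed with each consensus word. The function names, the page labels other than the 0.8658 value quoted above, and the floor of 0.7 are all placeholders.

```python
from statistics import mean


def line_reliability(word_votes):
    """Average agreement for one line.

    word_votes is a list of (votes_for_consensus_word, total_votes) pairs.
    Assumption: reliability is the mean fraction of volunteers who agreed
    with the winning word; the real calculation may differ.
    """
    shares = [agree / total for agree, total in word_votes if total]
    return mean(shares) if shares else 0.0


def pages_needing_review(page_scores, floor):
    """Return the pages whose reliability falls below the floor."""
    return sorted(page for page, score in page_scores.items() if score < floor)


# Invented vote counts: 8 of 8, 6 of 8, and 4 of 8 volunteers agreed.
score = line_reliability([(8, 8), (6, 8), (4, 8)])
print(score)

# Hypothetical page labels and floor; only 0.8658 comes from the post.
flagged = pages_needing_review({"page_008": 0.8658, "page_009": 0.52}, floor=0.7)
print(flagged)
```

Pages returned by pages_needing_review would be the ones to look at again or send back into the transcription queue.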

Once we have an acceptably reliable consensus transcription for a ledger, we will load that transcription into the Huntington Digital Library, so that researchers can start using the materials in a keyword-searchable form. At the same time we will be loading individual telegrams into Phase 2 of Decoding the Civil War, in which volunteers will tag metadata such as sender, recipient, times sent and received, and more. The fruits of the volunteer labor of “boxing” the telegram come into play here, with the consensus box helping us determine the correct location of the telegram on the page.

We are incredibly excited about our progress so far, and can’t wait to share more of our findings in the near future!

10 responses to “So What Does a Consensus Transcription Look Like, Exactly?”

  1. SarahTheEntwife says :

    Fascinating! It’s great to see some of the behind-the-scenes work 🙂

  2. Craig says :

    Thank you Kpeck for a great insight into how and where this is all going!

    Here are some basic questions (naturally..!):

    1. How many total transcriptions were actually done on this example (top-Pg 8, ledger 1)? Eight is the highest number we see in the response breakdown. The stats website seems to suggest that “retirement” comes at 10 transcriptions – or am I reading that wrong?

    2. If only one transcriber noted “ ’62 ” = “apostrophe 62” how does it make the cut? Does the system somehow know the apostrophe should be there or does a human step in with a final decision?

    3. It just dawned on me – oftentimes text boxes must, out of necessity, overlap – can the system deal with that or should we adjust something, somehow?

    Many thanks again, Craig

    • katecpeck says :

      I would be concerned if we didn’t have questions from you, Craig 🙂

      Some answers, off the top of my head:

      1) Retirement does come after 10 “classifications”, as Zooniverse refers to each volunteer’s interaction with a single subject, but not everyone transcribes every word on every page. Some skip words that are hard, some just miss text that is outside of the immediate window. I think we’ve all had that moment when we hit “Done”, only to realize that we missed half a page but have no way to go back and fix the omission. It’s part of why we set the retirement limit so high.

      2) The transcription that I shared above is the raw product of the consensus algorithm, which I should point out is a work in progress. We are still working out what to do with messages that have low reliability scores, and it’s even trickier with single words with low scores. We may have to build in a spot checking system that we can employ before loading ledgers into the Huntington Digital Library, but that’s still on the horizon.

      3) The consensus data that I’ve seen on the text boxes has been pretty solid, although those have been rather straightforward examples. We’ll have to see what happens with the messages that are squashed together. Before we sent the images off to Zooniverse I mapped out all of the telegrams in all of the ledgers, so we know how many are on each page, and should be able to check that against the number of boxes that our volunteers have drawn. No need to adjust what you’re doing.

      Hopefully that answers most of your questions! Let me know when more pop up 🙂

  3. Marlys Sebasky says :

    Neat to know all that, thank you!

    On Tue, Nov 1, 2016 at 1:02 PM, Decoding the Civil War wrote:

    > katecpeck posted: “Providing an accurate transcription of the Thomas T. Eckert Papers is one of the primary goals of Decoding the Civil War. It’s why we applied to the National Historical Publications and Records Commission (NHPRC) for a grant, and it’s why our thousands of”

  4. Craig says :

    Hi Kate, Thanks for the amazingly quick turnaround time that is so illustrative of this project – a devoted Zooniverse/Huntington team that interacts so well with volunteers, is so responsive and willing to share their knowledge.

    So here is another little question: I, and no doubt many other transcribers, will spend an embarrassingly long time trying to make sense out of a seeming mishmash of characters that are meant to refer to an obscure village, creek, river, etc. It’s a great feeling of accomplishment to solve these riddles, but is there any real sense in spending the time if the correct answer we find can be overruled by a wrong consensus?

    Also, is that time spent contradictory to the immediate goal of the project, which is to get everything classified to an acceptable “reliability value” and then later on get down to fine tuning?

    Best regards,
    Inquisitive

    • katecpeck says :

      That’s an interesting problem, Craig, and I don’t know that I can advise people one way or another. Obviously we want the consensus transcription to be as accurate as possible, but we are slaves to the algorithm for now. I don’t think that extra effort is wasted, since I suspect that some people skip over the words they can’t read or mark them as “unclear”, in which case the consensus will be based on far fewer transcriptions. Solving those mysteries will probably also help you familiarize yourself with the writing, making future transcription easier.
      Although our tangible institutional goal is full transcriptions of the materials, giving people an opportunity to interact and make a connection with the past is definitely one of our philosophical goals. The diligence that you and other volunteers have demonstrated is evidence that we are accomplishing that goal!

  5. Craig says :

    Thank you Kate and “Yes!” to familiarization and philosophical goals.
    It sounds like the algorithm might automatically reject “unclears” which could mean we should make “best guesses”? Except, of course, in really impossible situations.
    RE – your earlier post: I believe anyone who has already mapped out all 9,664 (???) telegrams is a candidate for some sort of Nobel Prize?

    • Craig says :

      I should clarify: I now mark “best guesses” as “unclear”. Is it better to make the guess and not mark it, which gives the algorithm at least something to work with?

      Craig

    • marioeinaudi says :

      Hello Craig, Thank you for recognizing Kate’s hard work! It is always wonderful to see a colleague’s hard work recognized by others. Just as an FYI, Kate actually counted 15,922 telegrams across the 35 ledgers and letterpress books. We are indebted to Kate for her amazing effort!

  6. Craig says :

    Thank you marioeinaudi for your mind-boggling update to the number 15,922!
