SharePoint 2010 with OCR
The original article, written by me in Portuguese, was published at www.sharepointpt.org.
In this article I describe a simple way to get images from SharePoint and run OCR on them using Tessnet2, a .NET 2.0 OCR assembly.
OCR stands for Optical Character Recognition, a technology that recognizes characters in an image file or bitmap. With OCR you can scan a sheet of printed text and get an editable text file.
Tessnet2 needs a folder containing its core processing libraries; in this case I have the English and Portuguese ones. We also have to add the 64-bit DLL to the project, since I'm using SharePoint 2010.
In the first part of this article I read the files of a SharePoint document library and save them to the hard drive in "c:\temp\images".
The SharePoint Process
Note that I process each file immediately inside the foreach. If we want to control whether the document is online or not, we have to use the switch included in the procedure.
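string ImagePath = @"c:\temp\images\";
// "SPSite" and "SPList" stand for the site URL and the library name.
using (SPSite mysite = new SPSite("SPSite"))
using (SPWeb myweb = mysite.OpenWeb())
{
    SPFolder mylibrary = myweb.Folders["SPList"];
    SPFileCollection files = mylibrary.Files;
    foreach (SPFile item in files)
    {
        try
        {
            byte[] binfile2 = item.OpenBinary();
            // FileMode.Create is an assumption; the original constructor arguments were cut off.
            using (FileStream fstream = new FileStream(ImagePath + item.Name, FileMode.Create))
                fstream.Write(binfile2, 0, binfile2.Length);
        }
        catch (Exception ex) { /* handle or log ex here */ }
    }
}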
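The switch mentioned above does not appear in the listing. A minimal sketch of what such a check might look like, assuming it tests the file's CheckOutType property; the names CheckOutFilter and CanProcess are hypothetical and only for illustration:

```csharp
using Microsoft.SharePoint;

public static class CheckOutFilter
{
    // Hypothetical helper: returns true when the file may be processed.
    public static bool CanProcess(SPFile file)
    {
        switch (file.CheckOutType)
        {
            case SPFile.SPCheckOutType.None:
                return true;   // nobody has the file checked out
            case SPFile.SPCheckOutType.Online:
            case SPFile.SPCheckOutType.Offline:
                return false;  // checked out to a user, so skip it
            default:
                return false;
        }
    }
}
```

Inside the foreach, a call such as CanProcess(item) could decide whether the file is written to disk at all.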
I use a method that receives the path to the image and returns a StringBuilder, because it is much faster than a string array.
The method appends the text word by word to the StringBuilder, adding a space after each word.
The RemoveDiacriticals method removes some garbage (diacritics) from the OCR output.
General method for OCR processing
private StringBuilder ProcessOcr(string imagePath)
{
    StringBuilder sb = new StringBuilder();
    using (Bitmap image = new Bitmap(imagePath))
    using (tessnet2.Tesseract tessocr = new tessnet2.Tesseract())
    {
        // Load the Portuguese language data from the tessdata folder.
        tessocr.Init(@"c:\temp\tessdata", "por", false);
        List<tessnet2.Word> result = tessocr.DoOCR(image, Rectangle.Empty);
        foreach (tessnet2.Word word in result)
            sb.Append(RemoveDiacriticals(word.Text) + " ");
    }
    return sb;
}

private string RemoveDiacriticals(string txt)
{
    // FormD splits each accented letter into a base letter plus combining marks.
    string nfd = txt.Normalize(NormalizationForm.FormD);
    StringBuilder retval = new StringBuilder(nfd.Length);
    foreach (char ch in nfd)
    {
        // Skip characters that fall in the combining diacritical mark ranges.
        if (ch >= '\u0300' && ch <= '\u036f') continue;
        if (ch >= '\u1dc0' && ch <= '\u1de6') continue;
        if (ch >= '\ufe20' && ch <= '\ufe26') continue;
        if (ch >= '\u20d0' && ch <= '\u20f0') continue;
        retval.Append(ch);
    }
    return retval.ToString();
}
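To see why the range checks in RemoveDiacriticals work, the small console sketch below (the class name FormDDemo is only illustrative) shows that FormD splits an accented letter into a base letter followed by a combining mark in the U+0300 to U+036F range:

```csharp
using System;
using System.Text;

class FormDDemo
{
    static void Main()
    {
        // "ã" written as an escape so the source file encoding does not matter.
        string composed = "\u00e3";
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(composed.Length);    // 1
        Console.WriteLine(decomposed.Length);  // 2

        // Prints U+0061 (the letter a), then U+0303 (combining tilde).
        foreach (char ch in decomposed)
            Console.WriteLine("U+{0:X4}", (int)ch);
    }
}
```

Dropping the second character leaves a plain "a", which is how a word like "João" ends up as "Joao".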
Now we go through the directory where I saved the pictures taken from SharePoint. In this example I process only .jpg files and extract the OCR text.
I use GC.Collect() to release memory.
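private string VamosNessa()
{
    StringBuilder sb = new StringBuilder();
    DirectoryInfo di = new DirectoryInfo(ImagePath);
    FileInfo[] rgFiles = di.GetFiles("*.jpg");
    foreach (FileInfo fi in rgFiles)
    {
        // Assumed loop body (missing from the original listing): OCR each image and collect the text.
        sb.Append(ProcessOcr(fi.FullName));
        GC.Collect(); // release memory after each image
    }
    return sb.ToString();
}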
If you want to upload the OCR text to a field in a list, we need to know the document's link in SharePoint, which we can keep in one of the previous methods. Then I call CheckOut(), Update() and CheckIn(). Be sure to check the file's SPCheckOutType, because we may not want to touch anything that is not approved; that is up to you.
We will use two fields: a Boolean that tells me whether the OCR has been processed, and a multi-line text field that holds the OCR text.
item["OCR"] = VamosNessa();
item["BOOL"] = "1";
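As a rough sketch of the check-out, update and check-in sequence described above: the helper name SaveOcrText, its parameters and the check-in comment are assumptions for illustration, while the field names "OCR" and "BOOL" follow the two lines above.

```csharp
using Microsoft.SharePoint;

public static class OcrFieldWriter
{
    // Hypothetical helper: siteUrl and fileUrl identify the document in SharePoint.
    public static void SaveOcrText(string siteUrl, string fileUrl, string ocrText)
    {
        using (SPSite site = new SPSite(siteUrl))
        using (SPWeb web = site.OpenWeb())
        {
            SPFile file = web.GetFile(fileUrl);

            // Leave files alone when someone already has them checked out.
            if (file.CheckOutType != SPFile.SPCheckOutType.None)
                return;

            file.CheckOut();
            SPListItem item = file.Item;
            item["OCR"] = ocrText;   // multi-line text field
            item["BOOL"] = "1";      // marks the document as processed
            item.Update();
            file.CheckIn("OCR text added");
        }
    }
}
```

Whether to skip checked-out or unapproved documents, as this sketch does, remains a choice for your own scenario.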
This method works best with Letter-format images. I also suggest creating a service that processes this information, since this process is synchronous.
João Tito Lívio
Microsoft Most Valuable Professional, Office Systems, since 2002