Sentient thoughts about .NET

Jonathan Greensted's .NET weblog

December 2004 - Posts

Finding Word styles fast

In a recent project we needed to find every paragraph formatted in a particular style (e.g. Heading 1, Heading 2) in order to build a custom table of contents.

My initial thought was to iterate over the document paragraphs looking for the particular style we were interested in.

The code looked like this:

 

private void FindStyleUsingParagraphIteration(Word.Document doc, string style)
{
    foreach (Word.Paragraph p in doc.Paragraphs)
    {
        // Every paragraph costs several COM interop calls, whether or not it matches.
        Word.Style s = (Word.Style)p.Range.get_Style();

        if (s.NameLocal == style)
        {
            string text = p.Range.Text;

            int page = (int) p.Range.get_Information(Word.WdInformation.wdActiveEndPageNumber);
            Single vert = (Single) p.Range.get_Information(Word.WdInformation.wdVerticalPositionRelativeToPage);

            // One line per match, terminated so that entries do not run together.
            Message += string.Format("        Style: {0}, text: {1}, page: {2}, vert: {3}\n",
                s.NameLocal, text, page, vert);
        }
    }
}

 

This worked perfectly, but it was a little slow, so I looked for a better approach.

My second attempt was to use Word's built-in Find functionality to find paragraphs in a particular style.

The code looked like this:

 

private void FindStyleUsingFind(Word.Document doc, string style)
{
    object oMissing = Type.Missing;
    object oTrue = true;
    object oStyle = style;

    // Start with the whole document; each successful Execute moves r to the match.
    Word.Range r = doc.Content;

    bool found = true;
    while (found)
    {
        try
        {
            r.Find.ClearFormatting();
            r.Find.set_Style(ref oStyle);

            // The ninth argument (Format) is true so that the search matches on formatting.
            found = r.Find.Execute(ref oMissing, ref oMissing, ref oMissing, ref oMissing,
                ref oMissing, ref oMissing, ref oMissing, ref oMissing,
                ref oTrue, ref oMissing, ref oMissing, ref oMissing,
                ref oMissing, ref oMissing, ref oMissing);
        }
        catch
        {
            // Treat any failure (for example, an unknown style name) as "not found".
            found = false;
        }

        if (found)
        {
            string text = r.Text;

            int page = (int) r.get_Information(Word.WdInformation.wdActiveEndPageNumber);
            Single vert = (Single) r.get_Information(Word.WdInformation.wdVerticalPositionRelativeToPage);

            Message += string.Format("        Style: {0}, text: {1}, page: {2}, vert: {3}\n",
                style, text, page, vert);
        }
    }
}


The performance of this approach was much better. However, the Word status bar flickered as each Find ran, so I started looking for a third option.

My third attempt was to use Word's built-in XML support and write an XPath query against the WordML schema.

The code looked like this:

private void FindStyleUsingXPath(Word.Document doc, string style)
{
    try
    {
        // Prefix mapping for the WordML namespace used in the XPath below.
        string schema = "xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\"";
        string xPath = string.Format("//w:p[descendant::w:pStyle[@w:val='{0}']]/w:r/w:t", style);
        Word.XMLNodes nodes = doc.SelectNodes(xPath, schema, false);

        if (nodes != null)
        {
            foreach (Word.XMLNode node in nodes)
            {
                int page = (int) node.Range.get_Information(Word.WdInformation.wdActiveEndPageNumber);
                Single vPos = (Single) node.Range.get_Information(Word.WdInformation.wdVerticalPositionRelativeToPage);
                string text = node.Text;

                // Placeholders must use braces, not parentheses, to be substituted.
                Message += string.Format("        Page {0}, vert {1}, {2}\n", page, vPos, text);
            }
        }
    }
    catch (Exception ex)
    {
        Message += ex.Message;
    }
}

Sadly, this is where it all went wrong. The code ran fine (without any exceptions being thrown), but it returned no results!

The first thing I noticed was that it took longer to return no results than my Find attempt took to do the job properly, so this isn't a good approach in terms of performance anyway. Still, it was strange that it returned nothing at all. After some investigation I found that the WordML namespace was not registered in the doc.XMLSchemaReferences collection, so I thought that maybe I should add it manually using the following code:

object oNamespaceURI = "http://schemas.microsoft.com/office/word/2003/wordml";
object oAlias = "w";
object oFileLocation = @"C:\Program Files\Microsoft Office 2003 Developer Resources\Microsoft Office 2003 XML Reference Schemas\WordprocessingML Schemas\w10.xsd";

// Prefix mapping takes the form xmlns:alias="namespaceURI", so the alias comes first.
string schema = string.Format("xmlns:{0}=\"{1}\"", oAlias, oNamespaceURI);

doc.XMLSchemaReferences.Add(ref oNamespaceURI, ref oAlias, ref oFileLocation, false);

Unfortunately, this code throws the following exception: "This schema cannot be used because it attempts to declare a namespace reserved by Word." So that isn't the answer.

Finally, in frustration, I emailed a buddy of mine on the Office team, who confirmed definitively that SelectNodes(...) only works with custom schemas and not with WordML. So there you have it: don't waste your time like I did trying to do this, at least until the next version. (I have submitted a feature request.)
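As a sanity check on the XPath expression itself, independent of Word, the same query can be run with System.Xml against a small WordML fragment. This is only a sketch: the class name WordMlXPathCheck, the helper FindTextByStyle and the sample fragment are made up for illustration and are not part of the downloadable source. It also assumes that the w:val of w:pStyle holds a style ID such as Heading1 rather than the display name Heading 1, and that the style value contains no apostrophe.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class WordMlXPathCheck
{
    // WordML 2003 namespace, as used in the post.
    public const string WordMlNamespace =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Hand-written sample fragment (an assumption, not taken from a real document).
    public const string Sample =
        "<w:wordDocument xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\">" +
        "<w:body>" +
        "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Body text.</w:t></w:r></w:p>" +
        "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr><w:r><w:t>Results</w:t></w:r></w:p>" +
        "</w:body></w:wordDocument>";

    // Hypothetical helper: returns the text of every run inside paragraphs
    // whose w:pStyle matches styleId, using the same XPath shape as the post.
    public static List<string> FindTextByStyle(string wordMl, string styleId)
    {
        XmlDocument xml = new XmlDocument();
        xml.LoadXml(wordMl);

        // System.Xml counterpart of the "xmlns:w=..." prefix mapping string.
        XmlNamespaceManager ns = new XmlNamespaceManager(xml.NameTable);
        ns.AddNamespace("w", WordMlNamespace);

        string xPath = string.Format(
            "//w:p[descendant::w:pStyle[@w:val='{0}']]/w:r/w:t", styleId);

        List<string> results = new List<string>();
        foreach (XmlNode node in xml.SelectNodes(xPath, ns))
        {
            results.Add(node.InnerText);
        }
        return results;
    }
}
```

Calling WordMlXPathCheck.FindTextByStyle(WordMlXPathCheck.Sample, "Heading1") selects the two heading runs and skips the body paragraph, which suggests the expression is well formed and that the empty result inside Word comes from the SelectNodes restriction rather than from the query. Note that this works on raw XML only, so it cannot supply the page number or vertical position that the Word object model provides.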

For those of you who are interested in the relative performance, here are the stats for my test document:

      Using Paragraph iteration: 10.1445872 seconds
      Using Find: 1.9928656 seconds
      Using XPath: 4.9771568 seconds (failed)
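For anyone wanting to repeat this kind of comparison on their own document, a small timing helper along these lines would do. It is a sketch only: ApproachTimer, Time and Report are hypothetical names, and it uses Stopwatch and lambdas, which are newer than the .NET version this post was written against. How the figures above were actually measured is not shown here.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

public static class ApproachTimer
{
    // Hypothetical helper: runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }

    // Formats a result line in the same shape as the stats above.
    // The invariant culture keeps the decimal separator a full stop on any machine.
    public static string Report(string label, TimeSpan elapsed)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "Using {0}: {1} seconds", label, elapsed.TotalSeconds);
    }
}
```

A call would look something like ApproachTimer.Report("Find", ApproachTimer.Time(() => FindStyleUsingFind(doc, "Heading 1"))). A single run is noisy, so it is worth repeating each approach a few times on the same document before drawing conclusions.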

My conclusion was to use the Find.Execute method and put up with the status bar flickering!

The source for this experimentation can be downloaded from the following page on the Sentient website:

      http://www.sentient.co.uk/wordFindStyle.aspx

If you know of a better way to do this please add a comment to this blog.

Posted: Dec 07 2004, 09:13 AM by jonathangreensted | with 4 comment(s)