2016-03-21

Process HTML table by HTMLAgility Pack

Ocassionaly I need to deal with data inside a html table. For example, a report generated from Salesforce: though it is in ".xls" format, it is actually a html file. From what I tried, HTMLAgility pack can do quite a lot jobs for you, as demonstrated below.

Firstly I need a report from Salesforce. To do so, I log into my developer account, and navigate to Reports -> Account and Contact Report. From here I select New Accounts report, and press Export Details button to save the report into my local drive.




After save the report, when we open the report, an error message pops up and tells us the file doesn't look like an Excel file.




Ignore the error by pressing Yes button, the file is opened, and everything seems to be fine.



Now close the file and open it by your favorate text editor. Now you see it is actually a html file.






To import such a file by SSIS is not a simple task, as the file cannot be used as an Excel Data Source. What I usually did, is using HTMLAgility  pack in C# to generate the data source.




As usual, the coding task itself is simple, understanding how the HTMLAgility pack deals with the document object is important. Basically it will treat every html tag as a node, then drill down to the lowest child node. After understand this point, it is just a matter of dealing exceptions, for instance, we do not want to process a report with column header only.  Here is the example of using HTMLAgility pack when I need to import Salesforce report

       string filepath = @"D:\sforcerpt\report1458540550328.xls";  
       if (File.Exists(filepath))  
       {  
         HtmlDocument doc = new HtmlDocument();  
         doc.Load(filepath);  
         DataTable dt = new DataTable();  
         foreach (HtmlNode n in doc.DocumentNode.ChildNodes)  
         {  
           if (n.Name == "table" && n.HasChildNodes && n.ChildNodes.Where(x => x.Name == "tr").Count() >= 2)  
           {  
             int NbrOfRows = n.ChildNodes.Where(x => x.Name == "tr").Count();  
             //column headers  
             HtmlNode headerRow = n.SelectSingleNode("/table[1]/tr[1]");  
             foreach (HtmlNode header in headerRow.ChildNodes)  
             {  
               dt.Columns.Add(header.InnerText);  
             }  
             //content rows  
             for (int i = 2; i <= NbrOfRows; i++)  
             {  
               string path = "/table[1]/tr[" + i.ToString() + "]";  
               HtmlNode contentRow = n.SelectSingleNode(path);  
               DataRow dr = dt.NewRow();  
               for (int j = 1; j <= contentRow.ChildNodes.Count; j++)  
               {  
                 dr[j - 1] = contentRow.ChildNodes[j - 1].InnerText;  
               }  
               dt.Rows.Add(dr);  
             }  
           }  
         }  
         foreach (DataRow r in dt.Rows)  
         {  
           Console.WriteLine("Account ID: {0}; Account Name: {1}", r["Account ID"].ToString(), r["Account Name"].ToString());  
         }  
         Console.ReadLine();  
       }