shannonlowder.com
Loading Hierarchical Data

Posted on February 2, 2012 (updated February 7, 2012) by slowder

Earlier this week, the question came up of how to insert data into a table that has a hierarchical structure.  If you’re not familiar with hierarchical data structures, here’s a quick explanation.

Consider a staging table used to load new accounts into our system.  Each account has a unique identifier and a name.  Since one company can own another, we have to be able to represent that in our system.  In this system we have parent companies and child companies, and each child company’s account is related to a parent account.

So in our destination table, this relationship would be enforced by adding a primary key on the AccountID, and then adding a foreign key on ParentAccountID to relate it to AccountID.
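As a sketch, the destination table might look like the following.  The table and column names here are assumptions, inferred from the constraint name in the error message later in this post:

```sql
-- Hypothetical destination table; the constraint name matches the one
-- that appears in the Msg 547 error below.
CREATE TABLE dbo.AccountBaseDestination (
    AccountID       INT          NOT NULL,
    AccountName     VARCHAR(100) NOT NULL,
    ParentAccountID INT          NULL,
    CONSTRAINT pk_AccountBaseDestination
        PRIMARY KEY (AccountID),
    CONSTRAINT fk_AccountBaseDestination_AccountBaseDestination__ParentAccountID_AccountID
        FOREIGN KEY (ParentAccountID)
        REFERENCES dbo.AccountBaseDestination (AccountID)
);
```

Note that ParentAccountID is nullable: a root-level account has no parent.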

Fairly straightforward, right?

Well, when you’re pushing thousands of rows into the table at a time, you can’t control the order of the inserts.  As soon as you try to insert a record whose ParentAccountID hasn’t yet been defined as an AccountID in the table, you’re going to get an error message.

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY SAME TABLE constraint "fk_AccountBaseDestination_AccountBaseDestination__ParentAccountID_AccountID". The conflict occurred in database "RV_DataConversion_Stage", table "dbo.AccountBaseDestination", column 'AccountID'.
The statement has been terminated.
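For example, using the table names implied by the constraint in the error message (hypothetical data), a child row inserted before its parent fails:

```sql
-- Assume the table is empty. This row references ParentAccountID 1,
-- but no row with AccountID 1 exists yet, so SQL Server raises the
-- Msg 547 FOREIGN KEY SAME TABLE conflict shown above.
INSERT INTO dbo.AccountBaseDestination (AccountID, AccountName, ParentAccountID)
VALUES (2, 'Child Co', 1);
```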

So now we’re to the problem at hand.  How can we make sure we don’t try to insert a row with a parentID that’s not defined?

Use A Cursor

It was proposed we use a cursor to run through the rows: grab a parent and all its children, insert the parent first, then insert all its children, then return to the beginning of the set and repeat.  While that would avoid the foreign key violation, it would cost a lot of I/O (disk resources) to do it that way.

BTW: while this option was being presented, the presenter said we wouldn’t use cursors to do this.  Yet at the same time, he was advocating a row-by-row operation.

If you’ve known me for any length of time, you know my stance on row-by-row operations in SQL.  I’m not a fan.


Insert all the Records, Then Update the ParentAccountID

Another proposal was to insert all the records without the ParentAccountID.  Then we could run an UPDATE against the destination table for each row whose parent had been defined.  Any row needing a ParentAccountID whose AccountID was not yet defined simply wouldn’t be updated.

Basically, the update would be an UPDATE with an INNER JOIN, so we’d avoid the violation that way.
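A sketch of that approach, assuming a staging table named dbo.AccountBaseStage with the same columns (both names are assumptions, not from the original system):

```sql
-- Step 1: load every account with ParentAccountID left NULL,
-- so the self-referencing foreign key can never fail.
INSERT INTO dbo.AccountBaseDestination (AccountID, AccountName, ParentAccountID)
SELECT s.AccountID, s.AccountName, NULL
FROM dbo.AccountBaseStage AS s;

-- Step 2: set ParentAccountID only where the parent now exists.
-- The INNER JOIN to the parent row silently skips any account
-- whose parent was never defined.
UPDATE d
SET d.ParentAccountID = s.ParentAccountID
FROM dbo.AccountBaseDestination AS d
INNER JOIN dbo.AccountBaseStage AS s
    ON s.AccountID = d.AccountID
INNER JOIN dbo.AccountBaseDestination AS p
    ON p.AccountID = s.ParentAccountID;
```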

Not bad, but if the destination table was in use, users might see a ton of new accounts at the root level rather than at whatever level they’re supposed to be at.  You’d also need locks on the destination table during the update, and those might escalate from row locks to page locks to table locks.  You’d want to test a load that adds 25–50% new records: if your destination table had 100 records, test a load of 25 new records, then a load of 50, and watch what locks are taken during each load.

If you see it escalate to a table lock, you may not want to do that to your users, if they are on the destination table regularly.

Use a CTE

A rather interesting idea that came up was to write a CTE that would select the staging data and add on a LevelNumber column.  This column would indicate what level in the hierarchy the account was on.  For example, level 1 would be the root node, level 2 would be children of accounts on level 1, level 3 would be children of level 2, and so on.

Once we know what level each account is on, we can load all the level 1 accounts, then all the level 2 accounts, and so on.  Not a bad idea, especially if your insert has to go through an SSIS package or a web service to do the actual insert into the database: you can be sure you’re only adding records in an integrity-safe way!
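A minimal sketch of that CTE, again assuming a staging table named dbo.AccountBaseStage (a hypothetical name).  A recursive CTE walks from the roots down, adding one to the level at each step:

```sql
-- Anchor: root accounts (no parent) are level 1.
-- Recursive member: a child's level is its parent's level + 1.
WITH AccountLevels AS (
    SELECT s.AccountID, s.AccountName, s.ParentAccountID,
           1 AS LevelNumber
    FROM dbo.AccountBaseStage AS s
    WHERE s.ParentAccountID IS NULL

    UNION ALL

    SELECT c.AccountID, c.AccountName, c.ParentAccountID,
           p.LevelNumber + 1
    FROM dbo.AccountBaseStage AS c
    INNER JOIN AccountLevels AS p
        ON p.AccountID = c.ParentAccountID
)
SELECT AccountID, AccountName, ParentAccountID, LevelNumber
FROM AccountLevels
ORDER BY LevelNumber, AccountID;
```

Loading the result one LevelNumber at a time guarantees every parent is inserted before any of its children.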

This is the method we ended up going with, because the insert was going through a web service.  But… there is another way.

WHILE EXISTS Loop

The last method we discussed was this: first, insert all the root nodes (level 1), and then use a WHILE EXISTS loop to insert children whose parent has already been defined.  The end result is very similar to the CTE solution; in fact, the costs are almost identical, since you have to “loop” through the code once for every level.  The difference is that in order to use the WHILE EXISTS method, you have to be able to run the actual INSERT statements, and that’s why we didn’t choose this method this time.
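As a rough sketch of the idea (staging and destination table names are assumptions carried over from earlier examples):

```sql
-- Step 1: insert the root nodes (no parent).
INSERT INTO dbo.AccountBaseDestination (AccountID, AccountName, ParentAccountID)
SELECT s.AccountID, s.AccountName, s.ParentAccountID
FROM dbo.AccountBaseStage AS s
WHERE s.ParentAccountID IS NULL;

-- Step 2: each pass inserts the staged children whose parent is now
-- in the destination table; the loop ends when no insertable rows remain.
WHILE EXISTS (
    SELECT 1
    FROM dbo.AccountBaseStage AS s
    INNER JOIN dbo.AccountBaseDestination AS p
        ON p.AccountID = s.ParentAccountID
    WHERE NOT EXISTS (SELECT 1 FROM dbo.AccountBaseDestination AS d
                      WHERE d.AccountID = s.AccountID)
)
BEGIN
    INSERT INTO dbo.AccountBaseDestination (AccountID, AccountName, ParentAccountID)
    SELECT s.AccountID, s.AccountName, s.ParentAccountID
    FROM dbo.AccountBaseStage AS s
    INNER JOIN dbo.AccountBaseDestination AS p
        ON p.AccountID = s.ParentAccountID
    WHERE NOT EXISTS (SELECT 1 FROM dbo.AccountBaseDestination AS d
                      WHERE d.AccountID = s.AccountID);
END;
```

Each pass of the loop handles one level of the hierarchy, mirroring the level-by-level load of the CTE approach.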

In my next training article I’m going to go through this method, so you can see the design pattern.

Conclusion

In the end, we went with the method that had the least impact on the server, met all of the requirements, and got the job done with the least effort.  There’s always a lot of back and forth in these design meetings.  It’s all part of the process of designing good code.  You can’t get too invested in “my way is best, I won’t do it any other way.”  That’s just counter-productive.

You have to open your ideas up to scrutiny and participate in the give and take.  In the end, you have to work with these people long after the process has been implemented.  You’re going to have to go through another design meeting.  You don’t want to make the next one even harder because you had to have your way…right?
