SSIS Script Task to verify a table full of URLs


I am going to apologize ahead of time in saying that I am pretty new to SSIS, but I've been assigned a project that involves connecting to a table and verifying URLs that are part of a product's 'attributes'. The verification would basically be checking the HTTP status code - 200 = good and 404 = bad - and loading the ProductID and the result into either a goodURL or a badURL table.

I did read one post that got me in the ballpark, but it fell apart because it dealt with a single specific URL and not thousands of them. So if anyone could point me in the right direction on how to set up the variables and return the results from the 'HEAD' request (mentioned in some posts I've read) used to check a URL, it would be greatly appreciated.

So I have an Execute SQL Task that returns over 1000 records: ProductID, URL_Type, and URL. I then have a Script Task, with code I found here on Stack Overflow, that I would like to read the individual records and output ProductID, URL_Type, URL, and a new column URL_Valid (true or false). Here is the code for the Script Task:

    // Requires "using System.Net;" at the top of the Script Task code.
    public void Main()
    {
        // URLCheck holds the URL to test; Found receives the true/false result.
        Dts.Variables["Found"].Value = UrlFound(Dts.Variables["URLCheck"].Value.ToString());
        Dts.TaskResult = (int)ScriptResults.Success;
    }

    private bool UrlFound(string url)
    {
        try
        {
            // Create the HttpWebRequest and use HEAD so only the headers are downloaded (GET works too).
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.Method = "HEAD";

            // Get the web response and return TRUE if the status code is 200 (OK).
            using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
            {
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch
        {
            // Any exception (404, timeout, bad host, ...) counts as not found.
            return false;
        }
    }

I am just trying to get it to work with a single URL and return true or false right now, and even that isn't working. I'm trying to start small and then build up to the wanted result at the end. I'm not sure I understand the parameter mapping when a full result set is fed into a Script Task, or how to return the results I want. Again, I am new, but I am trying to get there. SSIS is the only tool I have to work with on this project. Thanks ahead of time.
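From the posts I've read so far, it sounds like the usual pattern is to have the Execute SQL Task return a Full Result Set into an Object variable and then loop over it inside the Script Task with an OleDbDataAdapter - something roughly like the sketch below (the variable name User::UrlResultSet and the column names are just my guesses at this point):

    // Requires "using System.Data;" and "using System.Data.OleDb;" in the Script Task.
    public void Main()
    {
        // The Execute SQL Task's Full Result Set is stored in an Object variable.
        DataTable urls = new DataTable();
        OleDbDataAdapter adapter = new OleDbDataAdapter();
        adapter.Fill(urls, Dts.Variables["User::UrlResultSet"].Value);

        foreach (DataRow row in urls.Rows)
        {
            int productId = Convert.ToInt32(row["ProductID"]);
            string url = row["URL"].ToString();
            bool valid = UrlFound(url);   // same UrlFound method as above
            // TODO: write productId / url / valid out somewhere (another table, variables, etc.)
        }

        Dts.TaskResult = (int)ScriptResults.Success;
    }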

[Screenshot: basic flow of the package]


1 Answer


As mentioned in my comment, I would load the URLs in a Data Flow. After the source (e.g. an OLE DB Source), add a Script Component configured as a Transformation. This Script Component should take the URL as an input column and should have two outputs - UrlValid and UrlFaulty, or similar, each with URL and Code columns - which are configured as asynchronous. Within this component you can then override the following method:

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Requires "using System.Net;" at the top of the Script Component code.
        string currUrl = Row.URL;
        int statusCode = -1;

        if (!String.IsNullOrEmpty(currUrl))
        {
            try
            {
                // Issue a HEAD request so only the headers are downloaded.
                HttpWebRequest request = WebRequest.Create(currUrl) as HttpWebRequest;
                request.Method = "HEAD";

                using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
                {
                    statusCode = (int)response.StatusCode;
                }

                if (statusCode == 200)
                {
                    // 200 OK - send the row to the UrlValid output.
                    UrlValidBuffer.AddRow();
                    UrlValidBuffer.URL = currUrl;
                    UrlValidBuffer.Code = statusCode;
                }
                else
                {
                    // Any other status code - send the row to the UrlFaulty output.
                    UrlFaultyBuffer.AddRow();
                    UrlFaultyBuffer.URL = currUrl;
                    UrlFaultyBuffer.Code = statusCode;
                }
            }
            catch (Exception e)
            {
                // The request threw (404, timeout, unknown host, ...) - log a warning and route to UrlFaulty.
                ComponentMetaData.FireWarning(10, "Loading URL", "URL " + currUrl + " failed to open: " + e.Message, "", 0);
                UrlFaultyBuffer.AddRow();
                UrlFaultyBuffer.URL = currUrl;
                UrlFaultyBuffer.Code = statusCode;
            }
        }
        else
        {
            ComponentMetaData.FireWarning(10, "Loading URL", "URL is empty.", "", 0);
        }
    }

Note, however, that faulty URLs will already throw an exception during the HTTP request, so you will not get much further information out of the response itself. If you need to extract article info or anything similar, you will have to pull it out of the URL, not out of the response.
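If you do need the actual status code of a failed request (for example to tell a 404 apart from a DNS failure), one option is to catch the WebException that HttpWebRequest throws for protocol errors and read the code from the response attached to the exception. A rough sketch of an additional catch block that could sit before the generic catch above, assuming the same UrlFaultyBuffer output:

    catch (WebException e)
    {
        // For protocol errors (404, 500, ...) the server's response is still attached to the exception.
        HttpWebResponse errorResponse = e.Response as HttpWebResponse;
        if (errorResponse != null)
        {
            statusCode = (int)errorResponse.StatusCode;
            errorResponse.Close();
        }

        UrlFaultyBuffer.AddRow();
        UrlFaultyBuffer.URL = currUrl;
        UrlFaultyBuffer.Code = statusCode;
    }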