Thursday, June 26, 2014

How to speed up a Proc SQL sequence with Proc Sorts & Data merge



  • 8. Re: A join in proc sql takes 5 hours and consumes a lot of disk space
    SergioSanchezNewbie

    The data are in an Oracle Server and I make a copy of the datasets connecting to the server through a libname statement and after

    data a;
    set b (where = (var1<= date and var2>date and var3>date); where "b" is the dataset on the Oracle server

    I do a left join, the target in most of the cases is to obtain the surrogate key of the second table and one or two variables more.Something like this

    proc sql;
    create table x as
    select a.*, b.var1, b.var2, b.var3
    from dataset1 as a left join dataset2 as b
    on (a.var1 = b.var1);
    quit;

    There is no index at all in any of the datasets and no message appears in the log about an issue in the performance


  • Helpful Answer9. Re: A join in proc sql takes 5 hours and consumes a lot of disk space
    KurtBremserApprentice
    Try this:

    proc sort
      data=b /* this is your original oracle data set */
        (where = (var1<= date and var2>date and var3>date)
      out=dataset1
    ;
    by var1;
    run;

    proc sort
      data=dataset2 (keep=var1 var2 var3)
      out=data2x
    ;
    by var1;
    run;

    data x;
    merge
      dataset1 (in=a)
      data2x
    ;
    by var1;
    if a;
    run;

    Compare this method and the SQL method using options fullstimer;
    Also watch the disks while the jobs are running; you may be surprised by the disk usage(s).

    I remember when I first came across a piece of code done by a SAS consultant that had > 100 lines. I quickly saw that I could do the same in one create table with ~ 10 lines in PROC SQL, so why bother with all that code? Then I had to wait 5 hours for my SQL to finish, while his code took about 20 minutes to produce the same result. With less than half the disk space.

No comments:

Post a Comment