VCDBLOAD

VCDBLOAD loads the contents of what you have in your version-control system into a SQL Server database and creates a full-text index, so that you can easily search your source code.

Contents:
   Command-line syntax
   Files
   Tables and Columns
   Using the Database

Command-line syntax

vcdbload {-config config-file | -VC VC-path }
         -database database [-Server Server] [-User User] [-Password Pwd]
         [-progress n] [-resume]
‑config Config-file that determines which version-control directories that VCDBLOAD is to load. With VCDBLOAD, any version specification in the config-file is ignored and VCDBLOAD always loads the most recently checked-in version of each file. You must specify one of ‑config and ‑VC. You cannot specify both,
‑VC Specifies a VC-path for a singular version-control directory for VCDBLOAD to load from.
‑database Database where to load the data. If the database does not exist, VCDBLOAD creates the database. However, you are recommended to create the database in advance and set an appropriate original size for the data and log file as VCDBLOAD creates the database with the default settings. It is not recommendable to provide a database that is used for other purposes. Note that the database cannot be tempdb (fulltext is not supported in tempdb).
  ‑database is mandatory.
‑Server Server where database resides. Default is the default instance on the local server.
‑User Which user to connect as. Default is Windows authentication.
‑Password Password when ‑User is specified.
‑progress Prints a progress message after the processing of each n files.
‑resume Continues an interrupted load. Or more precisely, it starts from the beginning, but only loads files not already loaded.

Files

VCDBLOAD loads files specified by ‑config and ‑VC with two exceptions:

These are the same exceptions as for SSGREP and SSREPLACE.

Tables and Columns

VCDBLOAD creates two tables:

 CREATE TABLE dbo.files
     (fileid   int IDENTITY PRIMARY KEY,
      name     nvarchar(400) COLLATE Latin1_General_CI_AS NOT NULL UNIQUE,
      isbinary bit          NOT NULL,
      regdate  datetime2(3) NOT NULL DEFAULT getdate())

CREATE TABLE dbo.lines
     (ident     int  NOT NULL IDENTITY,
      fileid    int  NOT NULL REFERENCES $TBL_FILES,
      linenum   int  NOT NULL,
      codeonly  nvarchar(4000) COLLATE Latin1_General_CS_AS NULL,
      alltext   nvarchar(4000) COLLATE Latin1_General_CI_AS NOT NULL,
      CONSTRAINT pk_lines PRIMARY KEY (ident),
      UNIQUE (fileid, linenum)
)

The table files holds the name of all files loaded with an id for the files. The column isbinary tracks if the file is binary. Binary files are not stored in lines, but this column can help you to track whether files have been marked as binary in source control by mistake.

The table lines holds the source code, with each line in the file on a separate row. linenum is the line number in the file. The contents is stored in two columns: codeonly is the code without comments, for languages where AbaPerls knows how strip out comments. The column alltext is the line as-is, that is, with comments retained.

VCDBLOAD also creates a fulltext catalog ft_catalog, a fulltext index on the columns codeonly and alltext. Finally, VCDBLOAD also creates a table type and a stored procedure, but that is only for its own internal use.

When you run VCDBLOAD against a database with existing data, it drops and recreates all objects, save the full-text catalog, unless you specify the ‑resume option.

Note SQL Server builds the fulltext index asynchronously, and it may take some time before the fulltext is fully usable. You can use this query to check:

SELECT * FROM sys.fulltext_indexes

As long as the column has_crawl_completed is 0 and the column_end_completed is NULL, the index is still under construction.

Using the Database

To use the database efficiently, you need to use the full-text predicates CONTAINS and FREETEXT. For a complete treatise on what is possible to do with CONTAINS and FREETEXT, please see Books Online. Here follows only a basic introduction (which only looks at CONTAINS):

A typical query would look like this:

SELECT f.name, l.linenum, l.alltext
FROM   files f
JOIN   lines l ON f.fileid = l.fileid
WHERE  CONTAINS(l.codeonly, 'ins_reactivate_sp')
ORDER BY f.name

The query joins files and lines, since you probably want to see the file name. The query also returns the line number, and the complete source line with comments included. However, the CONTAINS predicate searches codeonly to only get hits in the actual source code. In this example the search is for occurrances of the stored procedure ins_reactivate_sp.

If you want to search for more than one string, you can use OR:

SELECT f.name, l.linenum, l.alltext
FROM   files f
JOIN   lines l ON f.fileid = l.fileid
WHERE  CONTAINS(l.codeonly, 'ins_reactivate_sp OR secbas_acc_reactivate_sp')
ORDER BY f.name

If you want to search for a sequence of words, you need to enclose the string in double quotes:

SELECT f.name, l.linenum, l.alltext
FROM   files f
JOIN   lines l ON f.fileid = l.fileid
WHERE  CONTAINS(l.codeonly, '"UPDATE instruments"')
ORDER BY f.name

Overall it is never wrong to include the search term in double quotes. Note that when you use OR, you have two search terms, so the correct use in the example with OR above is

WHERE  CONTAINS(l.codeonly, '"ins_reactivate_sp" OR "secbas_acc_reactivate_sp"')

With fulltext, the wildcard is *. This is a case where you need to use double quotes:

SELECT f.name, l.linenum, l.alltext
FROM   files f
JOIN   lines l ON f.fileid = l.fileid
WHERE  CONTAINS(l.codeonly, '"ins_reactivate*"')
ORDER BY f.name

Note that in difference to LIKE, * does not work well at the beginning of the string. This will not find occurrances of ins_reactivate_sp:

SELECT f.name, l.linenum, l.alltext
FROM   files f
JOIN   lines l ON f.fileid = l.fileid
WHERE  CONTAINS(l.codeonly, '"*reactivate*"')
ORDER BY f.name

This is due to the fact that full-text extract whole words from the corpus and builds an index on them.

One thing to keep in mind is that full-text is always case-insensitive. If this gives too many false positive, for instance hits in string literals, add a LIKE clause:

SELECT f.name, l.linenum, l.alltext
FROM   files f
JOIN   lines l ON f.fileid = l.fileid
WHERE  CONTAINS(l.codeonly, 'instruments')
  AND  l.codeonly LIKE '%instruments%'
ORDER BY f.name

This works because the column codeonly has a case-sensitive collation. Note that alltext is case-insensitive.

Full-text is designed for search text, not source code, and therefore handles inflexions. For instance, a search on goose may yield a hit on geese. To avoid this as much as possible, the fullext index for codeonly is set to language-neutral. However, for alltext, the language is assumed to be US English. If you get false positives, use LIKE as above.

Another potential problem is word-breaking. The string job_update_sp is one single keyword for full-text, but the string job-update-sp falls into three. The same applies to strings with $ and other characters in them. This can have some unexpected effects. For instance above we had this example:

WHERE  CONTAINS(l.codeonly, '"UPDATE instruments"')

This will also give hits for UPDATE #instruments and UPDATE @instruments. Such false positives can be more difficult to filter out with LIKE.