◎AI

prototype

AI Video Depth Estimation

CLI Tool

CLI tool that generates depth-map videos from any footage using transformer-based monocular depth estimation, with grayscale + heatmap export modes.

Python

OpenCV

PyTorch

Transformers

PIL

tqdm

Person in camouflage gear hiking through a forest with bare trees and a rugged landscape.

Apple Depth Pro map of a man in the woods

Command-Line Interface + Export Settings

A CLI wrapper lets you run the tool from terminal with a video input, model size selection, and output styling (grayscale vs heatmap). It also auto-creates an exports/ folder and generates a smart default filename.

Python

1
def parse_args():
2
    parser = argparse.ArgumentParser(
3
        description="AI Video Depth Estimation using Depth Anything V2"
4
    )
5
    parser.add_argument("input", type=str, help="Path to input video file")
6
    parser.add_argument("--model", type=str, default=DEFAULT_MODEL,
7
                        choices=list(MODEL_CONFIGS.keys()))
8
    parser.add_argument("--output", type=str, help="Path to output video file")
9
 
10
    group = parser.add_mutually_exclusive_group()
11
    group.add_argument("--viz", action="store_true",
12
                       help="Output colorized depth heatmap (Inferno)")
13
    group.add_argument("--grayscale", action="store_true", default=True,
14
                       help="Output raw grayscale depth (default)")
15
 
16
    parser.add_argument("--spectral", action="store_true",
17
                        help="Use Spectral-like colormap (requires --viz)")
18
    return parser.parse_args()

Video tools interface showing depth estimation options and input file selection for 3D mapping.

Device Selection + Model Loading (Hugging Face Transformers)

The script automatically chooses CUDA if available, then loads both:

an AutoImageProcessor for pre-processing frames
an AutoModelForDepthEstimation for inference

Python

1
device = "cuda" if torch.cuda.is_available() else "cpu"
2
print(f"Using device: {device}")
3
 
4
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
5
model = AutoModelForDepthEstimation.from_pretrained(model_id, trust_remote_code=True)
6
 
7
model.to(device)
8
model.eval()

Video Reader/Writer Pipeline (OpenCV)

The tool opens the source video with OpenCV, reads its properties, then creates an output writer that matches the input resolution + FPS.

Python

1
cap = cv2.VideoCapture(str(input_path))
2
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
3
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
4
fps = cap.get(cv2.CAP_PROP_FPS)
5
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
6
 
7
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
8
out = cv2.VideoWriter(str(output_path), fourcc, fps, (w, h))

Per-Frame Inference (Frame to Depth)

Each frame is:

read from OpenCV (BGR)
converted to RGB and wrapped as a PIL Image
processed into tensors for the model
inferred with torch.no_grad() for speed

Python

1
ret, frame = cap.read()
2
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
3
image = Image.fromarray(frame_rgb)
4
 
5
inputs = processor(images=image, return_tensors="pt")
6
inputs = {k: v.to(device) for k, v in inputs.items()}
7
 
8
with torch.no_grad():
9
    outputs = model(**inputs)
10
    predicted_depth = outputs.predicted_depth

Depth Resizing + Normalization

The depth output is resized back to the original frame dimensions using bicubic interpolation, then normalized into an 8-bit range (0–255) for video encoding.

Python

1
prediction = torch.nn.functional.interpolate(
2
    predicted_depth.unsqueeze(1),
3
    size=(h, w),
4
    mode="bicubic",
5
    align_corners=False,
6
)
7
 
8
depth_output = prediction.squeeze().cpu().numpy()
9
 
10
# Normalize to 0–255 for display/video output
11
d_min, d_max = depth_output.min(), depth_output.max()
12
norm = (depth_output - d_min) / (d_max - d_min + 1e-6)
13
depth_uint8 = (norm * 255).astype(np.uint8)

Interested in this experiment?

Let's discuss collaboration

AI Video Depth Estimation

Command-Line Interface + Export Settings

Device Selection + Model Loading (Hugging Face Transformers)

Video Reader/Writer Pipeline (OpenCV)

Per-Frame Inference (Frame to Depth)

Depth Resizing + Normalization

Contact

Email

Phone

AI Video Depth Estimation

Command-Line Interface + Export Settings

Device Selection + Model Loading (Hugging Face Transformers)

Video Reader/Writer Pipeline (OpenCV)

Per-Frame Inference (Frame to Depth)

Depth Resizing + Normalization

.css-188onz3{font-size:var(--chakra-font-sizes-7xl);}@media screen and (min-width: 48rem){.css-188onz3{font-size:var(--chakra-font-sizes-10xl);}}@media screen and (min-width: 80rem){.css-188onz3{font-size:var(--chakra-font-sizes-10xl);}}Contact

Email

Phone

Contact