
AI

prototype

AI Video Depth Estimation

CLI Tool

CLI tool that generates depth-map videos from any footage using transformer-based monocular depth estimation, with grayscale + heatmap export modes.

Python
OpenCV
PyTorch
Transformers
PIL
tqdm
Apple Depth Pro depth map of a person in camouflage gear hiking through a forest with bare trees.

Apple Depth Pro Map

Spectral heatmap of the same scene.

Spectral Heatmap - Apple Depth Pro

Command-Line Interface + Export Settings

A CLI wrapper lets you run the tool from the terminal with a video input, a model size selection, and output styling (grayscale vs heatmap). It also auto-creates an exports/ folder and generates a smart default filename.

Python

```python
def parse_args():
    parser = argparse.ArgumentParser(
        description="AI Video Depth Estimation using Depth Anything V2"
    )
    parser.add_argument("input", type=str, help="Path to input video file")
    parser.add_argument("--model", type=str, default=DEFAULT_MODEL,
                        choices=list(MODEL_CONFIGS.keys()))
    parser.add_argument("--output", type=str, help="Path to output video file")

    group = parser.add_mutually_exclusive_group()
    group.add_argument("--viz", action="store_true",
                       help="Output colorized depth heatmap (Inferno)")
    group.add_argument("--grayscale", action="store_true", default=True,
                       help="Output raw grayscale depth (default)")

    parser.add_argument("--spectral", action="store_true",
                        help="Use Spectral-like colormap (requires --viz)")
    return parser.parse_args()
```
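The auto-created exports/ folder and "smart default filename" mentioned above aren't shown in the snippet. A minimal sketch of how that could work, assuming a hypothetical default_output_path helper and a <stem>_<model>_<style>.mp4 naming scheme:

```python
from pathlib import Path

def default_output_path(input_path: str, model_name: str, viz: bool = False) -> Path:
    """Build exports/<stem>_<model>_<style>.mp4, creating exports/ if needed."""
    exports = Path("exports")
    exports.mkdir(exist_ok=True)  # auto-create the exports/ folder
    style = "heatmap" if viz else "depth"
    return exports / f"{Path(input_path).stem}_{model_name}_{style}.mp4"
```

The actual tool may use a different naming scheme; the point is deriving the output name from the input stem plus the chosen model and style, so --output stays optional.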
Video tools interface showing depth-estimation options, including input video file selection and model size.

Device Selection + Model Loading (Hugging Face Transformers)

The script automatically chooses CUDA if available, then loads both:

  • an AutoImageProcessor for pre-processing frames
  • an AutoModelForDepthEstimation for inference
Python

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForDepthEstimation.from_pretrained(model_id, trust_remote_code=True)

model.to(device)
model.eval()
```

Video Reader/Writer Pipeline (OpenCV)

The tool opens the source video with OpenCV, reads its properties, then creates an output writer that matches the input resolution + FPS.

Python

```python
cap = cv2.VideoCapture(str(input_path))
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter(str(output_path), fourcc, fps, (w, h))
```

Per-Frame Inference (Frame to Depth)

Each frame is:

  1. read from OpenCV (BGR)
  2. converted to RGB and wrapped as a PIL Image
  3. processed into tensors for the model
  4. run through the model under torch.no_grad(), which skips gradient tracking to save memory and time
Python

```python
ret, frame = cap.read()
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
image = Image.fromarray(frame_rgb)

inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
predicted_depth = outputs.predicted_depth
```

Depth Resizing + Normalization

The depth output is resized back to the original frame dimensions using bicubic interpolation, then normalized into an 8-bit range (0–255) for video encoding.

Python

```python
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=(h, w),
    mode="bicubic",
    align_corners=False,
)

depth_output = prediction.squeeze().cpu().numpy()

# Normalize to 0-255 for display/video output
d_min, d_max = depth_output.min(), depth_output.max()
norm = (depth_output - d_min) / (d_max - d_min + 1e-6)
depth_uint8 = (norm * 255).astype(np.uint8)
```

Interested in this experiment?

Let's discuss collaboration

Contact

Paul focuses on the intersection of technology and creativity. He is a partner at Epicly AI, an AI-powered platform for creating engaging digital content, and Creative Director at Omnislash Visual, a visual branding and design studio.

Email

driventocreatepodcast@gmail.com

Phone

203-693-1275